Sometimes there’s no integration for the specific data that you want to show in Home Assistant. Fortunately, there is a HACS integration that allows you to scrape websites and load data from them into custom sensors. In this Home Assistant tutorial, I show you how to scrape the current energy prices from your energy provider’s website and the top three box office movies from the IMDB site, and how to show them on a dashboard in Home Assistant. Let’s do this!
⭐⭐⭐ NOTE: ⭐⭐⭐
This article accompanies a YouTube video. I wrote it for people who would rather read than watch a video. To keep doing this, I want to ask you to check out the video, leave a comment under the video, give the video a thumbs up, and subscribe to my YouTube channel. This means that the video is offered more often to new visitors so that they also stay informed of the latest Home Assistant tutorials.
Thank you for your support!
Ed
Introduction
Today I will show you how you can scrape websites to retrieve data and show this data in Home Assistant. In this tutorial I will show you two examples:
- In the first example, we’re going to scrape the actual energy prices from your energy provider’s website and show these on a dashboard
- And in the second example, we are going to scrape the top 3 box office movies from IMDB and show them in your dashboard.
This tutorial aims to give you an insight into how to set up a scraper yourself, and I use these two examples for that. There are of course numerous use cases that you can think of; I hope these two examples inspire you to design your own. Let me know in the comments which data you would like to get from websites for which there is no integration in Home Assistant yet.
Home Assistant has its own scrape integration, which works quite well, but I am going to use another scrape integration that you can install through HACS. If you haven’t installed HACS yet, please do that first. I have a video that shows how you can do that. The link to that video is in the description below.
The HACS integration is called multiscrape and has a couple of advantages over the standard Home Assistant scrape integration. One of them is that you can schedule the scrape action, which prevents you from “hammering” the website that you scrape the data from and minimizes the risk of getting banned by that site. Another advantage is that you can scrape multiple values from the same webpage in one action. This, too, reduces the risk of getting banned from that site.
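For example, a single scan_interval setting in the scrape configuration controls how often the whole page is fetched, and all sensors under that resource share the one fetch. This is just a sketch with a placeholder URL; we will build the real configuration later in this tutorial:

- resource: https://www.example.com/page.html
  scan_interval: 28800   # fetch the page only once every 8 hours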
Install the Multi Scrape Integration
Let’s install the multiscrape integration first.
- Go to HACS
- Click on Integrations
- Click on Explore & Download repositories
- Search for Multiscrape
- Click on Multiscrape and click Download twice.
- Now your multiscrape integration is installed
This integration supports a lot of functionalities. I will just cover some of them to give you a head start if you want to start with scraping. But, check the documentation of this integration for all the possibilities. The link is in the description below.
Install a Text Editor
To create our scrape scripts, we have to edit some files in YAML. This might sound a bit scary for some of you, but it’s not so bad once you understand how it works. First, make sure that you’ve got a text editor installed in Home Assistant. This can be either File Editor or Studio Code Server. Both can be installed using the add-on store.
- Go to Settings, Add-ons and click on Add-on Store
- Search for File Editor or Studio Code Server and install one of the two.
- I use Studio Code Server myself.
Create the Multiscrape file
We are now going to create the file that will contain our scraper code. I will create a separate file for this and point to this file in the configuration.yaml. This way, the configuration.yaml will stay clean and all scrape code is consolidated in one file.
- Open Studio Code Server
- Open the configuration.yaml in Studio Code Server
- Add the following line: multiscrape: !include multiscrape.yaml and save your configuration.yaml (see the snippet below)
- Now, create a new file multiscrape.yaml.
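After these steps, your configuration.yaml contains this include line, which tells Home Assistant to load all scrape configuration from the separate multiscrape.yaml file:

multiscrape: !include multiscrape.yaml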
How do Scrapers work?
Now that we’ve created these files, we can start creating our scrapers.
Basically, a website consists of HTML code, Cascading Style Sheets (CSS), and JavaScript. A scraper searches for certain elements in the HTML and CSS and retrieves the data on the webpage that is visible between those elements.
A very simple example would be as follows:
<html>
  <body>
    <h1>Subscribe to my channel</h1>
    <p>The value is: 10</p>
  </body>
</html>
A scraper can detect the h1 and the p tag and can retrieve the data within these tags.
For instance, if we want to retrieve the value between the h1 tags, it would look something like this:
sensor:
  - unique_id: title
    name: Title
    select: "h1"
And if we want to retrieve the value between the p tags, it would look something like this:
sensor:
  - unique_id: my_value
    name: My Value
    select: "p"
In this case, the sensor title with the value “Subscribe to my channel” and the sensor my_value with the value “The value is: 10” will be created in Home Assistant.
This is a super simple example and won’t be enough in most cases, but I hope that you get a little bit of an idea about how it works.
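To make this concrete, here is how these two sensors fit together in one complete multiscrape configuration. The resource URL is a placeholder for a page that serves the HTML above:

- resource: https://www.example.com/page.html
  scan_interval: 3600   # fetch the page once per hour
  sensor:
    - unique_id: title
      name: Title
      select: "h1"
    - unique_id: my_value
      name: My Value
      select: "p"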
If we only want to get the value 10 in our sensor, we have to create a value_template in our scraper like so:
sensor:
  - unique_id: my_value
    name: My Value
    select: "p"
    value_template: '{{ value.split(":")[1] }}'
This value_template splits the retrieved string at the colon character and returns the part to the right of the colon. If we wanted to retrieve the part to the left of the colon, we would change the 1 to a 0.
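Applied to our example, the different indexes give the following results. Note that the right-hand part keeps its leading space; Jinja’s trim filter strips it if needed:

# value = "The value is: 10"
value_template: '{{ value.split(":")[0] }}'           # -> "The value is"
value_template: '{{ value.split(":")[1] }}'           # -> " 10"
value_template: '{{ value.split(":")[1] | trim }}'    # -> "10"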
I understand that this might be very challenging for some of you, but I hope you get a bit of an idea about the principles of scraping. It can be very complicated and time-consuming to create your own select and value templates, especially when you have to scrape based on Cascading Style Sheets, or CSS. So, take it slow: start with simple data that you want to scrape and go further from there.
Get the energy prices from my Energy Provider’s website
I am going to scrape the energy prices from my energy provider’s website. Now, this is an energy provider in the Netherlands, but you can follow the same procedure for your own energy provider. The principle of scraping is the same for every website.
The webpage that we are going to scrape is this webpage: https://www.eneco.nl/duurzame-energie/modelcontract/
We start by pointing to this resource like so:
- resource: https://www.eneco.nl/duurzame-energie/modelcontract/
Now, I want these prices every 8 hours, which is 28,800 seconds. That’s why we add the following line:
scan_interval: 28800
I need your help!
You will be doing me a huge favor if you subscribe to my channel if you haven’t already. And, you will help me a lot if you also give this video a thumbs up and leave a comment. This way, YouTube will present this video to new people, making the channel grow! In the video description, you will also find information about how you can sponsor me so that I can continue to make these tutorials for you.
Thank you!
Now we are going to define our sensors. Let’s start with the first price. As you can see, the energy prices are visible somewhere in the middle of the webpage. Now, I am showing you this in Chrome. If you use a different browser, the procedure might be a bit different, but if you use Chrome, click the first price with your right mouse button and select “Inspect”.
You see that the price is between the td tags, so the select part of our scraper should contain this td tag. But, there are more td tags on this page. How do we know what td tag we need exactly?
Well, the td tag is inside another tag, namely the tr tag and the tr tag is within the tbody tag. So, we can start by pointing our scraper to the tbody tag like so:
select: "tbody
Within tbody we are going to point to the first tr tag. To point to the 1st, 2nd, 3rd, etcetera occurrence of a tag, we can use the nth-child selector, which we will need for the other prices later; for the first occurrence, simply naming the tag is enough. So, for the first price, we point to the tr tag like so:
select: "tbody > tr
And within the tr tag we want to point to the second td tag. So, we add the second td tag to our select line like so:
select: "tbody > tr > td:nth-child(2)"
The general idea behind this is, that you define the tags after each other in the same way as they occur in the source of the webpage.
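As a simplified sketch (the real page contains much more markup, and the tariff names and prices here are made up), the price table looks something like this:

<tbody>
  <tr>
    <td>Stroom normaal tarief</td>
    <td>€ 0,30</td>
  </tr>
  <tr>
    <td>Stroom laag tarief</td>
    <td>€ 0,28</td>
  </tr>
</tbody>

The select "tbody > tr > td:nth-child(2)" lands on the second td of the first row: exactly the price we want.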
Now, we want to retrieve only the number of the price and not the currency sign; with the sign included, the value would be a string instead of a number. Let’s filter out this number using a value_template.
The value within the second td tag is a euro sign followed by the price. We only want to retrieve the price and can do this with a value_template like so:
value_template: '{{ value.split("€")[1] }}'
What we are doing here is splitting the value at the euro sign and returning everything to the right of the euro sign. This is done by entering a 1 between the square brackets. If we entered a 0 between the square brackets, it would return the part to the left of the euro sign.
So, our complete scraper code looks like this:
- resource: https://www.eneco.nl/duurzame-energie/modelcontract/
  scan_interval: 28800
  sensor:
    - unique_id: electricity_price_normal
      name: Electricity price normal
      select: "tbody > tr > td:nth-child(2)"
      unit_of_measurement: "€"
      value_template: '{{ value.split("€")[1] }}'
You see that I added some extra fields here:
- The unique_id field, which uniquely identifies our sensor in Home Assistant
- The name which will be the friendly name in Home Assistant that you will see on dashboards
- And the unit_of_measurement, which will also be shown on the Home Assistant dashboards.
We do the same for our other prices. As you can see, the code for the other sensors looks a lot like the code of the first sensor. The only difference for each price is that we refer to a different tr: the nth-child of the tr gets a higher number for each price. The td within the tr is always the second td, so the nth-child of the td is always 2, as sketched below.
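Here is a sketch of what those additional sensors could look like, appended under the same sensor key. The sensor names and row numbers are illustrative; inspect the page to find the right rows:

    - unique_id: electricity_price_low
      name: Electricity price low
      select: "tbody > tr:nth-child(2) > td:nth-child(2)"
      unit_of_measurement: "€"
      value_template: '{{ value.split("€")[1] }}'
    - unique_id: gas_price
      name: Gas price
      select: "tbody > tr:nth-child(3) > td:nth-child(2)"
      unit_of_measurement: "€"
      value_template: '{{ value.split("€")[1] }}'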
Let’s test the Energy Price Sensors!
After we’ve defined all the sensors, we save this file and have to restart Home Assistant to bring these sensors to life. Go to Developer Tools and within the YAML tab, click on Check Configuration to see if your code is correct. Click on Restart if the code is correct.
After Home Assistant is restarted, go to the States tab, look up your newly created sensors, and check that they have received their values.
This is awesome! You’ve now created your own sensors in Home Assistant and they get their values from a webpage. We can show these values on a dashboard now. I’ve created this dashboard that shows these values.
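As a minimal sketch, such a dashboard can be built with an entities card like the one below. The entity ids are assumed to follow from the sensor names defined above; my complete dashboard code is in the download:

type: entities
title: Energy Prices
entities:
  - sensor.electricity_price_normal
  - sensor.electricity_price_low
  - sensor.gas_price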
You can download the scraper and dashboard code that I’ve created for this video via the ko-fi link in the description below. Downloading is free, but if it is worth something to you, you can also enter an amount there to support my work. Please consider sponsoring me if my work saves you time. This way you support me so that I can continue to make these videos for you.
The Second Use Case: Retrieve the Top Three Box Office Movies from IMDB
Let’s head over to the second use case. This one is a bit more complicated to show you how you can create scrapers using CSS elements. In this use case, I want to retrieve the top three Box Office movies of last weekend from IMDB.
The Box Office movies are visible on this page: https://m.imdb.com/chart/boxoffice/. I don’t really have to refresh these values often, but let’s say that we will scrape this page every 24 hours.
We will give the sensor the id top_box_office_1 and the name Top Box Office 1
So, the first part of our new sensor looks like this:
- resource: https://m.imdb.com/chart/boxoffice/
  scan_interval: 86400
  sensor:
    - unique_id: top_box_office_1
      name: Top Box Office 1
Now, there are multiple ways to retrieve the first movie including the revenue for that movie. I will use one way here that I believe is the easiest to understand.
What we see is that the movies are shown in two columns. Each movie has a Title, a Weekend gross, a Total gross, and Weeks since release. We are going to create a sensor for each movie that will return the title as its state and will have attributes for the Weekend Gross, Total Gross, and Weeks Since Release.
Right click on the first movie title and select Inspect.
The structure of the CSS looks like this:
- chart-content
  - chart-row
    - col-md-6
      - media
        - btn-full
          - media-body
            - h4
The title of the movie is in the h4 tag. So, we add the select for the first movie like so:
- resource: https://m.imdb.com/chart/boxoffice/
  scan_interval: 86400
  sensor:
    - unique_id: top_box_office_1
      name: Top Box Office 1
      select: "#chart-content > .chart-row > .col-md-6 > .media > .btn-full > .media-body > h4"
What you see is that I follow the CSS structure in the select, so that it ends at the point where the data is actually shown.
Get the other values for Top Box Office 1
In this use case, we also want to retrieve three extra values, namely the Weekend Gross, Total Gross, and Weeks Since Release. We can create other Home Assistant entities for this, but I think it’s nicer to store these values in attributes of the Top Box Office 1 entity that we already created.
The Weekend Gross can be found in the following CSS structure:
- chart-content
  - chart-row
    - col-md-6
      - media
        - btn-full
          - media-body
            - p
Also, when we retrieve the value that is within the paragraph (or p) tag, we will get all three values at once. So, we have to use a value_template once more to only retrieve the Weekend Gross number.
Now, we add an attribute like so:
- resource: https://m.imdb.com/chart/boxoffice/
  scan_interval: 86400
  sensor:
    - unique_id: top_box_office_1
      name: Top Box Office 1
      select: "#chart-content > .chart-row > .col-md-6 > .media > .btn-full > .media-body > h4"
      attributes:
        - name: Weekend Gross
          select: "#chart-content > .chart-row > .col-md-6 > .media > .btn-full > .media-body > p"
          value_template: '{{ value.split(":")[1].split("T")[0] }}'
The attribute is called Weekend Gross and the select once again follows the structure of the CSS tags. The value template splits the retrieved value at the colon, returns the part to the right of it, then splits that result again at the T (from Total Gross) and returns the part to the left of that. So, it will be filled with the currency symbol, the number, and the M in this case.
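To make this concrete, suppose the paragraph scrapes to something like the text in the comment below (the exact wording is an assumption; inspect the page for the real format):

# value = "Weekend Gross: $12.3M Total Gross: $45.6M Weeks Since Release: 3"
# value.split(":")[1]               -> " $12.3M Total Gross"
# value.split(":")[1].split("T")[0] -> " $12.3M "
value_template: '{{ value.split(":")[1].split("T")[0] }}'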
You might have to rewind this part of the video a couple of times to understand fully what is happening here, but I’m sure you will get the idea eventually.
We do the same for the other two attributes. In the end, our scraper for Top Box Office 1 looks like this:
- resource: https://m.imdb.com/chart/boxoffice/
  scan_interval: 86400
  sensor:
    - unique_id: top_box_office_1
      name: Top Box Office 1
      select: "#chart-content > .chart-row > .col-md-6 > .media > .btn-full > .media-body > h4"
      attributes:
        - name: Weekend Gross
          select: "#chart-content > .chart-row > .col-md-6 > .media > .btn-full > .media-body > p"
          value_template: '{{ value.split(":")[1].split("T")[0] }}'
        - name: Total Gross
          select: "#chart-content > .chart-row > .col-md-6 > .media > .btn-full > .media-body > p"
          value_template: '{{ value.split(":")[2].split("W")[0] }}'
        - name: Weeks Since Release
          select: "#chart-content > .chart-row > .col-md-6 > .media > .btn-full > .media-body > p"
          value_template: '{{ value.split(":")[3] }}'
Now that we’ve created the scraper for Top Box Office 1, we do the same for Top Box Office 2 and Top Box Office 3.
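As a rough sketch, the sensor for the second movie follows the same pattern with a shifted select. The nth-child position below is a guess, since the movies are spread over two columns; inspect the page to find where the second movie actually sits:

    - unique_id: top_box_office_2
      name: Top Box Office 2
      select: "#chart-content > .chart-row > .col-md-6:nth-child(2) > .media > .btn-full > .media-body > h4"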
You can download the code via the download link in the description below.
Save the file after you’ve added all the scraper sensors and restart Home Assistant to make them come alive.
You can check your newly created sensors in the Developer tools.
- Go to Developer Tools
- Go to the states Tab
- Click on Set State
- and retrieve the entities Top Box Office 1, 2, and 3. You will see that the state is filled with the name of the movie and that the attributes are filled with weekend gross, total gross, and weeks since release (see the template example below).
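You can also read these values in templates, for instance in the Template tab of the Developer Tools. The entity id is assumed to be derived from the sensor name, and the attribute key is assumed to match the name we defined; check the States tab for the exact key:

{{ states('sensor.top_box_office_1') }}
{{ state_attr('sensor.top_box_office_1', 'Weekend Gross') }}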
The Dashboards
I created two dashboards that show the values of these scrapers. The first dashboard shows the energy prices, as you can see here. The second dashboard shows the Box Office movies, as you can see here. The code for these dashboards is in the download file; you can find the link in the description below this video.
Be aware that web pages may change and that your scrapers might fail over time. In that case, you have to alter your scraper so that it still works for the changed web page.
I hope that I managed to explain how scrapers work and that I gave you a head start to set up your own scrapers. Let me know in the comments what use cases you have for scrapers. I want to thank everyone who has supported me in making these videos and tutorials so far. I could never have done this without you. Thank you! You can support me through Patreon, Ko-Fi, or by joining my channel. If you also want to support me, look in the description of this video for the links. With that, you make it possible for me to continue making these videos for you.
Oh, don’t forget to give this video a thumbs up, subscribe to my channel, and hit the notification bell.
I will see you soon!
Bye bye!