Build a Web Scraper (super simple!)
- January 12, 2024
- Posted by: MainInstructor
- Category: BASIC Go JavaScript Node
Video Title: Build a Web Scraper (super simple!)
Hello friends on the internet. Today I want to show you how easy it is to build a web scraper in under 20 lines of code, and not only that, but also how to adapt the web scraper in order to scrape whatever you need from a web page.
I will be building this project using JavaScript, Node.js and Express, as well as two packages called axios and cheerio. I will be doing this with a beginner’s mindset, so if you don’t know anything about Node or Express, please do not be worried: I will be taking you through everything step by step and explaining everything we are doing along the way. My aim for this video is to make it accessible to as many of you as possible. A basic understanding of JavaScript is advised but not a hard prerequisite, as I am giving you my full permission to take the 20 lines of code, so just copy and paste them and use them as you wish, after of course understanding what the code does by watching this tutorial.
But before we get started, what exactly is web scraping and what is it useful for? Web scraping refers to the extraction of data from a website quickly and accurately. Imagine, for example, you are working at a company that has asked you to make a list of all the companies working at a particular trade show, and not only that, but their contact names and email addresses. Well, most people would probably open up the website of the trade show and start writing down the first company, starting at A, then the name, and then the email associated with that company, and then move on to the next one, and so on and so on. It could literally take you days to get all the details that you need, and most likely some spelling mistakes would be made. With web scraping you can have all that information in seconds. Many people move on to selling their web scraping tools for money, either by building them as a Chrome extension or API, or by selling them to data capturing companies, so the option to make money off this tool is there for you too.
Okay, so now that we understand what a web scraper is and what it can be used for, it’s time to get building one. So here we are. I’m just going to create a blank project using WebStorm; please feel free to use whatever code editor or IDE you wish and just create an empty directory. I’m going to click here and call this web scraper, just like so, so that we can start completely from scratch. As you can see, here is my directory, and there are currently no files in it.
Before we get going, I just want to make sure that everyone watching has Node.js installed on their machine. Node.js is essentially an open source server environment, and we will be using it to create our own server, or in other words our own backend. It’s free and allows us to use the JavaScript language in order to create it, so I am a big fan. I’m just going to head over to the Node.js site now. I am using a Mac, so I would of course click here in order to download this onto my computer; however, here are all the other options you have, including the source code, so please go ahead and choose the one that you need. I already have this downloaded, so I’m not going to click here, but please go ahead and click whichever version or option is required for you.
Okay, great, now let’s carry on. Back in our project, it’s time to get coding. The first thing I’m going to do is open up my terminal right here and type a command. The command is npm init.
This will trigger initialization and spin up a package.json file. We are creating a package.json file so that we can install packages, or modules, into our project to use. If you want to have a look at all the packages that are available to us, please go ahead and visit npmjs.com.
So here are all the packages at our disposal. If you go ahead and type one, axios, and click it, you get all the information on how to install it, as well as how many weekly downloads it gets. So there we go: you can literally search through all the packages that are available to you right here on this registry. As a general rule, any project that uses Node.js, as ours will, needs to have a package.json file, so let’s go ahead and create one. I’m just going to press Enter and these prompts will be shown.
Now I’m just going to go through them and press Enter. Version 1 is fine, so Enter. The description I’m going to leave blank. The entry point is index.js; that is fine. Then I’m just going to leave all these blank, like so, and confirm. So there we have it. Now if we go into here, you will see that a package.json file has been generated for us based on the answers that we just gave. Once again, here is our web scraper, the version is one because this is the first version of the app that we are building, the description we left blank, and the main file that we are going to be reading is index.js. So let’s go ahead and create that index.js file. I’m just going to create it like so, and there we go.
The package.json file actually does a lot more than just hold our packages and the versions of them that we need, so if you’d like to know more about it, please pause here and google a beginner’s guide to using npm. But for now, let’s carry on.
Wonderful. Now that we have that, let’s get to installing some packages. The first package that we are going to need is called Express. Express is essentially a backend framework for Node.js. We’re going to install it in order to listen to paths and listen out to our port to make sure that everything is working. What I mean by this is that if we visit a certain path or URL, it will execute some code, and it will listen out to the port that we define. But enough talking, let me show you how. As I said, the package that we need is called express, so I’m just going to show you it on here.
Let’s search for the package express, and it will give us the instructions on how to install it. I’m just going to copy that, go back to my project and whack the command in here: npm i express. The i is essentially shorthand for install. I’m going to press Enter and wait for that to install as a dependency of my project. That is now done, and you should see a dependency show up here. There we go: express is our first dependency, and it has shown up here with a version.
Now, what is quite important for you to know is that if this project is not working for you for any reason, it could be, it doesn’t have to be, but it could be because of the version. If that is the case, make sure to delete whatever’s in here, write the version that I am using, and install the package again by running npm i for short. That will reinstall the package and will generate a package-lock.json file. As you can see here, this file has been generated since we installed the dependency, and if we look in it we will find the express package.
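To make the version advice above a little more concrete, here is a minimal sketch of one way to print the dependency versions recorded in a package.json so they can be compared against the ones shown in the video. It is not part of the walkthrough: listDependencies is a hypothetical helper name, and the sketch relies only on Node’s built-in fs module.

```javascript
const fs = require('fs');

// Return "name@range" strings for every entry under "dependencies".
// packageJsonPath is the path to a package.json file (assumed to be valid JSON).
function listDependencies(packageJsonPath) {
  const pkg = JSON.parse(fs.readFileSync(packageJsonPath, 'utf8'));
  return Object.entries(pkg.dependencies || {}).map(
    ([name, range]) => `${name}@${range}`
  );
}

// Print the dependencies of the package.json in the current directory, if there is one.
if (fs.existsSync('package.json')) {
  listDependencies('package.json').forEach((line) => console.log(line));
}
```

Run from the project directory, this should print one line per installed dependency, which makes a version mismatch easy to spot.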
So I’m just going to find that in here by typing express, and there we go: you will see the version as well as which registry it has been installed from. Wonderful. Another reason that this project might not be working is that the Node version that you have installed could be incompatible.
To check your Node version (I’m just going to press Command-K to clear this down here), all you have to do is type node -v and make sure that it’s the same as mine. Now, if you want to change the Node version, you can do so. It will require some extra configuration, and you can use the nvm command to essentially install different versions. I’m going to show you how to do this; it might not work for you if you haven’t configured your computer correctly, but essentially you can install a certain version onto your computer. I can install version 0.10.31, for example, and press Enter. So now I’m essentially installing this version as well as having the version I already had. Once that has loaded, I’m going to show you how to use that version, so let’s just wait for that to finish.
I can use that version by typing nvm use and then the version right here, even though by default it has now switched to this version. So instead I’m going to run nvm use to switch back to the Node version that we had installed, and there we go: we are now using Node version 14.7.6. Wonderful.
So those are two reasons that your project might not work. If you are watching this in the future, perhaps newer versions of Express or newer versions of Node have come out that have made something break. That is just something you need to know. It is a bit of knowledge that is not only applicable to this project but in general is applicable to many projects that you will come across as a developer.
Okay, so we now have the package express. As a reminder, Express is a backend framework for Node.js. Now, another package that we need to use (I’m just going to clear this again) is a package called cheerio. Once again, I’m just going to go here and search for the package cheerio, and there we go. Cheerio is a package that we will be using to essentially pick out HTML elements on a web page. It works by parsing markup and provides an API for traversing and manipulating the resulting data structure. Cheerio’s selector implementation is nearly identical to jQuery’s, so if you know jQuery, this might be familiar to you.
So now that we know what we will be using this for, let’s get to using it to pick out elements from a web page, and we’re going to be doing that from this web page right here. Let’s go ahead and install it. I’m simply going to copy this and, in WebStorm, install the package cheerio just like we did with express. Once again it should appear in our dependencies right here. So here we go: there is cheerio and the version of cheerio that we installed. Wonderful. We have one more package to install, and that is axios.
So once again let’s go in here and find axios. Axios is a promise-based HTTP client for the browser and Node.js. Axios essentially makes it easy to send HTTP requests to REST endpoints and perform CRUD operations. This means that we can use it to get, post, put and delete data. It is a very popular package and one that I use quite a lot as a developer on a day-to-day basis. So once again let’s install it; I’m going to show you how to use it in a bit. I’m just going to put that in here and wait for that to install as a dependency. Okay, wonderful. There we have all three of the packages that we’re going to need for this project.
Now that we have that, I’m just going to do one more thing, and that is write a script. I’m going to get rid of this one because we’re not going to need it, and I’m going to write a start script, so that if I use the command npm run and then start, as that is what I have called the script, it will essentially run nodemon index.js and listen out for changes to the index.js file. That is what nodemon does: it listens out for any changes made to our index.js file. So that is the setup for our package.json file now done. Please feel free to take this from the source code that I have shared with you. Hopefully you understand what all of this means for now and exactly what we need to get going.
So now let’s head over to our index.js file. The first thing that I’m going to do is actually use all the packages that we have just installed. If we go to the documentation, you will see that the first thing we need to do in order to use these packages is to require them in the index.js file. So I’m just going to copy that line and paste it in here like so, and I’m actually going to do it for all the packages. We’ve got axios; we also have cheerio, and the package is again called cheerio; and then we also have the package express. So there we go, there are all three of the packages that we need.
Now the next thing that I’m going to do is actually initialize Express. What I’m doing here is essentially getting the package, with all the wonderfulness that comes with it, and storing it as express. But we need to actually call express in order to release all this wonderfulness. I can do so by grabbing express and calling it, and now that we have called it, let’s save that as something else. I’m going to call it const app; you can call it whatever you wish. Express essentially comes with great stuff like use, get or listen, and because we’ve saved it all under app, I’m going to use app.listen to listen out to a port that we decide. Let’s decide that our port is going to be const PORT = 8000. We are saying that we want to listen out to port 8000 to see if any changes are made, and essentially we want our server to run on port 8000. Again, this can be whatever port you wish; that is totally up to you. So I’m going to listen out to port 8000.
What the syntax for this looks like is this: app.listen, and then I’m going to pass through the port and a callback. If this is working, I want it to say server running, because this is my server, on PORT, and then pass through whatever port we defined up here. So this is looking good: server running on PORT. Let’s get to starting our app to see if this has worked. All I’m going to do is use this script, and the script is npm run and then, as I’ve chosen to call it, start. So there we go.
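The idea of listening on a port and running some code when a path is visited can be tried without installing anything. The sketch below deliberately swaps Express for Node’s built-in http module, so it is not the code from the video; startServer is a hypothetical helper name, and the response text is made up for illustration. In the walkthrough itself the equivalent step is app.listen with the port and a callback.

```javascript
const http = require('http');

// Start a server on `port` and resolve with it once it is listening.
function startServer(port) {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => {
      // Run some code depending on the path that was visited.
      if (req.url === '/') {
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('hello from the scraper server');
      } else {
        res.writeHead(404, { 'Content-Type': 'text/plain' });
        res.end('not found');
      }
    });
    server.once('error', reject); // for example, the port is already in use
    server.listen(port, () => {
      console.log(`server running on PORT ${server.address().port}`);
      resolve(server);
    });
  });
}
```

Calling startServer(8000) would mirror the port chosen above and keep the process running until it is stopped; passing 0 asks the operating system for any free port.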
And wonderful, our server is indeed running on port 8000, and nodemon will essentially listen out for any changes we make to this file. So if I make a change to this file, let’s just go ahead and call this bob and call this bob, for example, and click save, it will restart due to changes and start again by running node index.js, and then we get the message server running on PORT 8000. Let’s change that back to app just to make things more readable and carry on. So great, that is step one. Now, step two.
Let’s get to actually doing some scraping. To do this I am going to start using some packages, and the first package I’m going to use is axios. Axios works by passing through a URL: it visits the URL, and then I get the response from it. In this case I’m going to get the response data and save it as some HTML that we can work with. So let’s pass through the URL that we want to work with. We know that this is The Guardian, so I’m just going to copy that and paste it in here like so. We can of course make this much more readable, so I’m going to save this string as a const url, as I don’t plan on it changing, and then just pass through the url, just like so.
Okay, so now that we’ve passed through that URL, I’m going to do some chaining. If you don’t know much about chaining, I do have an asynchronous JavaScript miniseries that I really do recommend you watch; for now, just please carry on coding along with me anyway. So this will return a promise, and once that promise has resolved, then we get the response of whatever’s come back. So response, and then we’re going to get the response data and save this as html. You can call this whatever you wish. Now if I console.log html and click save, you will see all this HTML come back to me. This is essentially the HTML from the Guardian home page. You will see it here: Guardian, all Guardian-related stuff.
So this is great, but how do we start picking out certain elements? Like, what if I want to pick out this button, for example? Well, we do so with cheerio, so let’s go ahead and do that. I’m just going to delete this for now and I’m going to use cheerio, the package we just installed. It comes with something called load that will allow us to pass through the HTML, so all of this, and then we’re going to save it as, let’s just do a dollar sign. So there we go: now whenever we use the dollar sign, we’re essentially using all of this HTML. I’m going to use the dollar sign to look through all of the HTML elements and look for something. Let’s go ahead and see what we want to pick out, so I’m just going to inspect this page. If we want to pick out, for example, all the titles in here, so each of the articles’ titles and perhaps the URL that comes with them, let’s go ahead and inspect something. Let’s inspect this one. We could look for something that has the, uh... maybe not this one.
Maybe let’s make it bigger to have a better view of what we can and can’t use. For example, if we inspect this h3 tag right here, we can see that it has the class of fc-item__title. Let’s go ahead and use that, because in it we also see that this has an a tag with an href, which is a URL. So I’m just going to copy this as the class name that we want to look out for, and paste it like so, making sure to put a dot in front of it, as we are looking for a class name. That is what we are looking for in the HTML, so don’t forget to put that dot; that is the syntax that you need.
And for each item that you find like this, well, what do I want to happen? Let’s write a function. This is a callback function, and for each item that we find that has the class fc-item__title, I want to get that item, and the syntax for doing so is this. I want to grab its text. We know this is an h3 tag, so it will have some text; if you have a look in here, there we go, there is some text, and that is what we are grabbing. I also want to grab the href. I can do so, once again, by grabbing this and getting the attribute of href that exists inside it. If I want to be more precise, and I think that might be a good thing to do, I can also find the a tag that exists in that item and then get the attribute of href from it.
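Why the callback above is written as a regular function, and not an arrow function, comes down to how this is set. Cheerio’s each, like jQuery’s, calls the callback with this pointing at the current element and also passes the index and the element as arguments. The sketch below imitates that behaviour with plain JavaScript so it runs without cheerio; the each helper and the sample items are hypothetical stand-ins, not the real library or real page data.

```javascript
// Minimal stand-in for an `each` helper that, like jQuery-style APIs,
// calls the callback with `this` set to the current item.
function each(items, callback) {
  items.forEach((item, index) => callback.call(item, index, item));
}

// Hypothetical stand-ins for the matched elements.
const items = [
  { title: 'First headline', href: '/first' },
  { title: 'Second headline', href: '/second' },
];

// A regular function gets its own `this`, which `each` sets to the current item.
const fromFunction = [];
each(items, function () {
  fromFunction.push({ title: this.title, url: this.href });
});

// An arrow function ignores the `this` that `each` tries to set,
// so it has to read the item from the arguments instead.
const arrowThisIsItem = [];
const fromArrow = [];
each(items, (index, item) => {
  arrowThisIsItem.push(this === item);
  fromArrow.push({ title: item.title, url: item.href });
});

console.log(fromFunction);
console.log(arrowThisIsItem); // [ false, false ]
console.log(fromArrow);
```

With the real library, the same two options would apply: keep the function keyword and use this, or switch to an arrow function and use the element argument that each passes in.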
Okay, so there we go, that is the syntax for doing so. Let’s save this as title, and let’s save this as the url that we are looking for. So for each element that we are finding, we’re getting a title and we’re getting the url. Now I’m actually going to create an array. Where shall we create this array? Let’s just create it up here: const articles and an empty array. Now for each item that we find, I want to get the title, I want to get the url, and I’m going to get the articles array, which is currently empty, and use a JavaScript method called push to push something into it. I’m going to create an object, and this object is going to have the title that we just picked out and the url. That’s all we really need to do.
The next thing I’m going to do, just to show you this is working, is console.log the articles, just like so. And just for good measure, we’re going to catch any errors. This is how you catch errors: catch, err, console.log err. Okay, great, so now let’s check it out. I’m just going to save that and let’s see what comes back. There we go: we are indeed getting the array back. We have literally scraped the web page, and here are the results of our scrape. We are getting back the title and the URL of all the articles that exist on the Guardian homepage, and there are a lot. So there we go.
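The request, then, catch flow described above can be sketched without any network access. In the hedged example below, getHtml and stubClient are hypothetical names: getHtml accepts any promise-based client that resolves to an object with a data property, which is the shape of response used in the walkthrough with axios, and stubClient is a fake client with made-up URLs and HTML so the sketch runs offline.

```javascript
// Fetch a page with any promise-based client that resolves to { data: html }.
function getHtml(client, url) {
  return client(url).then((response) => response.data);
}

// Hypothetical stub client so the sketch runs without network access.
function stubClient(url) {
  if (url.includes('missing')) {
    return Promise.reject(new Error('request failed with status 404'));
  }
  return Promise.resolve({
    data: '<h3 class="fc-item__title"><a href="/news/1">Headline</a></h3>',
  });
}

// Success: the promise resolves and .then receives the HTML.
getHtml(stubClient, 'https://example.com/')
  .then((html) => console.log(html))
  .catch((err) => console.log(err.message));

// Failure: the promise rejects, .then is skipped and .catch receives the error.
getHtml(stubClient, 'https://example.com/missing')
  .then((html) => console.log(html))
  .catch((err) => console.log(err.message));
```

Passing the client in as a parameter is only a convenience for trying the chain out; in the project itself the client would simply be axios, and the HTML would go on to cheerio for the element picking.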
We have now successfully scraped a web page, and that’s really all there is to it, so hopefully that was easy enough. Again, if you want to just take this code (let’s maybe make it a bit smaller), this is all it is: these are all the lines that you need, along with the setup. You can of course adjust this to scrape whatever you wish. As long as you know what you’re looking for on the web page, you can pick out some elements: you can search for a tags, you can search for h3 tags, you can search for things by class name. It is completely up to you. So hopefully this has helped you in creating your own web scraping app. Please do hit me up if you have any questions, or if you just want to chat, do so in the description below. Thanks very much.
Thanks Ania, it’s working perfectly. May I ask why we can’t use an arrow function inside the each on line 15 of the code, where you call the cheerio ($) function? I tried it and got all undefined, but I cannot wrap my head around the why…
sister used 70% of the vid to set up everything huhu.
anyways, it works at the end, and its what we came here for. Thanks!
thank you it worked but some of the class names and format changed on the example site used so the data came back slightly different
And how do you get the data from Angular-based websites? The text shows {{some_variable}} instead.
Hello>?
how much time am i gonna waste watching these videos
cant even do more than one or two of the games christ
what to do if the response is 403
just what i needed i successfully got src but was stuck at chaining thank you
excellent, complete and simple web scraper, I design a template in 4 minutes to scrape title and url using a tool based on puppeteer: https://youtu.be/rB5BHg0XyKs
So I got to 14:59 in the video and typed in npm run start. I hit Enter and got "sh: nodemon: command not found". Everything worked as you explained up to that part. Is there something additional I must do for "npm run start" to work?
She's look like Danarys Storm Born
Great tutorial. Very well spoken. Very well communicated. Great Job Ania.
Using the old fashioned way of xmlhttp and a little regular expression magic I can scrape and parse any content type with javascript and even trigger timeout and status error handlers before outputting data to a document object. In php I Use cUrl and preg_match for the same thing. No extra packages to install. It may take a few more lines of code but is much more efficient without all the dependencies.
Right away following the instructions I get this: 'npm' is not recognized as an internal or external command, operable program or batch file. I checked and node.js is installed. I guess that ends this lesson.
Nice tutorial, but there are AI tools now like Kadoa that can do all of this for you. In the time it takes for you to watch this video, you can get an AI scraper up and running.
To anyone coming here now, It works but you need to change the ".fc-…" to ".dcr-12ilguo". The website has changed.
What about websites involving search button? So we can get the best deals or something out of that search query…??
hello from mongolia 'ulaanbaatar ' 'small town apartment'
holy fuck, you are good
great workshop but it won't work anymore… the article class name has changed to ".dcr-12ilguo" as of August 23….
Awesome Video.
Clear Precise and great Audio, easy to follow and listen to.
Thank you so much!
Bro I'm going to make so many bots with this =D
thanks, Ania it is very good and working perfectly.
Thank you, Ania! It worked perfectly. I had no idea how to complete this task. You saved my day and gave me a lot of knowledge and fun too. I send you love from Venezuela, you are a genius! ♥
This is not a scraper, you can just do what she did by viewing source on a web page. In your browser. Proper scraping is totally different .
helpful video, thank you
Is there a way to scrape the entire url and export the entire webpage as a file?
you married to James Gunn?
In my project the url does not come as an actually url like yours.
Interesting, whenever I think of scraper I always think of Python (beautifulsoup).
This is really a very good perspective.
thanks for sharing.