Python Tutorial: Web Scraping with BeautifulSoup and Requests
- March 25, 2024
- Posted by: MainInstructor
- Category: BASIC Go JavaScript Python
Hey there, how's it going, everybody? In this video we'll be learning how to scrape websites using the BeautifulSoup library. Now, if you don't know what it means to scrape websites, it basically means parsing the content of a website and pulling out exactly the information that you want. So, for example, maybe you want to pull down some headlines from a news site, or grab some scores from a sports website, or monitor the prices of some items in an online store, or something like that. Now, to show an example of this, let's take a look at the finished product that we'll be
building in this video, and then we'll learn how to build it. So I'm here on my personal website, and on my home page I have a lot of posts of my most recent videos. Every post has a title, which is a big heading tag, then a text summary of the video, and then a link to the video. So let's say we wanted to write a scraper that would go out and grab all of the post titles, summaries, and links to the videos from my home page, and ignore all of the other information. To do this, I have a finished version of what we'll be building in this video. Right now I can just run it with Python; the script is called cms_scrape.py. If I run it, it goes out and scrapes all of the titles, summaries, and links. We can see here we have a title (this is my CSV module video), then the summary text, and then the link text. Now, not only did this scrape the information from the website and print it out in the terminal, it also created a CSV of all of this information. If I open up this cms_scrape.csv, it opens in Numbers, but you could also open it in Excel. It isn't very readable right now, but if I make these columns a little smaller and wrap the text, then we should be able to read it. You can see we have a column with all of the headlines for the articles on that home page, then all of the text summaries, and then a link to each video. So that's what web scraping is: it goes out and pulls down all of the
information that you want from a specific website. Now, if you tried to parse out that information with something you had built in Python yourself, you'd probably run into a lot of issues, but luckily the BeautifulSoup library makes parsing out all of this information a lot easier. We'll also be using the requests library in this video to make our web requests. You could use the built-in urllib module, but the requests library is extremely popular for fetching websites, so we're going to go ahead and use that. So let's get started and see how to do this. First of all, let's make sure we have everything installed that we need. To install BeautifulSoup you can just use the pip install command, so to do
this we can just say pip install, and the package is called beautifulsoup4. You can see that I already had that installed, but if you don't, then yours should just go through the installation at that point. Now, you definitely want to install beautifulsoup4, because there is an older version just called BeautifulSoup; beautifulsoup4 is the one that's most up to date. Once that's installed, we need to make sure we have a parser to parse our HTML. I won't go deep into the details of these parsers, but there are some small differences between them, and they can return different results depending on the HTML you're trying to parse. If you're parsing perfectly formed HTML, those differences aren't going to matter, but if there are mistakes in the HTML, then the different parsers will try to fill in missing information differently. BeautifulSoup has a section in its documentation about the differences between those parsers, and basically it suggests installing and using the lxml parser, so that's what we're going to use in this video. It also says that the html5lib parser uses techniques that are part of the HTML5 standard, so you could use that one too, but most of the time the choice between the parsers isn't going to matter all that much as long as you're working with good HTML. I'll leave a link to the differences between those parsers in the description section below if you want to read more about them. To make sure we have the lxml parser installed, we can install it with pip as well, so we can just say pip install lxml. If we run that, I already have it installed, but yours will install there if you don't. If you want the html5lib parser, you can just do a pip install html5lib. Like I said, we'll be using lxml in this video, but html5lib is popular as well. Now we also need the requests library, and just the same, we can do a pip install requests and run that. You can see that mine's already installed, but if you don't have it, then yours should get pulled down right there. Okay, so now that we have those installed, let me clear that out, and let's take a look at what we can use these for. Now, you don't have to be extremely familiar with HTML in order to scrape websites, but it definitely helps to know it. So basically,
HTML is structured in a way where all of the information is contained within certain tags, and if you're at all familiar with XML, it's very similar to that. Now, I have an extremely basic HTML file open here in my browser, and we can see that this small example just has one big header that says "Test Website", and then we have two large links for articles: one is the "Article 1 Headline" with a small text summary below it, and then we have a big "Article 2 Headline" with a text summary below that, and then we have a footer down at the bottom. Now, this is how browsers display HTML. We're using the Chrome browser right now, but in the background the source code looks a bit different, so I have the source code for this very basic website pulled up over here on the right side of my screen. Let me make this a little smaller and stretch this over so that we can better see how the source is structured. So we have these tags throughout our document, and the opening tags are surrounded by these angle brackets. We have this head tag that opens, and there is also a closing tag down here, which is the same except it has a forward slash after the first angle bracket. So the close of our head tag is this line here, and all of this content is within that head tag. And all of these tags can be nested, so if we want to find our article headlines and article summaries, we can look down here in our body tag. We have an opening body tag, and within the body we have our "Test Website" h1, which is a heading, and then we have a div tag which has a class of article. Within this div we have our h2 tag (h2 is another heading, a subheading), and within that h2 we have a link; these a tags are anchor tags, which are links. So this is the text of the link, "Article 1 Headline"; that's what gets displayed over here on the actual website. And this href is where the link actually points, in this case a page called article_1.html. Now, these classes, like how this div has a class of article, are mainly used for CSS styling, and they can also be used within JavaScript to identify specific elements. Below that heading tag we just have a paragraph tag, which is just a p, and this is the text summary of the article. So we can see that this entire div with the class of article has our h2 heading and then our paragraph for the summary, and then this is just repeated down here: for our second article we have another div with the class of article, and then another h2, but this one has the "Article 2 Headline" and the article 2 link, and then the summary text for article 2. And lastly we have a footer down at the bottom, which is just a div with a class of footer, and that has a paragraph tag within it with some text. Everything else in here is just extra information: we have some scripts, and up at the top we have some style sheets and things like that, but all of this in the body is what gets displayed over here on the website. So let's use this very simple example to see how we can parse out information using BeautifulSoup.
So I'm going to open up a file here called scrape.py. All we have in here so far are our imports for BeautifulSoup and requests: we have "from bs4 import BeautifulSoup", and then we're also importing requests. So let's say we wanted to parse out the article headlines and the summaries from our very simple website over here; in this example, that's just the "Article 1 Headline" and its summary text, and then the "Article 2 Headline" and its summary text. First things first, let's pass our HTML to BeautifulSoup so that we can get a BeautifulSoup object. Now, there are a couple of ways to do this: we can either pass in the HTML as a string, which is what we'll do in a minute when we fetch our website from the internet, or we can pass in an HTML file. In our case we have this sample HTML file within our current directory, so let's just open it up and pass it to BeautifulSoup. To open the file we can just say with open, and the HTML file is called simple.html. It's in the same directory as our script, so we don't have to specify a path, and we're just going to read it in (read is the default mode, so we don't have to do anything there), and I'll say "as html_file". Then, to pass that html_file into BeautifulSoup, we can just say soup = BeautifulSoup, pass in the html_file, and specify our parser; like I said, for this video we're going to use the lxml parser. Now, if working with files is new to you and you want to know more about this "with open" statement and things like that, then I do have a video specifically on working with file objects, and I'll leave a link to that in the description section below. Okay, so now we have this soup variable, which is a BeautifulSoup
object of our parsed HTML. So let's just print this out and see what we get; we can just print out soup. If I save that and run it (let me make this a little bit bigger), we can see that this just prints out all of the HTML, very similar to what we just looked at. Now, this HTML isn't formatted in a very readable way; it's all pushed over to the left. If we actually look at that simple.html file, we can see that it's nicely indented. So, in order to format this so that we can more clearly see which tags are nested within each other, we can use the prettify method to clean this up a bit. If we say soup.prettify (and that is a method, so we have to put in the parentheses), save that, and run it, now we can see that it indents these so that we can see which tags are nested within each other. So here is that head tag we saw before, and everything that is indented within it belongs to that head tag. Okay, so now let's see how to grab information from this HTML. The easiest way to get information from a tag is to just access it like an attribute.
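The steps so far can be sketched like this. Note that the exact contents of simple.html aren't shown in the video, so the inline HTML string below is an approximation of the structure described, and the built-in 'html.parser' is used in place of lxml so the sketch runs without any extra install:

```python
from bs4 import BeautifulSoup

# Approximate stand-in for the simple.html file from the video.
html_doc = """
<html>
<head><title>Test - A Sample Website</title></head>
<body>
  <h1>Test Website</h1>
  <div class="article">
    <h2><a href="article_1.html">Article 1 Headline</a></h2>
    <p>This is a summary of article 1</p>
  </div>
  <div class="article">
    <h2><a href="article_2.html">Article 2 Headline</a></h2>
    <p>This is a summary of article 2</p>
  </div>
  <div class="footer">
    <p>Footer Information</p>
  </div>
</body>
</html>
"""

# The video uses the 'lxml' parser; the built-in 'html.parser'
# behaves the same for well-formed HTML like this.
soup = BeautifulSoup(html_doc, 'html.parser')

# prettify() re-indents the markup so the nesting is easy to see.
print(soup.prettify())
```

In the video the HTML comes from a file handle passed straight to BeautifulSoup instead of an inline string; both work, since the constructor accepts either a string or an open file object.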
If we look here at our HTML, our title should just be "Test - A Sample Website". The easiest way to get that is to access it like an attribute, so I will say match = soup.title, and then we'll just print out that match. If I save that and run it, we can see that it parsed out that title tag. Now, it still has the title tags around the text, so if we only want the text of the title tag, we can access the text attribute of that tag. We can just add that to the end here, so we'll say .title.text; if I save that and run it, now we only get the text of that title tag. Now, searching for a tag like we did here, by accessing it like an attribute with .title, will get the first title tag on the page, but the first tag on the page might not always be what we want. So we can use the find method to do something similar, but it also allows us to pass in some arguments so that we can find the exact tag we're looking for. For example, if I use the dot access to find the first div on the page with soup.div, and save and run that, we can see that it got the first div tag on our page with all of its child tags, which is everything for that first article. But if we wanted to grab the div that has a class of footer, for example, then we'll have to use that find method and pass in some arguments. So we'll do soup.find and search for a div. If I save and run that, we just get the same thing, the first div on the page, but with this find method we can pass in arguments of attributes that narrow down exactly what tag we want to find. For example, I can pass in an argument of class, and after class we need an underscore: class_='footer'. Now, these arguments can match any attributes that your tag might have, and most of the time you can pass in arguments just like they are in the HTML. So if you wanted to match a div with an id of footer, you could just pass in an argument of id='footer'. The reason we need an underscore after class is that class is a special keyword in Python, so BeautifulSoup uses class_ instead; if you were confused about that, that's why. So if we save that and run it, we can see that now we're not getting the first div on the page; we're actually getting the div with the class of footer.
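Those lookups, collected into one runnable sketch (same caveats as before: a hypothetical stand-in HTML string, and the built-in html.parser instead of lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the structure described in the video.
html_doc = """
<html><head><title>Test - A Sample Website</title></head>
<body>
  <div class="article">
    <h2><a href="article_1.html">Article 1 Headline</a></h2>
    <p>This is a summary of article 1</p>
  </div>
  <div class="footer"><p>Footer Information</p></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)       # the whole <title> tag, brackets and all
print(soup.title.text)  # just the text inside the tag
print(soup.div)         # dot access returns only the FIRST matching tag

# find() narrows the search with attribute filters; class needs a
# trailing underscore because class is a reserved word in Python.
footer = soup.find('div', class_='footer')
print(footer.p.text)
```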
Okay, so now let's say we wanted to parse the HTML and get all of the article headlines and summaries from our page. Now, any time you want to get multiple things from a page, a good way to start is to just get one of whatever it is you're trying to parse. So, for example, if I wanted to grab the headline and snippet from each article on our page over here, I'd start by first grabbing that information for one article, and once we have that working, we can apply the same logic to all of our articles. So if we go back to our browser and look at our page: in order to dig down into the HTML and find exactly where our article headline and summary are, within the Chrome browser we can just right-click on whatever it is we want to parse and then click on Inspect. Now, I know this is a little small, so let me make it a bit bigger, and we'll walk through a little bit of how to use this. I'm using Chrome here, but pretty much every major browser has something like this these days, and it's really useful for finding exactly what you want. Within the inspector, if I hover over our div with the class of article, you can see that in the top part it highlights everything that is within that div; if I go down to the h2, it only highlights that h2; if I hover over the href, it highlights that link; and if I hover over the paragraph, it highlights that paragraph. So we can see exactly what is what, and the same with the second article: if I hover over this article, I can see that it has the "Article 2 Headline" and summary text, and I can click on this little arrow to expand it, and it shows me everything within that div: the h2, the anchor tag, and then the paragraph tag with the summary text. So, just like we saw before in the source code, our article headlines are within a div with a class of article, then an h2, then an anchor tag. So let's go ahead and grab that article div. Let me make this a little smaller so that we can see this. I'm going to change this variable name over to article and then print that out. This is going to be the first div with the class of article, so if we save that and run it, we can see that now we have that first article, and we can search that matched tag just like we searched the entire HTML document: we can access child tags with the dot access, or we can use the find method. So, for example, if we wanted to dig down into the text of the headline, we could say headline is equal to... and now we don't want to use the entire soup, which is the whole HTML; we only want to search within this article. So we'll say article.h2, and within that h2 we want to access the anchor tag, which is .a, and then we want the text of that anchor tag, so we can string all of that together: article.h2.a.text. If I print out that headline (and let me comment out printing the entire article for now), save, and run it, we can see that it grabbed the text of that first article's headline. And we can do the same thing with the article summary; it's just a paragraph within our article, so if we go down a couple of lines, we can say summary = article.p.text to grab that paragraph's text. If we print out that summary, save, and run it, we can see that now we have the "Article 1 Headline" text and the text summary of that article as well. Okay, so now we have the code for grabbing a headline and a summary from a single article.
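The single-article version can be sketched like this (hypothetical stand-in markup again, with html.parser in place of lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup for the two-article sample page.
html_doc = """
<body>
  <div class="article">
    <h2><a href="article_1.html">Article 1 Headline</a></h2>
    <p>This is a summary of article 1</p>
  </div>
  <div class="article">
    <h2><a href="article_2.html">Article 2 Headline</a></h2>
    <p>This is a summary of article 2</p>
  </div>
</body>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find() returns only the first div with a class of article.
article = soup.find('div', class_='article')

# A matched tag can be searched the same way as the whole soup.
headline = article.h2.a.text
summary = article.p.text
print(headline)
print(summary)
```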
Now that we have this working for one article, we can reuse this logic to parse the information from all of our articles. Right now we're using the find method to just get the first article, but now we need to loop through all of the articles. To get all of the articles, instead of using find we can use the find_all method. With find_all, instead of returning just the first tag that matches those arguments, it returns a list of all of the tags that match. So instead of just setting this variable, we can loop over the list that it returns: instead of saying article equals, we can create a for loop and say for article in soup.find_all, since this returns a list. Now that we have a for loop, I'm just going to get rid of that print article line, and then I'm going to put the logic for grabbing the headline and summary from an article within that for loop. Now it will loop through all of the articles, which in this case is just the two of them, and we'll get the information for both. And let me also put in one more blank print statement within our loop, so that at the end we have a blank line between our articles. So if I save that and run it, now we can see that we have the "Article 1 Headline" and the summary for that article, and we also have the "Article 2 Headline" and the summary for that article. Okay, so this is good; we're starting to see how this would be useful for getting information from websites. So now let's do something similar, but with an actual website.
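The switch from find to find_all can be sketched like this (same hypothetical two-article markup, html.parser in place of lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup for the two-article sample page.
html_doc = """
<div class="article">
  <h2><a href="article_1.html">Article 1 Headline</a></h2>
  <p>This is a summary of article 1</p>
</div>
<div class="article">
  <h2><a href="article_2.html">Article 2 Headline</a></h2>
  <p>This is a summary of article 2</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all() returns a list of every matching tag,
# so we loop instead of grabbing a single article.
headlines = []
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    summary = article.p.text
    print(headline)
    print(summary)
    print()  # blank line between articles
    headlines.append(headline)
```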
Like we saw before, I have my personal website pulled up here in the browser, and if I scroll down, we can see that we have a lot of video headlines and summaries, and the embedded videos themselves. So let's say we wanted to grab these titles, these summaries, and the links to the videos. First things first, let me delete what we had so far for our simple HTML file, and I'm also going to get rid of where we open that file. Now, the first thing we want to do is get the source code from my website using the requests library, and to do this we can just say source = requests.get, and we want to get my website, which is just http://coreyms.com. This requests.get call returns a response object, and to get the source code from that response object we can just add .text on the end. So now this source variable should be equal to the HTML of my website, and we can pass it in to BeautifulSoup. Now let's see if that worked: if we print out soup.prettify like we saw before, it should print out the formatted code for my website. It looks like that worked; if I scroll up, we can see that this does look like HTML. It's kind of a mess because it's a larger website, but we can see that these links seem to be coming from my website, so it looks like that worked. So now we can start parsing out the information that we want, and just like before, let's start off by grabbing one video's information, and then we'll loop through to get the information for all of the videos.
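The fetch step can be sketched as below. The helper name fetch_soup is my own, not from the video, and actually calling it requires a network connection, so the live call is left commented out:

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """Download a page and return a BeautifulSoup object of its HTML.

    requests.get() returns a Response object; its .text attribute
    is the page source as a string, which is what BeautifulSoup
    expects when parsing a page fetched from the internet.
    """
    source = requests.get(url).text
    return BeautifulSoup(source, 'html.parser')  # video uses 'lxml'

# Needs a network connection, so it is left commented out here:
# soup = fetch_soup('http://coreyms.com')
# print(soup.prettify())
```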
To grab the first headline and snippet for the first post on my page, let's inspect my website and see if we can figure out what the structure is. I'm going to make this a little larger, and now I'm going to use that Inspect functionality again within our browser to pinpoint exactly where the information we want to parse is. If I hover over my headline, right-click on it, and go to Inspect, we can see that it's a link inside of an h2 with a class of entry-title. Now, if I go up a little more, we're trying to find something that encompasses our headline, our summary text, and our video. If I hover over this article element with all of these different classes and scroll down, we can see that the article encompasses our headline, our summary text, and our embedded video, and if I scroll down a little more, we can see that it stops after the first post. So this is likely going to be our starting point, since it contains all of the information within this first post. If I scroll back up within this article, we have this h2 with a class of entry-title that has our header. If I expand this next part and go down a little bit, that's just metadata for the entry; but if I go over this entry-content div, that seems to have the summary text and the embedded video. If I expand that, the first paragraph is our summary text, and the second paragraph has the information for our embedded video. Okay, so this is a good starting point, so let's start off by grabbing this entire first article that contains all of this information. Now I'm going to close the inspector and take this down in size a little bit so that we can see the site while we're working. To grab that first article, let's just say article = soup.find, and then we will search for "article". If I save that, also print out this article with a space after it, and run it, this is all kind of a mess here, but we can actually prettify these tags as well. If I do a prettify on this tag, save, and run it, now we can see that this tag is well structured too, and we got all of the HTML for that first article. We can see the link that contains the title (this is a video about Python regular expressions), and if we go down a little more, we have the text summary, and we also have the embedded YouTube video. So we have all the information for the first article, and we can begin parsing out the headline, summary, and video. First let's grab the headline. If we look in the HTML, we have our h2, within that h2 we have a link, and the text of that link contains the headline. For now, let's comment out where we're printing the HTML for the article, and say headline is equal to... and we want to use the article HTML here, not the entire soup. So let's say article.h2.a to grab that anchor tag, and then .text to grab the text out of that anchor tag. Now let's print out that headline; if I save and run it, we can see that we got the title of the latest post, which is that tutorial on regular expressions. Now, I think this headline link is actually the first link within our article, so I don't think we actually needed the h2 parent tag here; if we had just said article.a.text, I believe we would have gotten the same result. It doesn't hurt to be a little overly specific, but you don't want to get carried away and put in every single parent tag, because that's going to stretch your line out far longer than it needs to be and just make it more confusing. So it's okay to be a little overly specific, but don't get carried away. Okay, so now that we've got the headline of this latest post, let's get the summary text for it. I'm going to comment out where we got the headline and uncomment our
prettified article HTML and print it back out, so that we can look and see where this summary text is. Our summary text is within a paragraph tag, and that paragraph tag is within a div with a class of entry-content. To grab that, let's comment out our article.prettify again, and below our headline let's just say summary = article. We're going to use that find method, because we're searching for a div with a specific class: we want to find a div, and to search for a specific class we can pass that in as an argument, so we can say class_='entry-content'. All of this returns the tag for that div, and within this div we want to parse out the first paragraph, so we can just do .p, and within that paragraph we want the text, so we can string it all together: .p.text. If we print that out, save, and run it, we can see that we correctly parsed out the summary text for that post.
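The headline and summary steps against the real site's structure can be sketched like this. The markup below is a hypothetical miniature of what the inspector showed (an article element, an entry-title h2, and an entry-content div), not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the structure found with the inspector.
html_doc = """
<article>
  <h2 class="entry-title"><a href="#">Python Tutorial: Regular Expressions</a></h2>
  <div class="entry-content">
    <p>In this video we learn how to use regular expressions.</p>
  </div>
</article>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # video uses 'lxml'
article = soup.find('article')

# Headline: the link text inside the article's h2.
headline = article.h2.a.text
# Summary: the first paragraph inside the entry-content div.
summary = article.find('div', class_='entry-content').p.text
print(headline)
print(summary)
```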
Okay, so lastly we need to get the link to the video for this post. Now, this one is going to be a little more difficult, but I want to show you it because sometimes parsing information can be a little ugly and require you to take several steps before getting to your final desired result. On this website the videos are embedded, so if we comment out our summary, uncomment our article.prettify, run this, and then find our video, it should be in an iframe, which is right here. The src attribute of this iframe points to the embedded version of the video; it's not the direct link to the video itself. But if you know how YouTube videos work, they all have a video ID, and the ID for this video is actually right here (I just highlighted it). Now, the question mark in the URL specifies where the query parameters start, so it's not part of the video ID. With that ID we could actually create the link to the video ourselves, so we need to parse the ID out of that URL. First we need to grab the URL from the iframe. Just like before, let's comment out our article HTML, go down below our summary, and say vid_src = article.find, because we want to find an iframe with a specific class. We can see that this iframe has a class of youtube-player, so I'm just going to copy that, and we'll find an iframe with class_ (remember the underscore) equal to 'youtube-player'. Now let's print out what we have so far; I'm going to get rid of those spaces, and this should be the HTML for that iframe. If we run this, we can see that we have the HTML for the iframe. Now, unlike what we've been doing before, we don't want to grab the text from this tag; what we really want is the value of its src attribute. If you want to get the value of an attribute of a tag, you can access the tag like a dictionary. So at the end here, after we grab that iframe, we can just access it like a dictionary and say that we want the 'src' attribute of that tag. If I save that and run it, now we can see that we got the link to that embedded video.
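The dictionary-style attribute access can be sketched like this, with a hypothetical iframe (the video ID in the URL is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical embedded-player iframe like the one found on the page.
html_doc = ('<iframe class="youtube-player" '
            'src="https://www.youtube.com/embed/abc123XYZ?version=3&rel=1">'
            '</iframe>')

soup = BeautifulSoup(html_doc, 'html.parser')

# Attribute values are read with dictionary-style access on the tag,
# unlike .text, which would give the (empty) contents of the iframe.
vid_src = soup.find('iframe', class_='youtube-player')['src']
print(vid_src)
```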
Now we're going to have to parse this URL string to grab the ID of the video, and we'll break this up into several lines. First, we can see that the ID comes after a forward slash, so let's split up this string based on forward slashes. If I go down another line, I can say vid_id = vid_src.split, and we want to split on a forward slash. Now let me print this out so you can see what this does (and let me comment out printing vid_src), save that, and rerun it. If you've never used the split method on a string, it basically just splits the string into a list of values based on the character you specify. So we can see that our URL is now broken into a list of several parts, based on where those forward slashes were. If we look at the items in our list, our video ID is right here, because it was right after a forward slash. Counting indexes 0, 1, 2, 3, 4, it's at index 4, so let's specify that we want index 4 of the returned list: right after that split method we can just say that we want index 4. Now if we run this, we can see that we're getting closer: we have the video ID, but we still have these query parameters on the end. Like I said before, the question mark specifies where the parameters of the URL begin, and the video ID is before that, so if we do another split on the question mark, it should separate those out. I'll go to a new line so that we're not making this one too long or too complicated, and we can just say vid_id = vid_id.split on the question mark. If we save that and run it, that got split up: our video ID is the first item of the list, and the query parameters are the second item. So to grab the video ID we can just take index 0 of the returned list; right after that, I'll just say that I want index 0. If I save that and run it, we can see that we got the video ID. Now, I know that was a lot of parsing, but sometimes website source code doesn't have the information you want in the most accessible way, so I wanted to show you how you might go about getting the data you want, even if it's a little bit messy. Okay, so now we can create our own YouTube link using this video ID.
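The two-step split can be sketched like this, using a hypothetical embed URL of the same shape as the iframe's src attribute:

```python
# Hypothetical embed URL like the iframe's src attribute.
vid_src = 'https://www.youtube.com/embed/abc123XYZ?version=3&rel=1'

# Splitting on / breaks the URL into:
# ['https:', '', 'www.youtube.com', 'embed', 'abc123XYZ?version=3&rel=1']
vid_id = vid_src.split('/')[4]

# The ID sits before the ?, so a second split strips the query parameters.
vid_id = vid_id.split('?')[0]
print(vid_id)  # abc123XYZ
```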
So the way YouTube links are formatted is like this. I'll comment out the video ID for now and scroll down here a little bit. We can just call this variable youtube_link, and we will set this equal to a formatted string: this will be https, then youtube.com, then a forward slash and the watch route, and then the query parameter here is going to be a question mark with v, which stands for video, so v equals that video ID; I will just put in a placeholder there with that video ID. So if we print out this YouTube link that we just created, and I save that and run it, you can see that now we have this YouTube link. Now, I used f-strings to format that string, but those are only available in Python 3.6 and above. If you're using an older version of Python, you can use the format method on that string to insert that placeholder, and I have a separate video on how to format strings if anyone needs to see how to do that; I'll leave a link to that video in the description section below. Now that we've run this and got the link that we created, if I copy it and paste it into my browser over here, we can see that it goes directly to the video that we specified. OK, perfect.
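Both versions of that string formatting look roughly like this (the video ID is the same hypothetical one as before):

```python
vid_id = '6tNS--WetLI'  # hypothetical video ID from the previous step

# Python 3.6+: f-string with the ID filled into the v= query parameter
youtube_link = f'https://youtube.com/watch?v={vid_id}'

# Older Pythons: the format method instead of an f-string
youtube_link_legacy = 'https://youtube.com/watch?v={}'.format(vid_id)

print(youtube_link)
```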
So we've scraped all the information that we wanted from that first article. Just like in our earlier example with the simple HTML, now that we've got the information for one article, we can loop over all the articles and get that information for all of them. To do that, we can just uncomment the code that we grabbed here for the summary, and uncomment the code for the headline, and I can remove our commented-out print statements here just to clean things up a bit; let me remove our prettify-article print statement there. OK, so just like we did before, instead of just finding the first article, now we want to find all of the articles, so we can use the find_all method instead, and remember this returns a list of all of those articles. Instead of just setting that equal to one variable called article, we can put in a for loop: we can say for article in that list, be sure we put in that colon there, and now we have to put all of this information within our for loop, so we will indent that over and save it. Just like I did in our earlier example, right here at the bottom I'm also going to put a blank print statement just so that it separates out the information from all of our articles. So now if I run this, and pull our output up here a little bit and scroll up to the top, we can see that we got the headline for our first article, the text summary for our first article, and the link to that YouTube video, and we did this for all the articles on the web page.
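Put together, the loop looks roughly like this. The inline HTML, the class names (entry-content, youtube-player), and the tag layout are assumptions standing in for the real page source described in the video:

```python
from bs4 import BeautifulSoup

# Hypothetical page source: two <article> blocks shaped like the posts
# described above (heading link, summary paragraph, embedded video).
html = '''
<article>
  <h2><a href="#">Python CSV Module</a></h2>
  <div class="entry-content"><p>Reading and writing CSV files.</p></div>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/abc123?version=3"></iframe>
</article>
<article>
  <h2><a href="#">Python OS Module</a></h2>
  <div class="entry-content"><p>Navigating the file system.</p></div>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/def456?version=3"></iframe>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag, so we loop instead of taking one
for article in soup.find_all('article'):
    headline = article.h2.a.text
    summary = article.find('div', class_='entry-content').p.text
    vid_src = article.find('iframe', class_='youtube-player')['src']
    vid_id = vid_src.split('/')[4].split('?')[0]
    print(headline, summary, vid_id)
    print()   # blank line separating articles
```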
OK, perfect! So now we can see that that works: it's getting all the information from the latest articles on the home page of the website. We're almost finished up, but let me show you a couple more things. Sometimes you'll run into situations where you're missing some data, and if that happens then it could break our scraper. Maybe you're pulling down a list of items and one is missing an image or something like that that you thought would be there. To show what this looks like, I'm going to edit one of my posts here and remove the link to one of the YouTube videos. Instead of having you watch me log in to my web page to do this, I'm just going to fast-forward this video a bit and skip to the point where I've edited this post. OK, so I logged in and edited my page so that there is no longer a video link for the post a couple of posts down here; you can see that this post does not have a video associated with it. So now if I go back to our code that was just working before and try to rerun it, we can see that it gets the first post just fine: it gets the title, the summary text, and the YouTube video link. But for the second post, it gets the title and the summary text, and then when it gets to the YouTube link it breaks our script with the error "'NoneType' object is not subscriptable". Basically, it's breaking on this line here where it's trying to find that iframe with the youtube-player class. So if you run into something like this and you just want to
skip past any missing information, what we can do is put that part of the code into a try/except block. I'm going to pull down our output a little bit here, and here at the bottom I'm just going to create a try/except block; within Sublime Text this has autocomplete, so I just click there for the try/except and it gives me a little template. Within the try, we want to take all of the code that gets that video information and put it in our try block, so I'll just paste all that in and indent it correctly there. I meant to cut that out, so I need to delete all of that, and let's get this print moved out; we will put that below the try/except block. OK, so the way that we have this set up right now, this youtube_link variable will only get set if this succeeds; if it fails, then it's going to go to our except block. Now, sometimes people will just put in pass if they want to skip over this, but in our case we still want this youtube_link variable to be set, so instead of just passing here, let's set this youtube_link variable equal to None, just to say that we couldn't get that YouTube link. OK, so now with that code within a try/except block, let me make our output a little bit larger here. Now if we save that and run it, we should get all of the information on our page. Our top post here still works fine: we got the title, the summary text, and the YouTube link. And for our second post, which has the missing video, we still have the title and the summary text, and the video link is just set to None; then it just continues on with the other posts after that. So that's what we wanted: the video was missing, but it didn't break our scraper.
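In code, that guard looks something like the sketch below. The lone `<article>` snippet is made up to mimic a post with no embedded video, so find() returns None and the `['src']` lookup raises the same error as in the video:

```python
from bs4 import BeautifulSoup

# Hypothetical post with no iframe, mimicking the edited post above
html = '<article><h2><a href="#">Post with no video</a></h2></article>'
article = BeautifulSoup(html, 'html.parser').article

try:
    # find() returns None here, so subscripting raises
    # TypeError: 'NoneType' object is not subscriptable
    vid_src = article.find('iframe', class_='youtube-player')['src']
    vid_id = vid_src.split('/')[4].split('?')[0]
    youtube_link = f'https://youtube.com/watch?v={vid_id}'
except Exception:
    youtube_link = None  # record the miss instead of crashing the loop

print(youtube_link)
```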
It still went and got the information for all the other posts on the page. OK, so now we're done scraping the information, so I'm just going to bump up the font in Sublime Text here so that we can see everything a little bit larger, and scroll up to the top. Now that we've scraped the information that we want from our web page, we can save it in any way that we'd like. Right now we're just printing this information out to the screen, and maybe that's fine for your needs, but you can also save it to a file, or save it to a CSV, or anything that you'd like. For example, real quick, let's say that we wanted to scrape this page and save that information to a CSV file. We've already done the hard part of getting the information that we want from the web page; now to save it to a CSV file, we can simply import the csv module, so we'll import csv. Then here at the top, right before our for loop, we can open a CSV file: we'll just create a variable here called csv_file, and we'll set this equal to open, and we want to call this cms_scrape.csv (you can call that whatever you'd like), and we want to write to this file, so we'll pass in a 'w' for that. Now, this video isn't about working with files or CSVs; I do have a
separate video going into detail about how to work with CSVs, but for this video we'll just walk through it really quickly, so I'm not going to go into much detail here. We could use a context manager here, but the way that we currently have our script set up, I think it'll be a little quicker to just set this variable and open the file like this. Now we can write some lines to set up our CSV file, and again I'm not going to go into a lot of detail here; I have a separate video on this if you're interested. We can say csv_writer is equal to csv.writer, the writer method of that csv module, and we want to pass in that csv_file that we just opened. Now we want to write the headers of this CSV file, so we can take the csv_writer that we just created and do a dot writerow, and we can pass in a list of values that we want to write to this row; we're just passing in the headers for now. Our headers are going to be headline, and summary, passed in as text, and also video link. Those are the headers to our CSV file, which are basically the column names; that's the data that we're going to be saving to this CSV. Now, within our for loop where we're getting that scraped information, we can just write that information to our CSV file. At the very bottom of our loop, after we print that blank line, we can write that data to our CSV with each iteration through our for loop: we can say csv_writer dot writerow, and we're going to pass in a list here, and the values are going to be our headline first, then our summary second, and then our YouTube link third. Lastly, at the very end of our script, outside of the for loop, since we didn't use a context manager to open that file, we need to close it at the end of the script; we can say csv_file (not csv_writer; this is the actual CSV file) dot close. So now I can run this code.
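The CSV steps just described can be sketched like this, with made-up rows standing in for the scraped data (the newline='' argument is an addition here; the csv docs recommend it to avoid blank rows on Windows):

```python
import csv

# Open the output file for writing, as described above
csv_file = open('cms_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)

# Header row: the column names for the data we're saving
csv_writer.writerow(['headline', 'summary', 'video_link'])

# Hypothetical scraped results; in the real script these come
# from the for loop over the articles
posts = [
    ('Python CSV Module', 'Reading and writing CSV files.',
     'https://youtube.com/watch?v=abc123'),
    ('Post with no video', 'Summary text only.', None),
]
for headline, summary, youtube_link in posts:
    csv_writer.writerow([headline, summary, youtube_link])

csv_file.close()  # no context manager was used, so close explicitly
```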
You can see that it prints out all the information like it did before, but now if I open up my sidebar here, we can see that we have this cms_scrape.csv file in the sidebar. I'm going to open this within Finder, which is just within the file system, and open it with any kind of spreadsheet application; mine is Numbers, but yours might be Excel. Now we can see that we have all this data available within our spreadsheet, so let me maximize this and make it a little bit more readable: I'll make the columns a little bit smaller there and then wrap the text in all of our cells, so we can see all this information. Here are our headers: headline, summary, and video link. Here are all of our headlines parsed out for us, and our summaries, and then you can see in the video links, with that second post where the video was missing, this got written in as blank there, since there's a None value there. OK, so now I can exit out of that and pull our script back up.
OK, so I think that is going to do it for this video. Hopefully now you have a pretty good idea of how you can go out and scrape information from websites. One thing that I do want to mention is that if you want data from a large website like Twitter or Facebook or YouTube or something like that, then it may be beneficial for you to see whether or not they have a public API. Public APIs allow those sites to serve up data to you in a more efficient way, and sometimes they don't appreciate it if you try to scrape their data manually; they'd rather you go through the public API. But it's usually those larger websites that have public APIs, so if you want data from a small or medium-size website, then likely you'll have to do something like we did here. I should also point out that you should be considerate when scraping websites. Computer programs allow us to send a lot of requests very quickly, so be aware that you might be bogging down someone's server if you aren't careful. So try to keep that in mind: after this tutorial, try not to go out and hammer my website with tons of requests through your program, and that goes for other websites too. Some websites will even monitor whether they're getting hit quickly, and they may block your program if you're hitting them too fast.
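One simple way to act on that "be considerate" advice is to sleep between requests. This is a sketch under stated assumptions: the fetch callable and the URLs are hypothetical stand-ins (in a real scraper, fetch would be something like requests.get):

```python
import time

def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing between requests so the
    scraper doesn't bog down someone's server."""
    results = []
    for i, url in enumerate(urls):
        if i:                    # no need to wait before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function instead of a real HTTP call
pages = fetch_politely(['/a', '/b'],
                       fetch=lambda u: f'<html>{u}</html>',
                       delay=0.1)
print(pages)
```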
Other than that, if anyone has any questions about what we covered in this video, feel free to ask in the comment section below and I'll do my best to answer them. If you enjoy these tutorials and would like to support them, there are several ways you can do that. The easiest way is to simply like the video and give it a thumbs up. It's also a huge help to share these videos with anyone who you think would find them useful, and if you have the means, you can contribute through Patreon; there's a link to that page in the description section below. Be sure to subscribe for future videos, and thank you all for watching.
Just another perfect video from Corey❤❤❤ Thank you man!
You da man, Corey! Great stuff.
Nice video with clear explanation
ConnectionError: HTTPConnectionPool(host='coreyms.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000029C009CBCD0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
This is the error I see whenever I try to retrieve the source code from your website.
Great work … covered a lot of knowledge in a very efficient and concise way… Thank you, Corey Schafer
What in case of diff div?? I have 4 articles with different div tag
Excellent Corey. Very comprehensive, well explained, step by step, easy to follow. You're a very talented teacher imho.
What if you split the url at the question mark? That's easier… your videos are always on point, you never fail us
One word. Legend.
🫂🫂🫂
Thank You Sir, nicely explained..
Thanks, Corey, for this awesome tutorial. It is definitely the best available.
I would really appreciate your tutorial on how to use Web API for Web scraping.
Once again, thanks.
You sir, by far, have the best structured videos for learning from. Great job there young feller!
how do you make money web scraping?
How use selenium in web scraping
You’re an amazing tutor. Thank you so much for sharing your knowledge 🙏🏽 I’ve learned so much from your channel and I’m looking forward to more tutorials
Love how you explain things so clearly. Corey you are the best!
Here your production thru the years…could not include the df.plot() but many thanks!
headline headline_cum
date
2013-12-31 3 3
2014-12-31 17 20
2015-12-31 45 65
2016-12-31 18 83
2017-12-31 34 117
2018-12-31 18 135
2019-12-31 27 162
please help …[<p>Welcome back. Just a moment while we sign you in to your……. it's saying this when I say paragraph=soup.find_all('p')
print(paragraph)
please help, what to do.. I'm stuck
As always, fantastic content!
when i try to scrape websites like Glassdoor or indoor it says access denied by cloudflare what should i do ?
17:00
Thanks
is this being done in command prompt or python terminal plese clarify
very understandable, thank you from a uni student making a project on web scraping
Great video, better than the others I watched. Thanks.
Great Explanation!!!!!!!!!!!!!!!!!
Thank You Sooooooooooo much.
5 years ago, this video was already a great tutorial. Now, it's even more valuable. Thanks for the great content!
Thank you
Corey is very clear and concise.
Hi Corey. You're the best. Could you pls continue your python series into aws, cloud computing, lambda, serverless, …
How to extract the data if paragraph<p> is not there. Like – You are accessing the data from Google Patent website. There is no <p> as such in the HTML. Please reply
Thank you so much, Corey… I have been following your tutorials. Really improved my concepts. In gratitude!