Python Tutorial: Web Scraping with BeautifulSoup and Requests
- March 25, 2024
- Posted by: MainInstructor
- Category: BASIC Go JavaScript Python
Hey there, how's it going, everybody? In this video we'll be learning how to scrape websites using the BeautifulSoup library. Now, if you don't know what it means to scrape websites, it basically means parsing the content of a website and pulling out exactly the information that you want. So, for example, maybe you want to pull down some headlines from a news site, or grab some scores from a sports website, or monitor the prices of some items in an online store, or something like that. Now, to show an example of this, let's take a look at the finished product that we'll be
building in this video, and then we'll learn how to build it. So I'm here on my personal website, and on my home page I have a lot of posts of my most recent videos. Every post has a title, which is a big heading tag, then a text summary of the video, and then a link to the video. So let's say we wanted to write a scraper that would go out and grab all of the post titles, summaries, and links to the videos from my home page, and ignore all of the other information. To do this, I have a finished version of what we'll be building in this video. Right now I can just run it with Python; the script is called cms_scrape.py. If I run it, it goes out and scrapes all of the titles, summaries, and links. We can see here we have a title (this is my CSV module video), then the summary text, and then the link text. Now, not only did this scrape the information from the website and print it out in the terminal, it also created a CSV of all of this information. If I open up this cms_scrape.csv, it opens in Numbers, but you could also open it in Excel. It isn't very readable right now, but if I make these columns a little smaller and wrap the text, then we should be able to read it. You can see we have a column with all of the headlines for the articles on that home page, then all of the text summaries, and then a link to each video. So that's what web scraping is: it goes out and pulls down all of the
information that you want from a specific website. Now, if you tried to parse out that information with something you had built in Python yourself, you'd probably run into a lot of issues, but luckily the BeautifulSoup library makes parsing out all of this information a lot easier. We'll also be using the requests library in this video to make our web requests. You could use the built-in urllib module, but the requests library is extremely popular for fetching websites, so we're going to go ahead and use that. So let's get started and see how to do this. First of all, let's make sure we have everything installed that we need. To install BeautifulSoup you can just use the pip install command, so to do
this we can just say pip install, and the package is called beautifulsoup4. You can see that I already had that installed, but if you don't, then yours should just go through the installation at that point. Now, you definitely want to install beautifulsoup4, because there is an older version just called BeautifulSoup; beautifulsoup4 is the one that's most up to date. Once that's installed, we need to make sure we have a parser to parse our HTML. I won't go deep into the details of these parsers, but there are some small differences between them, and they can return different results depending on the HTML you're trying to parse. If you're parsing perfectly formed HTML, those differences aren't going to matter, but if there are mistakes in the HTML, then the different parsers will try to fill in missing information differently. BeautifulSoup has a section in its documentation about the differences between those parsers, and basically it suggests installing and using the lxml parser, so that's what we're going to use in this video. It also says that the html5lib parser uses techniques that are part of the HTML5 standard, so you could use that one too, but most of the time the choice between the parsers isn't going to matter all that much as long as you're working with good HTML. I'll leave a link to the differences between those parsers in the description section below if you want to read more about them. To make sure we have the lxml parser installed, we can install it with pip as well, so we can just say pip install lxml. If we run that, I already have it installed, but yours will install there if you don't. If you want the html5lib parser, you can just do a pip install html5lib. Like I said, we'll be using lxml in this video, but html5lib is popular as well. Now we also need the requests library, and just the same, we can do a pip install requests and run that. You can see that mine's already installed, but if you don't have it, then yours should get pulled down right there. Okay, so now that we have those installed, let me clear that out, and let's take a look at what we can use these for. Now, you don't have to be extremely familiar with HTML in order to scrape websites, but it definitely helps to know it. So basically,
HTML is structured in a way where all of the information is contained within certain tags, and if you're at all familiar with XML, it's very similar to that. Now, I have an extremely basic HTML file open here in my browser, and we can see that this small example just has one big header that says "Test Website", and then we have two large links for articles: one is the "Article 1 Headline" with a small text summary below it, and then we have a big "Article 2 Headline" with a text summary below that, and then we have a footer down at the bottom. Now, this is how browsers display HTML. We're using the Chrome browser right now, but in the background the source code looks a bit different, so I have the source code for this very basic website pulled up over here on the right side of my screen. Let me make this a little smaller and stretch this over so that we can better see how the source is structured. So we have these tags throughout our document, and the opening tags are surrounded by these angle brackets. We have this head tag that opens, and there is also a closing tag down here, which is the same except it has a forward slash after the first angle bracket. So the close of our head tag is this line here, and all of this content is within that head tag. And all of these tags can be nested, so if we want to find our article headlines and article summaries, we can look down here in our body tag. We have an opening body tag, and within the body we have our "Test Website" h1, which is a heading, and then we have a div tag which has a class of article. Within this div we have our h2 tag (h2 is another heading, a subheading), and within that h2 we have a link; these a tags are anchor tags, which are links. So this is the text of the link, "Article 1 Headline"; that's what gets displayed over here on the actual website. And this href is where the link actually points, in this case a page called article_1.html. Now, these classes, like how this div has a class of article, are mainly used for CSS styling, and they can also be used within JavaScript to identify specific elements. Below that heading tag we just have a paragraph tag, which is just a p, and this is the text summary of the article. So we can see that this entire div with the class of article has our h2 heading and then our paragraph for the summary, and then this is just repeated down here: for our second article we have another div with the class of article, and then another h2, but this one has the "Article 2 Headline" and the article 2 link, and then the summary text for article 2. And lastly we have a footer down at the bottom, which is just a div with a class of footer, and that has a paragraph tag within it with some text. Everything else in here is just extra information: we have some scripts, and up at the top we have some style sheets and things like that, but all of this in the body is what gets displayed over here on the website. So let's use this very simple example to see how we can parse out information using BeautifulSoup.
So I'm going to open up a file here called scrape.py. All we have in here so far are our imports for BeautifulSoup and requests: we have "from bs4 import BeautifulSoup", and then we're also importing requests. So let's say we wanted to parse out the article headlines and the summaries from our very simple website over here; in this example, that's just the "Article 1 Headline" and its summary text, and then the "Article 2 Headline" and its summary text. First things first, let's pass our HTML to BeautifulSoup so that we can get a BeautifulSoup object. Now, there are a couple of ways to do this: we can either pass in the HTML as a string, which is what we'll do in a minute when we fetch our website from the internet, or we can pass in an HTML file. In our case we have this sample HTML file within our current directory, so let's just open it up and pass it to BeautifulSoup. To open the file we can just say with open, and the HTML file is called simple.html. It's in the same directory as our script, so we don't have to specify a path, and we're just going to read it in (read is the default mode, so we don't have to do anything there), and I'll say "as html_file". Then, to pass that html_file into BeautifulSoup, we can just say soup = BeautifulSoup, pass in the html_file, and specify our parser; like I said, for this video we're going to use the lxml parser. Now, if working with files is new to you and you want to know more about this "with open" statement and things like that, then I do have a video specifically on working with file objects, and I'll leave a link to that in the description section below. Okay, so now we have this soup variable, which is a BeautifulSoup
object of our parsed HTML. So let's just print this out and see what we get; we can just print out soup. If I save that and run it (let me make this a little bit bigger), we can see that this just prints out all of the HTML, very similar to what we just looked at. Now, this HTML isn't formatted in a very readable way; it's all pushed over to the left. If we actually look at that simple.html file, we can see that it's nicely indented. So, in order to format this so that we can more clearly see which tags are nested within each other, we can use the prettify method to clean this up a bit. If we say soup.prettify (and that is a method, so we have to put in the parentheses), save that, and run it, now we can see that it indents these so that we can see which tags are nested within each other. So here is that head tag we saw before, and everything that is indented within it belongs to that head tag. Okay, so now let's see how to grab information from this HTML. The easiest way to get information from a tag is to just access it like an attribute.
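The steps so far can be sketched like this. Note that the exact contents of simple.html aren't shown in the video, so the inline HTML string below is an approximation of the structure described, and the built-in 'html.parser' is used in place of lxml so the sketch runs without any extra install:

```python
from bs4 import BeautifulSoup

# Approximate stand-in for the simple.html file from the video.
html_doc = """
<html>
<head><title>Test - A Sample Website</title></head>
<body>
  <h1>Test Website</h1>
  <div class="article">
    <h2><a href="article_1.html">Article 1 Headline</a></h2>
    <p>This is a summary of article 1</p>
  </div>
  <div class="article">
    <h2><a href="article_2.html">Article 2 Headline</a></h2>
    <p>This is a summary of article 2</p>
  </div>
  <div class="footer">
    <p>Footer Information</p>
  </div>
</body>
</html>
"""

# The video uses the 'lxml' parser; the built-in 'html.parser'
# behaves the same for well-formed HTML like this.
soup = BeautifulSoup(html_doc, 'html.parser')

# prettify() re-indents the markup so the nesting is easy to see.
print(soup.prettify())
```

In the video the HTML comes from a file handle passed straight to BeautifulSoup instead of an inline string; both work, since the constructor accepts either a string or an open file object.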
If we look here at our HTML, our title should just be "Test - A Sample Website". The easiest way to get that is to access it like an attribute, so I will say match = soup.title, and then we'll just print out that match. If I save that and run it, we can see that it parsed out that title tag. Now, it still has the title tags around the text, so if we only want the text of the title tag, we can access the text attribute of that tag. We can just add that to the end here, so we'll say .title.text; if I save that and run it, now we only get the text of that title tag. Now, searching for a tag like we did here, by accessing it like an attribute with .title, will get the first title tag on the page, but the first tag on the page might not always be what we want. So we can use the find method to do something similar, but it also allows us to pass in some arguments so that we can find the exact tag we're looking for. For example, if I use the dot access to find the first div on the page with soup.div, and save and run that, we can see that it got the first div tag on our page with all of its child tags, which is everything for that first article. But if we wanted to grab the div that has a class of footer, for example, then we'll have to use that find method and pass in some arguments. So we'll do soup.find and search for a div. If I save and run that, we just get the same thing, the first div on the page, but with this find method we can pass in arguments of attributes that narrow down exactly what tag we want to find. For example, I can pass in an argument of class, and after class we need an underscore: class_='footer'. Now, these arguments can match any attributes that your tag might have, and most of the time you can pass in arguments just like they are in the HTML. So if you wanted to match a div with an id of footer, you could just pass in an argument of id='footer'. The reason we need an underscore after class is that class is a special keyword in Python, so BeautifulSoup uses class_ instead; if you were confused about that, that's why. So if we save that and run it, we can see that now we're not getting the first div on the page; we're actually getting the div with the class of footer.
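Those lookups, collected into one runnable sketch (same caveats as before: a hypothetical stand-in HTML string, and the built-in html.parser instead of lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the structure described in the video.
html_doc = """
<html><head><title>Test - A Sample Website</title></head>
<body>
  <div class="article">
    <h2><a href="article_1.html">Article 1 Headline</a></h2>
    <p>This is a summary of article 1</p>
  </div>
  <div class="footer"><p>Footer Information</p></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)       # the whole <title> tag, brackets and all
print(soup.title.text)  # just the text inside the tag
print(soup.div)         # dot access returns only the FIRST matching tag

# find() narrows the search with attribute filters; class needs a
# trailing underscore because class is a reserved word in Python.
footer = soup.find('div', class_='footer')
print(footer.p.text)
```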
Okay, so now let's say we wanted to parse the HTML and get all of the article headlines and summaries from our page. Now, any time you want to get multiple things from a page, a good way to start is to just get one of whatever it is you're trying to parse. So, for example, if I wanted to grab the headline and snippet from each article on our page over here, I'd start by first grabbing that information for one article, and once we have that working, we can apply the same logic to all of our articles. So if we go back to our browser and look at our page: in order to dig down into the HTML and find exactly where our article headline and summary are, within the Chrome browser we can just right-click on whatever it is we want to parse and then click on Inspect. Now, I know this is a little small, so let me make it a bit bigger, and we'll walk through a little bit of how to use this. I'm using Chrome here, but pretty much every major browser has something like this these days, and it's really useful for finding exactly what you want. Within the inspector, if I hover over our div with the class of article, you can see that in the top part it highlights everything that is within that div; if I go down to the h2, it only highlights that h2; if I hover over the href, it highlights that link; and if I hover over the paragraph, it highlights that paragraph. So we can see exactly what is what, and the same with the second article: if I hover over this article, I can see that it has the "Article 2 Headline" and summary text, and I can click on this little arrow to expand it, and it shows me everything within that div: the h2, the anchor tag, and then the paragraph tag with the summary text. So, just like we saw before in the source code, our article headlines are within a div with a class of article, then an h2, then an anchor tag. So let's go ahead and grab that article div. Let me make this a little smaller so that we can see this. I'm going to change this variable name over to article and then print that out. This is going to be the first div with the class of article, so if we save that and run it, we can see that now we have that first article, and we can search that matched tag just like we searched the entire HTML document: we can access child tags with the dot access, or we can use the find method. So, for example, if we wanted to dig down into the text of the headline, we could say headline is equal to... and now we don't want to use the entire soup, which is the whole HTML; we only want to search within this article. So we'll say article.h2, and within that h2 we want to access the anchor tag, which is .a, and then we want the text of that anchor tag, so we can string all of that together: article.h2.a.text. If I print out that headline (and let me comment out printing the entire article for now), save, and run it, we can see that it grabbed the text of that first article's headline. And we can do the same thing with the article summary; it's just a paragraph within our article, so if we go down a couple of lines, we can say summary = article.p.text to grab that paragraph's text. If we print out that summary, save, and run it, we can see that now we have the "Article 1 Headline" text and the text summary of that article as well. Okay, so now we have the code for grabbing a headline and a summary from a single article.
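The single-article version can be sketched like this (hypothetical stand-in markup again, with html.parser in place of lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup for the two-article sample page.
html_doc = """
<body>
  <div class="article">
    <h2><a href="article_1.html">Article 1 Headline</a></h2>
    <p>This is a summary of article 1</p>
  </div>
  <div class="article">
    <h2><a href="article_2.html">Article 2 Headline</a></h2>
    <p>This is a summary of article 2</p>
  </div>
</body>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find() returns only the first div with a class of article.
article = soup.find('div', class_='article')

# A matched tag can be searched the same way as the whole soup.
headline = article.h2.a.text
summary = article.p.text
print(headline)
print(summary)
```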
Now that we have this working for one article, we can reuse this logic to parse the information from all of our articles. Right now we're using the find method to just get the first article, but now we need to loop through all of the articles. To get all of the articles, instead of using find we can use the find_all method. With find_all, instead of returning just the first tag that matches those arguments, it returns a list of all of the tags that match. So instead of just setting this variable, we can loop over the list that it returns: instead of saying article equals, we can create a for loop and say for article in soup.find_all, since this returns a list. Now that we have a for loop, I'm just going to get rid of that print article line, and then I'm going to put the logic for grabbing the headline and summary from an article within that for loop. Now it will loop through all of the articles, which in this case is just the two of them, and we'll get the information for both. And let me also put in one more blank print statement within our loop, so that at the end we have a blank line between our articles. So if I save that and run it, now we can see that we have the "Article 1 Headline" and the summary for that article, and we also have the "Article 2 Headline" and the summary for that article. Okay, so this is good; we're starting to see how this would be useful for getting information from websites. So now let's do something similar, but with an actual website.
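The switch from find to find_all can be sketched like this (same hypothetical two-article markup, html.parser in place of lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup for the two-article sample page.
html_doc = """
<div class="article">
  <h2><a href="article_1.html">Article 1 Headline</a></h2>
  <p>This is a summary of article 1</p>
</div>
<div class="article">
  <h2><a href="article_2.html">Article 2 Headline</a></h2>
  <p>This is a summary of article 2</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all() returns a list of every matching tag,
# so we loop instead of grabbing a single article.
headlines = []
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    summary = article.p.text
    print(headline)
    print(summary)
    print()  # blank line between articles
    headlines.append(headline)
```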
Like we saw before, I have my personal website pulled up here in the browser, and if I scroll down, we can see that we have a lot of video headlines and summaries, and the embedded videos themselves. So let's say we wanted to grab these titles, these summaries, and the links to the videos. First things first, let me delete what we had so far for our simple HTML file, and I'm also going to get rid of where we open that file. Now, the first thing we want to do is get the source code from my website using the requests library, and to do this we can just say source = requests.get, and we want to get my website, which is just http://coreyms.com. This requests.get call returns a response object, and to get the source code from that response object we can just add .text on the end. So now this source variable should be equal to the HTML of my website, and we can pass it in to BeautifulSoup. Now let's see if that worked: if we print out soup.prettify like we saw before, it should print out the formatted code for my website. It looks like that worked; if I scroll up, we can see that this does look like HTML. It's kind of a mess because it's a larger website, but we can see that these links seem to be coming from my website, so it looks like that worked. So now we can start parsing out the information that we want, and just like before, let's start off by grabbing one video's information, and then we'll loop through to get the information for all of the videos.
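The fetch step can be sketched as below. The helper name fetch_soup is my own, not from the video, and actually calling it requires a network connection, so the live call is left commented out:

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """Download a page and return a BeautifulSoup object of its HTML.

    requests.get() returns a Response object; its .text attribute
    is the page source as a string, which is what BeautifulSoup
    expects when parsing a page fetched from the internet.
    """
    source = requests.get(url).text
    return BeautifulSoup(source, 'html.parser')  # video uses 'lxml'

# Needs a network connection, so it is left commented out here:
# soup = fetch_soup('http://coreyms.com')
# print(soup.prettify())
```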
To grab the first headline and snippet for the first post on my page, let's inspect my website and see if we can figure out what the structure is. I'm going to make this a little larger, and now I'm going to use that Inspect functionality again within our browser to pinpoint exactly where the information we want to parse is. If I hover over my headline, right-click on it, and go to Inspect, we can see that it's a link inside of an h2 with a class of entry-title. Now, if I go up a little more, we're trying to find something that encompasses our headline, our summary text, and our video. If I hover over this article element with all of these different classes and scroll down, we can see that the article encompasses our headline, our summary text, and our embedded video, and if I scroll down a little more, we can see that it stops after the first post. So this is likely going to be our starting point, since it contains all of the information within this first post. If I scroll back up within this article, we have this h2 with a class of entry-title that has our header. If I expand this next part and go down a little bit, that's just metadata for the entry; but if I go over this entry-content div, that seems to have the summary text and the embedded video. If I expand that, the first paragraph is our summary text, and the second paragraph has the information for our embedded video. Okay, so this is a good starting point, so let's start off by grabbing this entire first article that contains all of this information. Now I'm going to close the inspector and take this down in size a little bit so that we can see the site while we're working. To grab that first article, let's just say article = soup.find, and then we will search for "article". If I save that, also print out this article with a space after it, and run it, this is all kind of a mess here, but we can actually prettify these tags as well. If I do a prettify on this tag, save, and run it, now we can see that this tag is well structured too, and we got all of the HTML for that first article. We can see the link that contains the title (this is a video about Python regular expressions), and if we go down a little more, we have the text summary, and we also have the embedded YouTube video. So we have all the information for the first article, and we can begin parsing out the headline, summary, and video. First let's grab the headline. If we look in the HTML, we have our h2, within that h2 we have a link, and the text of that link contains the headline. For now, let's comment out where we're printing the HTML for the article, and say headline is equal to... and we want to use the article HTML here, not the entire soup. So let's say article.h2.a to grab that anchor tag, and then .text to grab the text out of that anchor tag. Now let's print out that headline; if I save and run it, we can see that we got the title of the latest post, which is that tutorial on regular expressions. Now, I think this headline link is actually the first link within our article, so I don't think we actually needed the h2 parent tag here; if we had just said article.a.text, I believe we would have gotten the same result. It doesn't hurt to be a little overly specific, but you don't want to get carried away and put in every single parent tag, because that's going to stretch your line out far longer than it needs to be and just make it more confusing. So it's okay to be a little overly specific, but don't get carried away. Okay, so now that we've got the headline of this latest post, let's get the summary text for it. I'm going to comment out where we got the headline and uncomment our
prettified article HTML and print it back out, so that we can look and see where this summary text is. Our summary text is within a paragraph tag, and that paragraph tag is within a div with a class of entry-content. To grab that, let's comment out our article.prettify again, and below our headline let's just say summary = article. We're going to use that find method, because we're searching for a div with a specific class: we want to find a div, and to search for a specific class we can pass that in as an argument, so we can say class_='entry-content'. All of this returns the tag for that div, and within this div we want to parse out the first paragraph, so we can just do .p, and within that paragraph we want the text, so we can string it all together: .p.text. If we print that out, save, and run it, we can see that we correctly parsed out the summary text for that post.
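The headline and summary steps against the real site's structure can be sketched like this. The markup below is a hypothetical miniature of what the inspector showed (an article element, an entry-title h2, and an entry-content div), not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the structure found with the inspector.
html_doc = """
<article>
  <h2 class="entry-title"><a href="#">Python Tutorial: Regular Expressions</a></h2>
  <div class="entry-content">
    <p>In this video we learn how to use regular expressions.</p>
  </div>
</article>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # video uses 'lxml'
article = soup.find('article')

# Headline: the link text inside the article's h2.
headline = article.h2.a.text
# Summary: the first paragraph inside the entry-content div.
summary = article.find('div', class_='entry-content').p.text
print(headline)
print(summary)
```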
Okay, so lastly we need to get the link to the video for this post. Now, this one is going to be a little more difficult, but I want to show you it because sometimes parsing information can be a little ugly and require you to take several steps before getting to your final desired result. On this website the videos are embedded, so if we comment out our summary, uncomment our article.prettify, run this, and then find our video, it should be in an iframe, which is right here. The src attribute of this iframe points to the embedded version of the video; it's not the direct link to the video itself. But if you know how YouTube videos work, they all have a video ID, and the ID for this video is actually right here (I just highlighted it). Now, the question mark in the URL specifies where the query parameters start, so it's not part of the video ID. With that ID we could actually create the link to the video ourselves, so we need to parse the ID out of that URL. First we need to grab the URL from the iframe. Just like before, let's comment out our article HTML, go down below our summary, and say vid_src = article.find, because we want to find an iframe with a specific class. We can see that this iframe has a class of youtube-player, so I'm just going to copy that, and we'll find an iframe with class_ (remember the underscore) equal to 'youtube-player'. Now let's print out what we have so far; I'm going to get rid of those spaces, and this should be the HTML for that iframe. If we run this, we can see that we have the HTML for the iframe. Now, unlike what we've been doing before, we don't want to grab the text from this tag; what we really want is the value of its src attribute. If you want to get the value of an attribute of a tag, you can access the tag like a dictionary. So at the end here, after we grab that iframe, we can just access it like a dictionary and say that we want the 'src' attribute of that tag. If I save that and run it, now we can see that we got the link to that embedded video.
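The dictionary-style attribute access can be sketched like this, with a hypothetical iframe (the video ID in the URL is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical embedded-player iframe like the one found on the page.
html_doc = ('<iframe class="youtube-player" '
            'src="https://www.youtube.com/embed/abc123XYZ?version=3&rel=1">'
            '</iframe>')

soup = BeautifulSoup(html_doc, 'html.parser')

# Attribute values are read with dictionary-style access on the tag,
# unlike .text, which would give the (empty) contents of the iframe.
vid_src = soup.find('iframe', class_='youtube-player')['src']
print(vid_src)
```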
Now we're going to have to parse this URL string to grab the ID of the video, and we'll break this up into several lines. First, we can see that the ID comes after a forward slash, so let's split up this string based on forward slashes. If I go down another line, I can say vid_id = vid_src.split, and we want to split on a forward slash. Now let me print this out so you can see what this does (and let me comment out printing vid_src), save that, and rerun it. If you've never used the split method on a string, it basically just splits the string into a list of values based on the character you specify. So we can see that our URL is now broken into a list of several parts, based on where those forward slashes were. If we look at the items in our list, our video ID is right here, because it was right after a forward slash. Counting indexes 0, 1, 2, 3, 4, it's at index 4, so let's specify that we want index 4 of the returned list: right after that split method we can just say that we want index 4. Now if we run this, we can see that we're getting closer: we have the video ID, but we still have these query parameters on the end. Like I said before, the question mark specifies where the parameters of the URL begin, and the video ID is before that, so if we do another split on the question mark, it should separate those out. I'll go to a new line so that we're not making this one too long or too complicated, and we can just say vid_id = vid_id.split on the question mark. If we save that and run it, that got split up: our video ID is the first item of the list, and the query parameters are the second item. So to grab the video ID we can just take index 0 of the returned list; right after that, I'll just say that I want index 0. If I save that and run it, we can see that we got the video ID. Now, I know that was a lot of parsing, but sometimes website source code doesn't have the information you want in the most accessible way, so I wanted to show you how you might go about getting the data you want, even if it's a little bit messy. Okay, so now we can create our own YouTube link using this video ID.
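The two-step split can be sketched like this, using a hypothetical embed URL of the same shape as the iframe's src attribute:

```python
# Hypothetical embed URL like the iframe's src attribute.
vid_src = 'https://www.youtube.com/embed/abc123XYZ?version=3&rel=1'

# Splitting on / breaks the URL into:
# ['https:', '', 'www.youtube.com', 'embed', 'abc123XYZ?version=3&rel=1']
vid_id = vid_src.split('/')[4]

# The ID sits before the ?, so a second split strips the query parameters.
vid_id = vid_id.split('?')[0]
print(vid_id)  # abc123XYZ
```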
So the way YouTube links are formatted is like this. I'll comment out the video ID for now and scroll down here a little bit. We can just call this variable youtube_link, and we will set this equal to a formatted string: this will be https, then youtube.com, then a forward slash and the watch route, and then the query parameter here is going to be a question mark with v, which stands for video, so v equals that video ID; I will just put in a placeholder there with that video ID. So if we print out this YouTube link that we just created, and I save that and run it, you can see that now we have this YouTube link. Now, I used f-strings to format that string, but those are only available in Python 3.6 and above. If you're using an older version of Python, you can use the format method on that string to insert that placeholder, and I have a separate video on how to format strings if anyone needs to see how to do that; I'll leave a link to that video in the description section below. Now that we've run this and got the link that we created, if I copy it and paste it into my browser over here, we can see that it goes directly to the video that we specified. OK, perfect.
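Both versions of that string formatting look roughly like this (the video ID is the same hypothetical one as before):

```python
vid_id = '6tNS--WetLI'  # hypothetical video ID from the previous step

# Python 3.6+: f-string with the ID filled into the v= query parameter
youtube_link = f'https://youtube.com/watch?v={vid_id}'

# Older Pythons: the format method instead of an f-string
youtube_link_legacy = 'https://youtube.com/watch?v={}'.format(vid_id)

print(youtube_link)
```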
So we've scraped all the information that we wanted from that first article. Just like in our earlier example with the simple HTML, now that we've got the information for one article, we can loop over all the articles and get that information for all of them. To do that, we can just uncomment the code that we grabbed here for the summary, and uncomment the code for the headline, and I can remove our commented-out print statements here just to clean things up a bit; let me remove our prettify-article print statement there. OK, so just like we did before, instead of just finding the first article, now we want to find all of the articles, so we can use the find_all method instead, and remember this returns a list of all of those articles. Instead of just setting that equal to one variable called article, we can put in a for loop: we can say for article in that list, be sure we put in that colon there, and now we have to put all of this information within our for loop, so we will indent that over and save it. Just like I did in our earlier example, right here at the bottom I'm also going to put a blank print statement just so that it separates out the information from all of our articles. So now if I run this, and pull our output up here a little bit and scroll up to the top, we can see that we got the headline for our first article, the text summary for our first article, and the link to that YouTube video, and we did this for all the articles on the web page.
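Put together, the loop looks roughly like this. The inline HTML, the class names (entry-content, youtube-player), and the tag layout are assumptions standing in for the real page source described in the video:

```python
from bs4 import BeautifulSoup

# Hypothetical page source: two <article> blocks shaped like the posts
# described above (heading link, summary paragraph, embedded video).
html = '''
<article>
  <h2><a href="#">Python CSV Module</a></h2>
  <div class="entry-content"><p>Reading and writing CSV files.</p></div>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/abc123?version=3"></iframe>
</article>
<article>
  <h2><a href="#">Python OS Module</a></h2>
  <div class="entry-content"><p>Navigating the file system.</p></div>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/def456?version=3"></iframe>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag, so we loop instead of taking one
for article in soup.find_all('article'):
    headline = article.h2.a.text
    summary = article.find('div', class_='entry-content').p.text
    vid_src = article.find('iframe', class_='youtube-player')['src']
    vid_id = vid_src.split('/')[4].split('?')[0]
    print(headline, summary, vid_id)
    print()   # blank line separating articles
```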
OK, perfect! So now we can see that that works: it's getting all the information from the latest articles on the home page of the website. We're almost finished up, but let me show you a couple more things. Sometimes you'll run into situations where you're missing some data, and if that happens then it could break our scraper. Maybe you're pulling down a list of items and one is missing an image or something like that that you thought would be there. To show what this looks like, I'm going to edit one of my posts here and remove the link to one of the YouTube videos. Instead of having you watch me log in to my web page to do this, I'm just going to fast-forward this video a bit and skip to the point where I've edited this post. OK, so I logged in and edited my page so that there is no longer a video link for the post a couple of posts down here; you can see that this post does not have a video associated with it. So now if I go back to our code that was just working before and try to rerun it, we can see that it gets the first post just fine: it gets the title, the summary text, and the YouTube video link. But for the second post, it gets the title and the summary text, and then when it gets to the YouTube link it breaks our script with the error "'NoneType' object is not subscriptable". Basically, it's breaking on this line here where it's trying to find that iframe with the youtube-player class. So if you run into something like this and you just want to
skip past any missing information, what we can do is put that part of the code into a try/except block. I'm going to pull down our output a little bit here, and here at the bottom I'm just going to create a try/except block; within Sublime Text this has autocomplete, so I just click there for the try/except and it gives me a little template. Within the try, we want to take all of the code that gets that video information and put it in our try block, so I'll just paste all that in and indent it correctly there. I meant to cut that out, so I need to delete all of that, and let's get this print moved out; we will put that below the try/except block. OK, so the way that we have this set up right now, this youtube_link variable will only get set if this succeeds; if it fails, then it's going to go to our except block. Now, sometimes people will just put in pass if they want to skip over this, but in our case we still want this youtube_link variable to be set, so instead of just passing here, let's set this youtube_link variable equal to None, just to say that we couldn't get that YouTube link. OK, so now with that code within a try/except block, let me make our output a little bit larger here. Now if we save that and run it, we should get all of the information on our page. Our top post here still works fine: we got the title, the summary text, and the YouTube link. And for our second post, which has the missing video, we still have the title and the summary text, and the video link is just set to None; then it just continues on with the other posts after that. So that's what we wanted: the video was missing, but it didn't break our scraper.
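In code, that guard looks something like the sketch below. The lone `<article>` snippet is made up to mimic a post with no embedded video, so find() returns None and the `['src']` lookup raises the same error as in the video:

```python
from bs4 import BeautifulSoup

# Hypothetical post with no iframe, mimicking the edited post above
html = '<article><h2><a href="#">Post with no video</a></h2></article>'
article = BeautifulSoup(html, 'html.parser').article

try:
    # find() returns None here, so subscripting raises
    # TypeError: 'NoneType' object is not subscriptable
    vid_src = article.find('iframe', class_='youtube-player')['src']
    vid_id = vid_src.split('/')[4].split('?')[0]
    youtube_link = f'https://youtube.com/watch?v={vid_id}'
except Exception:
    youtube_link = None  # record the miss instead of crashing the loop

print(youtube_link)
```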
It still went and got the information for all the other posts on the page. OK, so now we're done scraping the information, so I'm just going to bump up the font in Sublime Text here so that we can see everything a little bit larger, and scroll up to the top. Now that we've scraped the information that we want from our web page, we can save it in any way that we'd like. Right now we're just printing this information out to the screen, and maybe that's fine for your needs, but you can also save it to a file, or save it to a CSV, or anything that you'd like. For example, real quick, let's say that we wanted to scrape this page and save that information to a CSV file. We've already done the hard part of getting the information that we want from the web page; now to save it to a CSV file, we can simply import the csv module, so we'll import csv. Then here at the top, right before our for loop, we can open a CSV file: we'll just create a variable here called csv_file, and we'll set this equal to open, and we want to call this cms_scrape.csv (you can call that whatever you'd like), and we want to write to this file, so we'll pass in a 'w' for that. Now, this video isn't about working with files or CSVs; I do have a
separate video going into detail about how to work with CSVs, but for this video we'll just walk through it really quickly, so I'm not going to go into much detail here. We could use a context manager here, but the way that we currently have our script set up, I think it'll be a little quicker to just set this variable and open the file like this. Now we can write some lines to set up our CSV file, and again I'm not going to go into a lot of detail here; I have a separate video on this if you're interested. We can say csv_writer is equal to csv.writer, the writer method of that csv module, and we want to pass in that csv_file that we just opened. Now we want to write the headers of this CSV file, so we can take the csv_writer that we just created and do a dot writerow, and we can pass in a list of values that we want to write to this row; we're just passing in the headers for now. Our headers are going to be headline, and summary, passed in as text, and also video link. Those are the headers to our CSV file, which are basically the column names; that's the data that we're going to be saving to this CSV. Now, within our for loop where we're getting that scraped information, we can just write that information to our CSV file. At the very bottom of our loop, after we print that blank line, we can write that data to our CSV with each iteration through our for loop: we can say csv_writer dot writerow, and we're going to pass in a list here, and the values are going to be our headline first, then our summary second, and then our YouTube link third. Lastly, at the very end of our script, outside of the for loop, since we didn't use a context manager to open that file, we need to close it at the end of the script; we can say csv_file (not csv_writer; this is the actual CSV file) dot close. So now I can run this code.
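The CSV steps just described can be sketched like this, with made-up rows standing in for the scraped data (the newline='' argument is an addition here; the csv docs recommend it to avoid blank rows on Windows):

```python
import csv

# Open the output file for writing, as described above
csv_file = open('cms_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)

# Header row: the column names for the data we're saving
csv_writer.writerow(['headline', 'summary', 'video_link'])

# Hypothetical scraped results; in the real script these come
# from the for loop over the articles
posts = [
    ('Python CSV Module', 'Reading and writing CSV files.',
     'https://youtube.com/watch?v=abc123'),
    ('Post with no video', 'Summary text only.', None),
]
for headline, summary, youtube_link in posts:
    csv_writer.writerow([headline, summary, youtube_link])

csv_file.close()  # no context manager was used, so close explicitly
```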
You can see that it prints out all the information like it did before, but now if I open up my sidebar here, we can see that we have this cms_scrape.csv file in the sidebar. I'm going to open this within Finder, which is just within the file system, and open it with any kind of spreadsheet application; mine is Numbers, but yours might be Excel. Now we can see that we have all this data available within our spreadsheet, so let me maximize this and make it a little bit more readable: I'll make the columns a little bit smaller there and then wrap the text in all of our cells, so we can see all this information. Here are our headers: headline, summary, and video link. Here are all of our headlines parsed out for us, and our summaries, and then you can see in the video links, with that second post where the video was missing, this got written in as blank there, since there's a None value there. OK, so now I can exit out of that and pull our script back up.
OK, so I think that is going to do it for this video. Hopefully now you have a pretty good idea of how you can go out and scrape information from websites. One thing that I do want to mention is that if you want data from a large website like Twitter or Facebook or YouTube or something like that, then it may be beneficial for you to see whether or not they have a public API. Public APIs allow those sites to serve up data to you in a more efficient way, and sometimes they don't appreciate it if you try to scrape their data manually; they'd rather you go through the public API. But it's usually those larger websites that have public APIs, so if you want data from a small or medium-size website, then likely you'll have to do something like we did here. I should also point out that you should be considerate when scraping websites. Computer programs allow us to send a lot of requests very quickly, so be aware that you might be bogging down someone's server if you aren't careful. So try to keep that in mind: after this tutorial, try not to go out and hammer my website with tons of requests through your program, and that goes for other websites too. Some websites will even monitor whether they're getting hit quickly, and they may block your program if you're hitting them too fast.
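One simple way to act on that "be considerate" advice is to sleep between requests. This is a sketch under stated assumptions: the fetch callable and the URLs are hypothetical stand-ins (in a real scraper, fetch would be something like requests.get):

```python
import time

def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing between requests so the
    scraper doesn't bog down someone's server."""
    results = []
    for i, url in enumerate(urls):
        if i:                    # no need to wait before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function instead of a real HTTP call
pages = fetch_politely(['/a', '/b'],
                       fetch=lambda u: f'<html>{u}</html>',
                       delay=0.1)
print(pages)
```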
Other than that, if anyone has any questions about what we covered in this video, feel free to ask in the comment section below and I'll do my best to answer them. If you enjoy these tutorials and would like to support them, there are several ways you can do that. The easiest way is to simply like the video and give it a thumbs up. It's also a huge help to share these videos with anyone who you think would find them useful, and if you have the means, you can contribute through Patreon; there's a link to that page in the description section below. Be sure to subscribe for future videos, and thank you all for watching.
Just another perfect video from Corey❤❤❤ Thank you man!
You da man, Corey! Great stuff.
Nice video with clear explanation
ConnectionError: HTTPConnectionPool(host='coreyms.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000029C009CBCD0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
This is the error I see whenever I try to retrieve the source code from your website.
Great work … covered a lot of knowledge in a very efficient and concise way… Thank you, Corey Schafer
What in case of diff div?? I have 4 articles with different div tag
Excellent Corey. Very comprehensive, well explained, step by step, easy to follow. You're a very talented teacher imho.
What if you split the url at the question mark? That's easier… your videos are always on point, you never fail us
One word. Legend.
🫂🫂🫂
Thank You Sir, nicely explained..
Thanks, Corey, for this awesome tutorial. It is definitely the best available.
I would really appreciate your tutorial on how to use Web API for Web scraping.
Once again, thanks.
You sir, by far, have the best structured videos for learning from. Great job there young feller!
how do you make money web scraping?
How use selenium in web scraping
You’re an amazing tutor. Thank you so much for sharing your knowledge 🙏🏽 I’ve learned so much from your channel and I’m looking forward to more tutorials
Love how you explain things so clearly. Corey you are the best!
Here your production thru the years…could not include the df.plot() but many thanks!
headline headline_cum
date
2013-12-31 3 3
2014-12-31 17 20
2015-12-31 45 65
2016-12-31 18 83
2017-12-31 34 117
2018-12-31 18 135
2019-12-31 27 162
please help …[<p>Welcome back. Just a moment while we sign you in to your……. it's saying this when I say paragraph=soup.find_all('p')
print(paragraph)
please help, what to do.. I'm stuck
As always, fantastic content!
when i try to scrape websites like Glassdoor or indoor it says access denied by cloudflare what should i do ?
17:00
Thanks
is this being done in command prompt or python terminal plese clarify
very understandable, thank you from a uni student making a project on web scraping
Great video, better than the others I watched. Thanks.
Great Explanation!!!!!!!!!!!!!!!!!
Thank You Sooooooooooo much.
5 years ago, this video was already a great tutorial. Now, it's even more valuable. Thanks for the great content!
Thank you
Corey is very clear and concise.
Hi Corey. You're the best. Could you pls continue your python series into aws, cloud computing, lambda, serverless, …
How to extract the data if paragraph<p> is not there. Like – You are accessing the data from Google Patent website. There is no <p> as such in the HTML. Please reply
Thank you so much, Corey… I have been following your tutorials. Really improved my concepts. In gratitude!