Tables and XML – Web scraping with Beautiful Soup 4 p.3
What's going on everybody, and welcome to part three of our web scraping with Beautiful Soup mini series. In this tutorial, what we're going to be talking about is scraping tables and, if we have time, XML documents. So let's jump in.

I'm going to go ahead and delete from here. Looking at the page, this is the table that we're going to try to parse. Looking at the source, just in case anybody's not too familiar with HTML: a table basically starts with a table tag, and everything in between the table tags has tr tags for table rows. Within the row here we have th tags for table headers, as the header of the table, and the rest of this is just td tags for table data. We're going to try to pull just the table data from here.

The way we're going to do that is first by defining table, and you can do this a couple of ways. Remember before, for the navbar, we said soup.nav? You can do the same thing here: table = soup.table. We can print table, save and run that, and we get the table information. We can also say table = soup.find('table'), so we're just going to overwrite it there. It's the same thing, so you can use either of those two ways, and at least so far we're not seeing any difference. I'm going to comment out that first one; if I remember to, I'll run it with both at the end just so you can see.

Now we're going to say table_rows = table.find_all('tr'), because we're looking for all the table rows. We could do table.tr or table.find('tr'), but that's just going to find one of them, and we want all of them. Then we say for tr in table_rows, and what do we want to do? We want to find the table data between the tr tags, so td = tr.find_all('td'). For the row, we're just going to make a quick one-liner for loop: row = [i.text for i in td]. When we're all done, let's print the row and run that. Great, so we get all the table data here. You'll notice this one's empty. That's because that row is the table header: it sits between tr tags, but it holds th tags rather than td tags.

Before we progress any further, I'll show you the pandas version of grabbing tables. I think it's a lot better, and it's what I'll usually use, so if we're talking about scraping tables I definitely need to show it. If you don't have pandas, you can pip install pandas, but it can take a long time to install, so if you don't have it and it's not interesting to you yet, you don't have to grab it. pandas is a data analysis library, and if you're interested in pandas tutorials, I have a bunch of them.

Up at the top I'm going to import pandas as pd, and then I'm going to comment all this out and say dfs, for data frames, equals pd.read_html, and we just pass the link in there. What this is going to do is go to whatever website you put in, parse all of the tables it can find, and return a list of data frames, because there might be multiple tables. Then, for df in dfs, let's print df.head(). Actually, it should be short enough, so let's just print the entire data frame. Cool, so we get the whole thing. When we call read_html we can also say header=0, and that'll make the first row the header. Yes, okay.

So that's how you can use pandas to read tables, and I think that's so much simpler. You can of course convert this to a list of lists with df.values.tolist() if you wanted, but it's much easier to sort, manipulate, and run calculations on a data frame than on a bunch of lists. Anyway, I thought I'd show that.

Finally, let's get on to XML documents. If you're not familiar with what an XML document is, usually you're going to see these in the form of sitemaps. I put a link to one at the bottom here. Sitemaps are basically maps of all the URLs on your website. There will be some information here, but as you'll notice, it's just between tags. XML was meant to be slightly more human readable, so it's both human and machine readable. Here you can see basically all the links for pythonprogramming.net.

A lot of times people use sitemaps on news websites, because this is where you can find the newest links. Let's say you go to the Washington Post: the Washington Post sitemap is going to have all the links for the Washington Post. So let's go to the Washington Post and see what we find. It's probably at the bottom. Wow, there's no... okay. It's probably RSS. Yes, I agree. Dang it, I thought this was going to be quicker. Why do they like hiding it? The Washington Post sitemap is probably really obvious and I'm just missing it. Okay, so here's one sitemap at least. This is just their main sitemap, and I'm not going to waste much time looking, but usually news websites will have sitemaps even for specific topics like politics news. So if you wanted to have some sort of bot that was constantly tracking news, you would just track those sitemaps.

Closing that out, let's talk about reading the sitemap. There's one slight difference in the soup. I'm going to copy this, and in fact let's just do it up here: paste it, uncomment it, and rather than parsememcparseface the URL is going to end in sitemap.xml. Then, rather than using 'lxml' when we create the soup, we're going to say 'xml'. Just so we're confident here, let's print soup. Cool, so we got everything we need.

We know that each entry is between url tags. You might want lastmod: a lot of times they'll have dates, so to figure out whether you've visited a link already, you could use the date, like on a news website. But basically what we're interested in is the loc tag. So all we need to do is for url in soup.find_all('loc'), print(url.text). And these are just all the pythonprogramming.net URLs. So that's all it takes with Beautiful Soup.

Also, I just realized I never ran the second version of table, so let's run that quickly. Using just soup.table, you get the same thing. I've always been able to use those interchangeably. I wish I could tell you what the difference was. I'm sure there is one, because I don't think both would be there if there was no difference, but if someone knows the difference, feel free to comment below.

That concludes the third Beautiful Soup web scraping tutorial. If you have questions, comments, concerns, suggestions, whatever, feel free to leave them below. Otherwise, till next time.
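
The table-row loop described in the transcript can be sketched roughly as follows. This is a minimal, offline sketch and not the exact code from the video: the HTML string is made-up sample markup standing in for the tutorial page, and it uses Python's built-in "html.parser" so the example runs without lxml. The th lookup at the end is an addition showing one way to read the header cells that the td loop skips.

```python
from bs4 import BeautifulSoup

# Assumption: made-up sample markup standing in for the page used in the video.
HTML = """
<table>
  <tr><th>Program Name</th><th>Internet Points</th><th>Kittens?</th></tr>
  <tr><td>Python</td><td>932914021</td><td>Definitely</td></tr>
  <tr><td>Pascal</td><td>532</td><td>Unlikely</td></tr>
</table>
"""

# Assumption: "html.parser" is swapped in for the parser used in the video.
soup = BeautifulSoup(HTML, "html.parser")

# soup.table is the attribute-style spelling shown in the video.
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):
    td = tr.find_all("td")          # only <td> cells, so the header row yields []
    row = [i.text for i in td]      # the one-liner from the transcript
    rows.append(row)
    print(row)

# The header cells live in <th> tags, so they need their own lookup.
headers = [th.text for th in table.find_all("th")]
print(headers)
```

The first printed row is an empty list because the header row contains no td tags, which matches what the transcript points out.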
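
The pandas approach from the transcript might look something like this. Again, this is a sketch under assumptions: the HTML is made-up sample markup read from a string through StringIO so it runs offline, whereas the video passes a URL straight to pd.read_html. It also assumes pandas has an HTML parser backend such as lxml available.

```python
from io import StringIO

import pandas as pd

# Assumption: made-up sample markup; the video passes a page URL instead.
HTML = """
<table>
  <tr><th>Program Name</th><th>Internet Points</th><th>Kittens?</th></tr>
  <tr><td>Python</td><td>932914021</td><td>Definitely</td></tr>
  <tr><td>Pascal</td><td>532</td><td>Unlikely</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
# header=0 uses the first row as the column names.
dfs = pd.read_html(StringIO(HTML), header=0)

for df in dfs:
    print(df)

# With several tables on a page, pick one by its position in the list.
first = dfs[0]

# Convert to a plain list of lists if a DataFrame is not wanted.
rows = first.values.tolist()
print(rows)
```

Because the result is a list, a page with several tables simply yields several data frames, and indexing into dfs selects the one of interest.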
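
Reading a sitemap as described could be sketched like this. The sitemap text and the example.com URLs are invented for illustration so the sketch runs offline; the video downloads a real sitemap.xml instead. Note that Beautiful Soup's "xml" feature relies on lxml being installed.

```python
from bs4 import BeautifulSoup

# Assumption: a tiny invented sitemap; the video fetches a real sitemap.xml.
SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
    <lastmod>2024-02-15</lastmod>
  </url>
</urlset>
"""

# "xml" instead of "lxml": parse the document as XML rather than HTML.
soup = BeautifulSoup(SITEMAP, "xml")

# Every page address sits inside a <loc> tag.
locs = [url.text for url in soup.find_all("loc")]
for loc in locs:
    print(loc)

# Pair each location with its last-modified date, for example to decide
# whether a page has changed since it was last visited.
entries = [(u.find("loc").text, u.find("lastmod").text)
           for u in soup.find_all("url")]
print(entries)
```

A bot that tracks new links could keep the lastmod values from a previous run and compare them against the current ones.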
Thank you. It's simple and works well.
Wait, wait wait wait… wait.
I've just spent the last three evenings f***ing around with trying to get BeautifulSoup to read a table properly, when:
dfs = pd.read_html("url")
does everything I need and works perfectly?!?!
brb, going to scream into the void.
wow 4 years old, and it is still being the best parsing video out there
❤👍
What to do if there's multiple tables in webpage?
God dam i spent 4 hours today trying to extract a complex table with bs4, used pandas and it was done in two minutes.. Appreciate this knowledge ty
How To Install Linux And Use In Windows Using Putty???
Make Video Of It Bro.
Your tutorials are really helpful. I have a question. I wanted to extract table from pdf file. How can I do that?
how is he doing that thing with the commands?
how did you comment out an entire block of code with just one click?!!? could someone please tell me
I love this!
Can anyone explain this line of code? row = [i.text for i in td]
table_rows = table.find_all('tr')
AttributeError: 'str' object has no attribute 'find_all'
I'm getting this error. Can someone help?
Very helpful video, thanks a lot!
Thanks much this helped me.
I have a request, I need to extract data after login into website so what do I specify in soup url ?
bruh you look like snowden!
I've been trying to get data from a table with some missing values and this worked perfectly. I had been struggling with it for hours, thanks so much 🙂
when I put the code in as in 7:25 my result is an error: bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?
I looked on the Internet and they said it was because lxml isn't installed, but when I put: pip install lxml in the command line it tells me: Requirement already satisfied
Would really appreciate some help or another way to do webscraping with XML code
Your videos are incredibly helpful on the topic and additional tips, thanks a lot!
thanks for keeping it simple for beginners, +1 sub
PANDA IS LIFE
Pls, don't beat the keyboard 🙂
Thanks so much for producing such a nice tutorial video. One question regarding parsing tables: what if there are multiple tables to be parsed in an HTML document?
How to parse table that is loaded in webpage after clicking few steps on webpage? And that table updated dynamically based on filters we choose
Thanks a lot
Thanks, helped a lot
Great little tutorial! I'm trying to implement something similar within a logic app / function app within Azure but having issues defining the source, as the HTML comes from an incoming email… any ideas?
Tell how to print th header
Great! It helps me to learn to use Beautiful Soup in an effective way.
Great stuff. I'm just learning Python and I needed to parse some tables for my script, to analyse the data. There are easy ways to do it using pandas, but I didn't want to use such a heavy library with so many dependencies for a rather simple task. Found some code examples on stackoverflow which were basically the same as you are showing here, but I couldn't wrap my head around that code. And I tend not to use the code I don't understand, on principle. Now that I've watched this, it's crystal clear to me. Cheers!
How to parse an XML file using Beautiful Soup
How can I download a document like a PDF?
I want to scrape a website that requires you to login. I've tried using RoboBrowser to login to the site and then BeautifulSoup to scrape the website but I only get the html of an error site saying that I must login to the site. Can anyone help me?
how do you save the data lol?
How can I scrape <td> of a <table> which is inside <div > tag ?
After you get your df from pandas, how do you select certain columns and then upload them into mysql database I already have setup with tables ready to go. I'm very new at python and programming in general so any tips or links to search will be much appreciated.
Love your work man, you my python guru 🙂
I appreciate you making these videos. Keep up the good work!