Dynamic Javascript Scraping – Web scraping with Beautiful Soup 4 p.4
- June 12, 2024
- Posted by: MainInstructor
- Category: Go JavaScript
![*](https://i0.wp.com/allprowebdesigns.com/wp-content/uploads/2024/04/1712239591_hqdefault.jpg?resize=480%2C360&ssl=1)
What is going on, everybody? Welcome to part four of our web scraping with Beautiful Soup 4 tutorial series. In this tutorial, we’re going to be talking about how to scrape dynamically updated information from a web page.

To begin, I have added some information to the parsememcparseface page. Underneath this picture you can see this JavaScript dynamic data test, and it just says “Look at you shinin!” It says that because we’re viewing it with a client, in a browser, and the browser is actually doing something that makes that show up.

Let’s look further. Viewing the source code and scrolling down, here is what we’re looking for. What’s happening is this paragraph tag is of class jstest, it’s got an ID of yesnojs, and the starting text is “y u bad tho?” But we didn’t see that; that’s just the starting text. Underneath it we’ve got a script, so JavaScript, and what it’s doing is finding the element by an ID. That ID is yesnojs, just here, and when it finds it, it sets the .innerHTML, so now the HTML says “Look at you shinin!” That’s what we want it to say.

So when Google Chrome, or whatever browser you are using, browses this page, it says “Look at you shinin!” It requested data from the server, and the initial information the server gave was, “No, no, between the paragraph tags, this is what you want to say.” But then it also said, “Here’s a script, though, by the way,” and then Chrome ran that script. So actually you ran that script. The server didn’t run that script; you ran that script.

Closing this out, if we wanted to parse those tags we would use the following code. You don’t need to write this out, although you can if you want; we’re going to use it towards the end. Hopefully, if you’ve been following along, you already have everything up to this, just without these two lines. Anyway, this is basically the code that we would use. We’re
going to parse paragraph tags with the class jstest. Simple enough. Let’s save and run that real quick, and we get “y u bad tho?” That’s not what we thought we were supposed to get.

A lot of times, maybe you’re going to parse a web page, maybe a table, and you’re expecting that table to have some data, and lo and behold it has nothing. The problem is you aren’t a client; you’re not a browser. So what we have to do is mimic being a client or a browser and actually run that JavaScript, which is a little more involved than you might think. Or maybe you are thinking that and realizing, “Oh no.”

There’s a whole lot of options at our disposal for how to do this. I think the easiest way is to use PyQt; specifically, we’ll be using PyQt4. I’m sure you can do it with PyQt5, but I’m just going to use PyQt4. I do have a tutorial on PyQt4. You don’t need to follow that entire tutorial, but you should go to the first page, and you’ll need to get PyQt4. If you’re on Windows, go to this URL here and download the wheel for PyQt4. There might be an installer now, or something really simple; you can check that out. If you’re on Windows you can also go to Riverbank Computing, and I think you can install it from there. It’s been so long that I don’t remember what method I used on Windows, but I have it, so it’s possible. If you’re having problems, leave a comment below and I’ll do my best to help you out.

Once you have PyQt4, you’re ready to rumble. We’re going to come up above these imports and start importing a few new things. First we’re going to import sys. We’re doing that because Qt applications want to be able to take system arguments, and it’s going to be angry if you can’t, so we’re going to do it. Then from PyQt4.QtGui (I’m going to copy this so I don’t typo it, but it’s capital P, lowercase y, capital Q, lowercase t, 4) we’re going to import QApplication. Then
again, and in fact I might as well copy all of that, from PyQt4.QtCore we’re going to import QUrl. By the way, I didn’t really explain it: QApplication is probably easy enough, it’s the thing for making applications. QUrl is how we’re actually going to read the URL, basically. And then finally, from PyQt4.QtWebKit (it’s a capital K in WebKit) import QWebPage. Lovely. This is going to let us actually load the page, act like a browser, act like a client, and run that JavaScript, saving us a ton of programming that would otherwise be involved. In theory you could even make the page show up; you can make your own web browsers in PyQt4, as you might be able to surmise at this point.

Anyway, cool. We have all the imports we need, and now what we need to do is write a Client class. If you’re not too up to speed on object-oriented programming, hopefully this one will be simple enough. I don’t yet, but very soon I should, have an object-oriented programming series, so if you are watching this sometime after the original upload date, you should check that out if you need help.

Anyway: class Client. Client is going to inherit from QWebPage, and we’re just going to make some slight modifications, just tiny ones. We define our __init__ method; it’s going to take self and a url of some kind. What we’re going to say is self.app is equal to QApplication, and it accepts sys.argv. Then we’re going to initialize the QWebPage: QWebPage.__init__(self). Then self.loadFinished.connect(self.on_page_load).

So we’re defining the application and, in theory, starting it up (it will have its own init as well). Then we are starting up the QWebPage, and then we’re connecting this method here: when the load is finished, what do we want
to do? Initially, without defining this Client, we could actually just work directly with a QWebPage object, access the URL, and then use this .mainFrame().toHtml() stuff that I’m going to show you really quickly. You could do that, but the problem is the HTML you would get would be nothing. It would be blank, because the page hadn’t actually loaded. PyQt4 is an asynchronous library, so it’s not going to hang and wait; it’s going to keep running your code. So when you ask what the HTML is, it gives you the HTML as soon as that code runs, before the load has finished. We don’t want that. Instead, when the load is finished, we want to run on_page_load, whatever on_page_load might be.

We need to define on_page_load. So define on_page_load, obligatory self, and then self.app.quit(). That’s all. Basically we just want this to run until the page loads, and then when the page loads, we’re done. Easy enough.

Now let’s use the Client. Let me just move this down for now; I’m going to pull some of this information. To start, we’ll say the url is equal to this. Cool. Then we’re going to say client_response is equal to Client, and we pass that url, which will go through the __init__ method. Then the source. Rather than the source code coming from urllib.request, we’re going to let Qt handle the source, which is kind of what we’re doing up here: we’re at least letting Qt load the page, and now we’re going to grab the source code from QWebPage. So the source is going to come from client_response, because client_response is a Client object, and Client inherits from QWebPage, so we can use the QWebPage methods now. So client_response.mainFrame(), and then, shoot, what is it, I think it’s .toHtml(), with a capital H. So that’s going to say, okay, this is
basically a QWebPage object; let’s get the main frame, the frame as in the thing that we’re looking at, and then convert it to HTML. You probably could do a .show() here. I don’t know if we’d actually be able to, and I don’t want to do it now; if this works, we’ll try a .show() after, just to see. Anyway, I’m pretty sure it’s not show on the source, and since we’re running the app I don’t think we could show here. I don’t know; I don’t think it’s going to work.

So the source we’ve handled. Otherwise everything else stays the same, so we don’t need to do anything else. Let’s save and run that. On print(js_test.text) we get: NoneType has no attribute text. What did we do wrong? I’m blaming you. Let’s see: Client, self.app.quit... oh, we didn’t finish. Okay, so stupid. We needed to tell it what to do when the loading is finished before we actually load anything, and what should have really given it away was the fact that we have url here and we never actually do anything with it. Great. So: self.mainFrame().load(QUrl(url)). We use load, we load the QUrl, and the QUrl takes the url. And then self.app.exec_().

Okay, let’s try it one more time. Boom, we got what we wanted. All right, so that is how you can scrape dynamic data.

Just for kicks, I want to see if I can actually show the page. I don’t think it’s going to work... yeah, I’m not going to mess with it anymore. No one came here to see it show, so I’m not going to waste any time trying to make that page show. But just know that PyQt4 is basically a browser, and you can make a browser really easily with PyQt4; with just a little bit more code we would be able to make it a browser.

So, a few things to note. One, basically, you know, it’s running, it’s running, it’s
running, it’s running... oh, there, we finally got a response. That took a while. Why did that take a while? We would have to look a little deeper, but with web scraping especially there are basically two pain points.

One pain point is loading PyQt4, loading up this WebKit stuff, loading up our browser client. That’s going to take some time to set all that up. Then we also have to process whatever JavaScript is there, and that’s going to take time. So that’s one thing.

The other thing, when you’re parsing websites, is just latency and response time at the server. A lot of people have asked me, as I was releasing this series, “What do we do about the fact that Beautiful Soup is slow?” Beautiful Soup is not really that slow; it’s a fairly efficient framework. The problem is that a server request and response is probably going to take 500 milliseconds or more. It’s not instant. So if you’re trying to crawl 500 URLs, that 500 milliseconds is suddenly 250 seconds for all 500 URLs. That’s a really long time to wait.

So there are two things you need to think about. If you’re mimicking a client, for that you’re going to need to utilize multiprocessing. Just like I don’t yet have a tutorial on object-oriented programming, I will have a tutorial on multiprocessing, probably within a couple of weeks, so stay tuned for that. So at least for using Qt and stuff like that, where the bottleneck is on your side, you should use multiprocessing. Now, if it’s a latency issue and you’re simply waiting for that response, that means your CPU is actually idle. You’ve got idle threads, thus you can use the threading module, and that will speed up the whole process. In reality, you’re probably going to need to use both. You’re going to want to make full use
because you’re going to be waiting on other people’s processing sometimes, and many times it’s just going to be a bottleneck of your own processing. So keep those things in mind if you’re having speed issues. Otherwise, that should work for you guys. If you have questions, comments, concerns, whatever, feel free to leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.
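
The latency arithmetic near the end of the video (500 URLs at roughly 500 milliseconds each comes to about 250 seconds when fetched one at a time) and the suggestion to use threads while waiting on the server can be sketched roughly as follows. This is only an illustration under stated assumptions: no real requests are made, `fake_fetch` is a hypothetical stand-in that sleeps instead of contacting a server, the URLs are placeholders, and `concurrent.futures.ThreadPoolExecutor` is used here as a thread-based alternative to working with the `threading` module directly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Shortened stand-in for the ~0.5 s request latency mentioned in the video.
SIMULATED_LATENCY = 0.05  # seconds


def fake_fetch(url):
    """Hypothetical fetch: sleep to imitate waiting on a server response."""
    time.sleep(SIMULATED_LATENCY)
    return url, len(url)


def fetch_sequential(urls):
    """Fetch one URL after another; total time is roughly n * latency."""
    return [fake_fetch(u) for u in urls]


def fetch_threaded(urls, workers=10):
    """Overlap the waiting by running the fetches in a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fake_fetch, urls))


# Placeholder URLs, used only as labels; nothing is actually requested.
urls = ["https://example.com/page/%d" % i for i in range(20)]

# The arithmetic from the video: 500 URLs at 0.5 s each, one at a time.
estimated_seconds = 500 * 0.5

start = time.perf_counter()
sequential_results = fetch_sequential(urls)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
threaded_results = fetch_threaded(urls)
threaded_time = time.perf_counter() - start

print("estimated sequential time for 500 URLs: %.0f s" % estimated_seconds)
print("sequential: %.2f s, threaded: %.2f s" % (sequential_time, threaded_time))
```

Threads help here because the time is spent idle, waiting on a response. When the time goes into rendering pages and running JavaScript on your own machine, separate processes are the tool the video points to instead.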
If it is possible I would like give this video thousands of likes
Why can't you just parse the script tag instead of the p tag?
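
For a page as simple as the one in the video, where the script assigns a plain string literal, reading the script tag is possible in principle. The sketch below uses only the standard library; the `PAGE` string is an assumed reconstruction of the markup described in the video (the real page may differ), and `TextCollector` is a hypothetical helper, not part of the tutorial's code.

```python
import re
from html.parser import HTMLParser

# Assumed reconstruction of the relevant markup, based on the video's
# description; the actual page source may differ.
PAGE = """
<p class="jstest" id="yesnojs">y u bad tho?</p>
<script>
document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
</script>
"""


class TextCollector(HTMLParser):
    """Collect the static text of <p class="jstest"> and the bodies of <script> tags."""

    def __init__(self):
        super().__init__()
        self.current = None   # which tag we are currently inside, if any
        self.paragraph = ""   # text the server sent inside the paragraph
        self.scripts = []     # raw JavaScript source, one entry per script tag

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "jstest":
            self.current = "p"
        elif tag == "script":
            self.current = "script"
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag in ("p", "script"):
            self.current = None

    def handle_data(self, data):
        if self.current == "p":
            self.paragraph += data
        elif self.current == "script":
            self.scripts[-1] += data


parser = TextCollector()
parser.feed(PAGE)
parser.close()

# Look for the string literal assigned to innerHTML for this specific ID.
pattern = re.compile(r"getElementById\('yesnojs'\)\.innerHTML\s*=\s*'([^']*)'")
match = pattern.search("".join(parser.scripts))

static_text = parser.paragraph
scripted_text = match.group(1) if match else None

print(static_text)
print(scripted_text)
```

This only works because the new text is written out literally in the script. Once the value is computed, fetched from another URL, or assembled by other code, a pattern match like this has nothing to find, and actually running the JavaScript becomes the practical route.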
But this isn't working with PyQt5, and I'm unable to install PyQt4. What's the solution?
Spyder is not launching after installing PyQt
How do I resolve a content security error?
I'm scraping a LinkedIn page.
in 2021 I am unable to install PyQt4 on the latest version of Python 3.9. I use PyCharm under Windows 10 and just can't figure out how to get it to install. Any ideas would be greatly appreciated.
Could you please make an update video of this? PyQt has had a few updates, or there are other modules to use. I'm trying to do it using selenium because I feel like it is the best for what I want, but I just can't pass the "verify your identity" bs since webdriver doesn't take headers, and I haven't found a different way to do it. Thank you!!!
Using PyQt5, I'm getting Unresolved reference 'Client'. I have the same code as the tutorial.
I tried the PyQt5 equivalent to this, but I am not getting the expected results. The dynamic content still cannot be extracted. Any suggestions?
Great Tutorial Chum! Many thanks.
Hi, I have been working a lot lately on web scraping tasks, and I was using selenium as it required interaction with the web page. My question: is there a generic or more common way to extract any web page's content, instead of navigating and identifying the tags that hold the required information? If not, why?
Also looking for how to control sending multiple requests to a server at a time while trying to fetch the data so that it would not stop taking my requests.
I am running the code below and getting an error. Does anyone have any idea?
Code:
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage as QWebView
import bs4 as bs
import urllib.request
class Browser(QWebView):
    def __init__(self,url):
        self.app =QApplication(sys.argv)
        QWebView.__init__(self)
        self.loadFinished().connect(self.on_page_load)
        self.load(QUrl(url))
        self.app.exec_()
    def on_page_load(self):
        self.app.quit()
url='https://pythonprogramming.net/parsememcparseface/'
client_response= Browser(url)
source = client_response.page().toHtml()
soup=bs.BeautifulSoup(source,'lxml')
js_test=soup.find('p', class_='jstest')
print(js_test)
Error:
Traceback (most recent call last):
  File "C:/Users/Vinit/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/PyQt4.py", line 21, in <module>
    client_response= Browser(url)
  File "C:/Users/Vinit/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/PyQt4.py", line 13, in __init__
    self.loadFinished().connect(self.on_page_load)
TypeError: native Qt signal is not callable
It's better to use selenium webdriver (headless) instead of using PyQt to run JavaScript…
Selenium seems like a better option for scraping dynamic webpages
I was just searching for a problem with this and BAM, you have an entire series on web scraping. I think it's the 5th time this has happened. Just sayin', really appreciate your channel.
You made my day. It's better than all of this https://www.scrapehero.com/wp/wp-content/uploads/2018/06/open-source-web-scraping-tools-1.png
I like this way. It has good potential and is as simple as possible.
Thanks for sharing this video. It got me closer to my goal than anything, I think. But I'm still not seeing the values that get populated in the table on the web page in question. Maybe you could head over to stackoverflow or tell me what I am doing wrong through some other medium? If I print the js_test.text, I get nothing. If I print just js_test then I get the <input…> from the HTML (no js result included). I really just want the value that is inserted from the js. Thanks
https://stackoverflow.com/questions/59849841/web-scraping-a-js-page
I've done something similar yesterday with PyQt5. I've combined html, javascript and python into one app (and some css goodies)
Dec 2019 Last example working from: https://stackoverflow.com/questions/47173791/cannot-use-qurl
import sys
import urllib.request
from bs4 import BeautifulSoup
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl, pyqtSignal, QEventLoop
from PyQt5.QtWebEngineWidgets import QWebEnginePage
class Client(QWebEnginePage):
    toHtmlFinished = pyqtSignal()

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

    def store_html(self, html):
        self.html = html
        self.toHtmlFinished.emit()

    def get_html(self):
        self.toHtml(self.store_html)
        loop = QEventLoop()
        self.toHtmlFinished.connect(loop.quit)
        loop.exec_()
        return self.html

url = 'https://pythonprogramming.net/parsememcparseface/'
client_response = Client(url)
source = client_response.get_html()
#print(source)
soup = BeautifulSoup(source, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)
Why not use selenium
How do I scrape content of pseudo elements like ::before and ::after?
How can I get the source code shown in this video? It would be faster than retyping it all 🙂 Thanks