Dynamic Data Web Scraping with Python, BeautifulSoup




I am trying to extract a number from the HTML of many pages; the value is different on each page. I expected soup.select('span[class="pull-right"]') to return the number, but it only returns tags that do not contain it. I believe this is because the page fills in the value with JavaScript. In the HTML below, 180,476 is the value I want, and I need it for many pages:


<div class="legend-block--body">
<div class="linear-legend--counts">
Pageviews:
<span class="pull-right">
180,476
</span>
</div>
<div class="linear-legend--counts">
Daily average:
<span class="pull-right">
8,594
</span>
</div></div>



My code (this runs in a loop over many pages):


import bs4
import requests

# wiki_page is the URL of the current page in the loop.
res = requests.get(wiki_page, timeout=None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab = soup.select('span[class="pull-right"]')
print(ab)



output:


[<span class="pull-right">\n<label class="logarithmic-scale">\n<input class="logarithmic-scale-option" type="checkbox"/>\n Logarithmic scale </label>\n</span>,
<span class="pull-right">\n<label class="begin-at-zero">\n<input class="begin-at-zero-option" type="checkbox"/>\n Begin at zero </label>\n</span>,
<span class="pull-right">\n<label class="show-labels">\n<input class="show-labels-option" type="checkbox"/>\n Show values </label>\n</span>]



Example URL: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi



I want the Pageviews value.





You probably need selenium to extract that value
– Rakesh
Aug 23 at 11:24







wow....whats the difference ? .stackoverflow.com/questions/51982930/…
– Sarthak Negi
Aug 23 at 11:56




2 Answers



The JavaScript on the page is not executed when you retrieve it with requests.get, so the value is never filled in. Use Selenium instead: it opens the page in a real browser, as a user would, so the JavaScript runs.



To get started, install Selenium with pip install selenium. Then retrieve the value with the code below:




from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
# Retry element lookups for up to 20 seconds while the page's JavaScript runs.
browser.implicitly_wait(20)

# List of (page URL, CSS selector of the element to retrieve) pairs.
wiki_pages = [
    ("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
     ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),
]
for url, selector in wiki_pages:
    browser.get(url)
    # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
    # which Selenium 4 removed.
    page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
    print(page_views_count.text)
browser.quit()



NOTE: If you need to run the browser headless, consider PyVirtualDisplay (a wrapper for Xvfb), which lets you run WebDriver without a visible display. See 'How do I run Selenium in Xvfb?' for more information.
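
If the lookup fails because the value has not been rendered yet, an explicit wait is one option. The sketch below polls until the element's text is non-empty and then converts it to an integer. parse_count and fetch_pageviews are hypothetical helper names, the 20-second timeout is an arbitrary choice, and the code assumes the element's text is a plain count such as 180,476.

```python
def parse_count(text):
    """Turn a formatted count such as '180,476' into the integer 180476."""
    return int(text.strip().replace(",", ""))


def fetch_pageviews(browser, url, selector, timeout=20):
    """Load url, wait for the element's text to appear, and return it as an int."""
    # Imported lazily so parse_count stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    browser.get(url)

    def rendered_text(driver):
        # find_element raises NoSuchElementException until the node exists;
        # WebDriverWait ignores that exception by default and keeps polling.
        text = driver.find_element(By.CSS_SELECTOR, selector).text.strip()
        return text or False  # an empty string means "not rendered yet"

    # until() returns the first truthy value produced by rendered_text.
    return parse_count(WebDriverWait(browser, timeout).until(rendered_text))
```

With a browser created by webdriver.Firefox(), calling fetch_pageviews(browser, url, selector) for each (url, selector) pair would return the count as a number rather than a string. If the element shows non-numeric text, parse_count raises ValueError.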





I believe the selector is not correct because it says it can't find the location
– Gokce
Aug 24 at 8:45





I got the idea. I am on the right track for sure. But the above .find_element_by_css_selector(selector) and my other trials always give: <selenium.webdriver.remote.webelement.WebElement (session="195cd940ff11cbbe3ac113970aaa3025", element="0.49495071437839355-1")>. EDIT: I got it, you gotta make the page load up first, so just add these two lines: browser.maximize_window() and browser.implicitly_wait(20)
– Gokce
Aug 24 at 8:52



You should try the Selenium package for Python.
It requires you to download a driver for whatever browser you are using (for example, geckodriver for Firefox).
You will then be able to use Selenium to pull values out of the rendered HTML.


from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi")
# find_element returns only the first element with this class.
element = driver.find_element(By.CLASS_NAME, "pull-right")
# Or use one of the other locator strategies:
# element = driver.find_element(By.NAME, "html element name")
# element = driver.find_element(By.ID, "html ID name")
# element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")
# Print the visible text; printing the element itself only shows the WebElement object.
print(element.text)
driver.quit()
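
Several elements on the page share the pull-right class (the option labels in the question's output use it too), so one way to pick out the counts is to collect the text of every match and keep only the strings that look like numbers. This is a minimal sketch: numeric_counts is a hypothetical helper, and the commented usage lines assume a Selenium driver that has already loaded the page.

```python
import re

# Matches strings made only of digits and thousands separators, e.g. "180,476".
COUNT_PATTERN = re.compile(r"^\d[\d,]*$")


def numeric_counts(texts):
    """Return the count-like strings in texts as integers, in their original order."""
    counts = []
    for text in texts:
        cleaned = text.strip()
        if COUNT_PATTERN.match(cleaned):
            counts.append(int(cleaned.replace(",", "")))
    return counts


# Usage with Selenium (assumes `driver` has already loaded the page):
#   from selenium.webdriver.common.by import By
#   texts = [e.text for e in driver.find_elements(By.CLASS_NAME, "pull-right")]
#   print(numeric_counts(texts))
```

Filtering by text avoids depending on the order in which the matching elements appear, at the cost of assuming that only the counts are purely numeric.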





