How to scrape a specific table from a website using Python (beautifulsoup4 and requests or any other library)?
https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to the website. I want to scrape the table titled "Fortune top 10 E.U. corporations by revenue (2016)".
Please share the code for this:
import requests
from bs4 import BeautifulSoup

def web_crawler(url):
    page = requests.get(url)
    plain_text = page.text
    soup = BeautifulSoup(plain_text,"html.parser")
    tables = soup.findAll("tbody")[1]
    print(tables)

soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
@FanMan Sorry for the trouble and for not posting my code; I am new to Stack Overflow. I didn't understand what you meant by your answer. Basically, I want to fetch the table and its contents. The Wikipedia link I provided has several tables, and I only want to fetch the one titled "Fortune top 10 E.U. corporations by revenue (2016)".
– Kali
Sep 10 '18 at 17:13
@FanMan Furthermore, in the for loop in your answer you use the variable text and call text.findAll inside the loop. I don't know why, but this doesn't work in my PyCharm: I can call findAll on soup (the BeautifulSoup variable) but not on text (which comes from soup).
– Kali
Sep 10 '18 at 17:17
I have added my answer. The answer you were referring to was not mine.
– FanMan
Sep 10 '18 at 18:39
2 Answers
Following what @FanMan said, this is simple code to help you get started. Keep in mind that you will need to clean it up and do the rest of the work on your own.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore = list()
# Collect the text fragments of every non-empty <p> element
for text in soup.findAll('p'):
    w = text.findAll(text=True)
    if len(w) > 0:
        temp_datastore.append(w)
Some documentation:
Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
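On the question raised in the comments about calling findAll on the loop variable: each item that find_all returns for a tag name is itself a Tag, and a Tag supports the same search methods as the top-level BeautifulSoup object (findAll is the older alias of find_all). The following is a minimal sketch on a made-up HTML snippet, not the Wikipedia page; the helper name paragraph_strings is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up sample markup, used only for illustration.
SAMPLE_HTML = "<div><p>First <b>bold</b> text</p><p>Second</p></div>"


def paragraph_strings(html):
    """Return, for each <p> element, the list of text fragments it contains."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for p in soup.find_all("p"):
        # Each match is a Tag, so it can be searched just like `soup`.
        assert isinstance(p, Tag)
        fragments = [str(s) for s in p.find_all(string=True)]
        result.append(fragments)
    return result


fragments_per_paragraph = paragraph_strings(SAMPLE_HTML)
```

If an editor flags the inner call, that is likely a type-inference limitation rather than a runtime error, though this is a guess since the comment does not show the actual message.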
Your first issue is that your url is not properly defined. After that, you need to find the table to extract and its class. In this case the class was "wikitable" and it was the first table with that class. I have started your code for you so that it gives you the extracted data from the table. Web scraping is good to learn, but if you are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup

def webcrawler():
    url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
    page = requests.get(url)
    soup = BeautifulSoup(page.text,"html.parser")
    tables = soup.findAll("table", class_='wikitable')[0]
    print(tables)

webcrawler()
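Selecting by position ([0]) can break if tables are added to or removed from the page. Since the question asks for the table with a specific title, one possible alternative is to match on the table's title text. The sketch below assumes the title is stored in the table's <caption> element, which has not been verified for this page; if the title lives in a heading before the table, the lookup returns None. The function names find_table_by_title and table_to_rows are hypothetical helpers, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def find_table_by_title(html, title):
    """Return the first <table> whose <caption> contains `title`, else None.

    Assumption: the table title is stored in a <caption> element.
    """
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        caption = table.find("caption")
        if caption is not None and title in caption.get_text(" ", strip=True):
            return table
    return None


def table_to_rows(table):
    """Convert a <table> Tag into a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows without header or data cells
            rows.append(cells)
    return rows
```

To try it on the live page, fetch the HTML with requests.get(url).text, pass it to find_table_by_title together with "Fortune top 10 E.U. corporations by revenue (2016)", check that the result is not None, and then call table_to_rows on it.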
You need to read stackoverflow.com/help/how-to-ask before asking a question. We're here to help, not teach. Please add the code you've already attempted and the issue that comes up when you run it. I would be happy to help you at that point.
– FanMan
Sep 10 '18 at 14:22