Python web scraping query
I have written my first-ever Python code to scrape a dividend history table from the web, but the soup.select statement doesn't seem to select anything and raises an IndexError.
Any advice on how to resolve this, please?
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path='F:PythonAppsChromeDriverChromeDriver.exe')
driver.get("https://www.dividendchannel.com/history/?symbol=ibm")
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
table = soup.select("table#Dividend History")[0]
print(table)
list_row = [[tab_d.text.strip().replace("\n", "") for tab_d in
             item.select('th,td')] for item in table.select('tr')]
for data in list_row[:2]:
    print(' '.join(data))
  File "F:/System/Python/dividend.py", line 9, in <module>
    table = soup.select("table#Dividend History")[0]
IndexError: list index out of range
That means your selector did not match anything on the page. You might want to change your selector to: #divvytable > table
– hpca01
Aug 22 at 19:28
This kind of error pops up when the selection returns no elements. It looks like "table#Dividend History" does not match anything on that page. The table you want is nested under "div#divvytable". Try starting there.
– dustintheglass
Aug 22 at 20:30
2 Answers
This is not a direct answer, but a recommendation. Depending on what you need it for, note that the website you have referenced limits usage by IP: it can only be accessed 6 times.
Take a look at the IEX API for dividends, which is free (I am not advertising it).
If you choose to use it, it might make your application that much more efficient. It is much easier to work with JSON data and then convert it to a pandas DataFrame or post it to a front end via JavaScript.
Here is a sample call for Apple for the last 5 years:
https://api.iextrading.com/1.0/stock/aapl/dividends/5y
You would use requests.get(url, params).json() and traverse it through a simple for loop.
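As a rough sketch of that loop, assuming the endpoint returns a JSON list of objects: the key names "paymentDate" and "amount" below are assumptions about the response layout, the helper names are hypothetical, and the sample records are made-up placeholders rather than real dividend data.

```python
def format_dividends(records, date_key="paymentDate", amount_key="amount"):
    """Turn a list of dividend dicts into 'date: amount' strings.

    The default key names are assumptions about the JSON layout.
    """
    lines = []
    for record in records:
        date = record.get(date_key)
        amount = record.get(amount_key)
        # Skip records that lack either field instead of failing
        if date is None or amount is None:
            continue
        lines.append("{}: {}".format(date, amount))
    return lines


def fetch_dividends(url, params=None):
    """Fetch and decode the JSON payload (hypothetical helper)."""
    # Imported here so format_dividends works without requests installed
    import requests

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()


# Made-up placeholder records, only to show the shape of the loop
sample = [
    {"paymentDate": "2000-01-01", "amount": 0.5},
    {"paymentDate": "2000-04-01", "amount": 0.6},
    {"amount": 0.7},  # no date, so this one is skipped
]
for line in format_dividends(sample):
    print(line)
```

With a live response you would pass the result of fetch_dividends(url) to format_dividends in place of the sample list, after checking which keys the JSON actually contains.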
It seems that the layout of this page is based on tables, lots of tables. Your code is trying to find a table with the id "Dividend", which does not exist.
Here is your code after some tweaks. It finds the rows that contain data, and then extracts the data from those rows:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome()
driver.get("https://www.dividendchannel.com/history/?symbol=ibm")
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
# The dividend table is nested inside div#divvytable
dividend_rows = soup.select("div#divvytable")[0].find_all("tr")
for row in dividend_rows:
    columns = list(row.stripped_strings)
    # Keep only rows with exactly two text cells (date and amount)
    if len(columns) != 2:
        continue
    print("date: {} amount: {}".format(columns[0], columns[1]))
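To experiment with the selector without hitting the site, the same row-extraction idea can be tried on a small inline snippet. This is only a sketch: the markup and its values are a hypothetical stand-in, not the real page, and extract_rows is an illustrative helper name. It uses select_one, which returns None when nothing matches, so an unmatched selector does not raise an IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page's structure may differ
HTML = """
<div id="divvytable">
  <table>
    <tr><th>Date</th><th>Amount</th></tr>
    <tr><td>01/01/2000</td><td>0.50</td></tr>
    <tr><td>04/01/2000</td><td>0.60</td><td>extra</td></tr>
  </table>
</div>
"""


def extract_rows(html, selector="div#divvytable"):
    """Return the two-cell rows found under the first element matching selector."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(selector)
    if container is None:
        # Nothing matched, so return an empty result instead of indexing into []
        return []
    rows = []
    for row in container.find_all("tr"):
        columns = list(row.stripped_strings)
        if len(columns) == 2:
            rows.append(columns)
    return rows


for date, amount in extract_rows(HTML):
    print("date: {} amount: {}".format(date, amount))
```

Note that the header row also has two cells here, so it is included; filter it out separately if you only want data rows.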
Cheers. I really need to spend some time working out HTML structure! One further question: why was the Google bot option necessary, please?
– KitsuneMakai
Aug 22 at 21:21
@KitsuneMakai you are right, this part is not needed. I have removed it.
– user44
Aug 23 at 4:06
I also need to extract stock splits from the sister site, but it doesn't seem to have any id attributes on the relevant table.
– KitsuneMakai
Aug 24 at 21:14
If you mean you need to parse the whole table with each date and Division column, then you need to select each row separately and extract the text from it.
– Vishal Singh
Aug 22 at 17:22