Use URLs scraped from a webpage with BeautifulSoup
As per title, I have scraped the webpage I'm interested in and saved the URLs in a variable.
    import requests
    from bs4 import BeautifulSoup

    for pagenumber in range(1, 2):
        url = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22112%22%7D&page={}'.format(pagenumber)
        res = requests.get(url, headers={'User-agent': 'Chrome'})
        soup = BeautifulSoup(res.text, 'html.parser')
        lists = soup.find_all("li", {"class": "expanded"})
        for bill in lists:
            block = bill.find("span", {"class": "result-item"})
            link_cosponsors = block.find_all("a")[1]['href']  # I am interested in the second URL
The last line is giving me the list of URLs. Now I am struggling to access each of these URLs and scrape new information from each of them.
    for url in link_cosponsors:
        soup_cosponsor = BeautifulSoup(requests.get(url).text, 'html.parser')
        table = soup.find('table', {'class': 'item_table'})
I think the issue is with the way link_cosponsors is created i.e. the first element of the list isn't the full 'https://etc.' but only 'h', because I get the error "Invalid URL 'h': No schema supplied. Perhaps you meant http://h?".
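A quick check that makes the symptom visible (a sketch, using the variables from the first snippet):

    print(type(link_cosponsors))  # <class 'str'> -- a single URL string, not a list
    print(link_cosponsors[0])     # 'h', the first character, which requests rejects as a URL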
I have tried appending the links to a list but that isn't working either.
1 Answer
The problem is that you're reassigning link_cosponsors at each iteration of the for loop. This way, the variable will hold only the last link you've found, as a string. What happens then is that your for url in link_cosponsors loop iterates over that string, letter by letter. Basically like this:

    for letter in 'http://the.link.you.want/foo/bar':
        print(letter)
Solution: you should replace the last 3 lines of the first snippet with:

    link_cosponsors = []
    for bill in lists:
        block = bill.find("span", {"class": "result-item"})
        link_cosponsors.append(block.find_all("a")[1]['href'])
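With the links collected in a list, the loop over them should then work. A minimal sketch of the follow-up, assuming the 'item_table' class from the question; note that it searches soup_cosponsor, the per-page soup, rather than the outer soup, which appears to be a slip in the question's second snippet:

    for url in link_cosponsors:
        res = requests.get(url, headers={'User-agent': 'Chrome'})
        soup_cosponsor = BeautifulSoup(res.text, 'html.parser')
        # search the cosponsors page itself, not the search-results page
        table = soup_cosponsor.find('table', {'class': 'item_table'})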
I have tried doing that already (and just tried again), but what happens is that (assuming there are 100 URLs on the page) it first appends the first URL, then the first and second URLs, then the first, second, and third, etc. So the first URL will be repeated 100 times. Not sure why it isn't working.
– Gilda Romano
Sep 2 at 21:13
Hmm, it's working here: 100 different URLs as the result.
– Valdir Stumm Junior
Sep 2 at 21:15
Are you sure? I still get 100 items, with the first of length 1, the second of length 2, ... the 100th of length 100, which means the first URL is stored 100 times, the second 99 times, etc. (you can see it easily by printing len(link_cosponsors) or link_cosponsors itself). Could you maybe share a screenshot of your output?
– Gilda Romano
Sep 2 at 21:23
@GildaRomano can you share your whole script? You can use a service such as pastebin.com for that and share the link here.
– Valdir Stumm Junior
Sep 3 at 14:11
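One guess at what reconciles these two observations (not confirmed in the thread, just a sketch): printing link_cosponsors inside the loop shows the partial list at each step, i.e. a list of length 1, then 2, and so on up to 100, even though the finished list holds 100 distinct URLs.

    link_cosponsors = []
    for bill in lists:
        block = bill.find("span", {"class": "result-item"})
        link_cosponsors.append(block.find_all("a")[1]['href'])
        print(link_cosponsors)   # inside the loop: partial lists of length 1, 2, ...

    print(len(link_cosponsors))  # after the loop: 100, one entry per bill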