Use URLs scraped from a webpage with BeautifulSoup


As per title, I have scraped the webpage I'm interested in and saved the URLs in a variable.


import requests
from bs4 import BeautifulSoup

for pagenumber in range(1, 2):
    url = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22112%22%7D&page={}'.format(pagenumber)
    res = requests.get(url, headers={'User-agent': 'Chrome'})

    soup = BeautifulSoup(res.text, 'html.parser')
    lists = soup.find_all("li", {"class": "expanded"})

    for bill in lists:
        block = bill.find("span", {"class": "result-item"})
        link_cosponsors = block.find_all("a")[1]['href']  # I am interested in the second URL



The last line is giving me the list of URLs. Now I am struggling to access each of these URLs and scrape new information from each of them.


for url in link_cosponsors:
    soup_cosponsor = BeautifulSoup(requests.get(url).text, 'html.parser')
    table = soup_cosponsor.find('table', {'class': 'item_table'})



I think the issue is with the way link_cosponsors is created, i.e. the first element of the list isn't the full 'https://etc.' URL but only 'h', because I get the error "Invalid URL 'h': No schema supplied. Perhaps you meant http://h?".
I have tried appending the links to a list, but that isn't working either.




1 Answer

The problem is that you're reassigning link_cosponsors at each iteration of the for loop, so the variable ends up holding only the last link found, as a string.



What happens then is that your for url in link_cosponsors iterates over that string, letter by letter. Basically like this:



for letter in 'http://the.link.you.want/foo/bar':
    print(letter)
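
This is exactly what produces the error quoted in the question: on the first iteration, requests is handed the single character 'h' as the URL. A quick check (a minimal sketch, assuming a requests version that raises MissingSchema for a schemeless URL):

import requests

try:
    requests.get('h')  # what the loop effectively does on its first pass
except requests.exceptions.MissingSchema as exc:
    print(exc)  # Invalid URL 'h': No schema supplied. Perhaps you meant http://h?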



Solution: you should replace the last 3 lines of the first snippet with:


link_cosponsors = []
for bill in lists:
    block = bill.find("span", {"class": "result-item"})
    link_cosponsors.append(block.find_all("a")[1]['href'])
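
For completeness, once link_cosponsors is a real list, the second snippet from the question can iterate over it to fetch each page. A minimal sketch, reusing the item_table class and the headers from the question; urljoin is only a precaution here in case any scraped href is relative (it leaves absolute URLs untouched):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

for url in link_cosponsors:
    full_url = urljoin('https://www.congress.gov', url)  # no-op for absolute URLs
    res = requests.get(full_url, headers={'User-agent': 'Chrome'})
    soup_cosponsor = BeautifulSoup(res.text, 'html.parser')
    table = soup_cosponsor.find('table', {'class': 'item_table'})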





I have tried doing that already (and just tried again), but what happens is that (assuming there are 100 URLs on the page) it first appends the first URL, then the first and second, then the first, second and third, and so on, so the first URL ends up repeated 100 times. Not sure why it isn't working.
– Gilda Romano
Sep 2 at 21:13





Hmm, it's working here: 100 different URLs as the result.
– Valdir Stumm Junior
Sep 2 at 21:15





Are you sure? I still get 100 items, with the first of length 1, the second of length 2, ..., and the 100th of length 100, which means the first URL is stored 100 times, the second 99 times, etc. (you can see it easily by printing len(link_cosponsors) or link_cosponsors itself). Maybe you could screenshot your output?
– Gilda Romano
Sep 2 at 21:23






@GildaRomano can you share your whole script? You can use a service such as pastebin.com for that and share the link here.
– Valdir Stumm Junior
Sep 3 at 14:11



