BeautifulSoup 4: Extracting multiple titles and links from different <p> tags




HTML Code:


<div>
<p class="title">
<a href="/news/123456">title_1</a>
</p>
</div>

<div>
<p class="title">
<a href="/news/789000">title_2</a>
</p>
</div>



My Code:


import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def web(WebUrl):
    site = urlparse(WebUrl)
    code = requests.get(WebUrl)
    plain = code.text
    s = BeautifulSoup(plain, "html.parser")
    p_containers = s.find('p', {'class': 'title'})

    for title in s.find_all('p', {'class': 'title'}):
        line = title.get_text()
        print(line)
        for link in p_containers.find_all('a'):
            line2 = link.get('href')
            print(site.netloc + str(line2))



Hi guys, I need some help with this. My task is to extract titles and links from a webpage. I was able to extract the titles but not the links: when I try to scrape the links, only the first link is scraped successfully, and every following link is replaced with that first link.





Without checking, I think the answer might be to change p_containers = s.find('p', {'class': 'title'}) to p_containers = s.find_all('p', {'class': 'title'})
– ncfirth
Aug 22 at 8:05








No I was wrong, answer to follow!
– ncfirth
Aug 22 at 8:14





Oops, there's a missing indentation on the for loop; it's nested.
– Charlie
Aug 22 at 8:23





If my answer was helpful, can you mark it as accepted?
– ncfirth
Aug 22 at 9:21




2 Answers



You have most of the pieces in your code, but it is slightly off. I think the simplest way to get the titles and links is the following.


from bs4 import BeautifulSoup

site = """<div>
<p class="title">
<a href="/news/123456">title_1</a>
</p>
</div>

<div>
<p class="title">
<a href="/news/789000">title_2</a>
</p>
</div>"""

s = BeautifulSoup(site, "html.parser")

# Search for links inside each title paragraph, not inside a single fixed one
for title in s.find_all('p', {'class': 'title'}):
    links = [x['href'] for x in title.find_all('a', href=True)]
    line = title.get_text()
    print(line)
    print(links)



Note that links is a list, in case a title contains more than one link.
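As far as I can tell, the question's loop repeated the first link because find returns only the first matching tag, so p_containers always pointed at the first <p class="title">. Here is a minimal sketch of the difference on the same sample HTML; the names html, first_only and per_title are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div>
<p class="title">
<a href="/news/123456">title_1</a>
</p>
</div>

<div>
<p class="title">
<a href="/news/789000">title_2</a>
</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, so this is always the first <p>.
first_p = soup.find("p", {"class": "title"})
first_only = [a.get("href") for a in first_p.find_all("a")]

# Searching inside each <p> from find_all() yields that paragraph's own links.
per_title = [
    [a.get("href") for a in p.find_all("a")]
    for p in soup.find_all("p", {"class": "title"})
]

print(first_only)
print(per_title)
```

Looping over first_p inside the title loop would print the same href once per title, which matches the behaviour described in the question.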



Try it this way; it uses find_all to get every title, then looks up the link inside each one.


from bs4 import BeautifulSoup

text = """<div>
<p class="title">
<a href="/news/123456">title_1</a>
</p>
</div>

<div>
<p class="title">
<a href="/news/789000">title_2</a>
</p>
</div>
"""

soup = BeautifulSoup(text, 'html.parser')
for i in soup.find_all('p', attrs={'class': 'title'}):
    link = None
    # Guard against a title paragraph that has no <a> tag
    if i.find('a'):
        link = i.find('a').get('href')
    print('Title:', i.get_text(strip=True), 'Link:', link)
# Output as:
# Title: title_1 Link: /news/123456
# Title: title_2 Link: /news/789000
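The hrefs in this sample are relative (/news/123456). The question built full addresses with site.netloc + href, which leaves out the scheme. One possible alternative is urllib.parse.urljoin from the standard library. This is only a sketch: the base URL https://example.com/news/ is a placeholder, not the asker's actual site.

```python
from urllib.parse import urljoin

# Hypothetical base URL; substitute the page that was actually fetched.
base_url = "https://example.com/news/"

# Relative links as extracted from the sample HTML.
hrefs = ["/news/123456", "/news/789000"]

# urljoin resolves each href against the base, keeping scheme and host.
absolute = [urljoin(base_url, href) for href in hrefs]

for url in absolute:
    print(url)
```

Because urljoin resolves links the way a browser would, it should also handle hrefs that are already absolute or that are relative to the current path.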





