Scraping only a particular tag without details from tags nested inside it




I have a page where the structure is something like


<body>
<article> <!--article no 1-->
<h3>
<h2>
<h1>
<a> <!--first 'a' tag-->

<article> <!--article no 2-->
<h1>
<h2>
<a> <!--second 'a' tag-->
</article>
</article>
</body>



Now I want to extract all 'a' tags inside an article, but without any 'a' tag that comes from a nested article. That is:


articles = browser.find_elements_by_tag_name("article")
for i in articles:
    print(i.find_elements_by_tag_name("a"))



For the first article, i.find_elements_by_tag_name("a") returns all 'a' tags inside this article tag, including the 'a' tags inside any article tag nested in the current one. I don't want that.

When I call find_elements on article no 1, the 'a' tags in article no 2, or in any other nested article, should not be returned.





Do you mean you don't want to extract anything from the nested (parent) <article> <!--article no 2--> tag?
– DebanjanB
Sep 4 '18 at 7:24







Yes, I don't want details from any nested article.
– Nimish Bansal
Sep 4 '18 at 7:24





Can you update the HTML with the parent tag of the <article> <!--article no 1--> tag?
– DebanjanB
Sep 4 '18 at 7:27








I didn't get you?
– Nimish Bansal
Sep 4 '18 at 7:29





The parent tag/node of our desired tag i.e. <article> <!--article no 1--> would have been helpful
– DebanjanB
Sep 4 '18 at 7:31






3 Answers



If you want links only from the part of each article that is not inside a nested article, try:


articles = browser.find_elements_by_tag_name('article')
for article in articles:
    print(article.find_elements_by_xpath('./*[not(descendant-or-self::article)]/descendant-or-self::a'))





What if I have an article object, article = articles[0]? Will article.find_elements_by_xpath('a[count(ancestor::article) = 1]') return only the 'a' tags within this article tag, and none from any article tags nested in it?
– Nimish Bansal
Sep 4 '18 at 7:37





Yep. Like this: article.find_elements_by_xpath('.//a[count(ancestor::article) = 1]')
– Andersson
Sep 4 '18 at 7:37







Yep, that worked for the parent article. But what if I have an instance of a nested article, i.e. article = articles[1], where articles[1] is nested inside articles[0], and I don't want details from any other article tag nested inside articles[1]?
– Nimish Bansal
Sep 4 '18 at 7:44






It should work for nested articles also. But IMHO you should skip those nested articles from the start, with articles = browser.find_elements_by_xpath('//article[not(ancestor::article)]')
– Andersson
Sep 4 '18 at 7:45








I didn't get you. Actually, I want the details of all article tags, whether nested or not. The point is that, for any one particular article, I don't want details from any other article nested inside it.
– Nimish Bansal
Sep 4 '18 at 7:50
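
One possible way to meet the requirement in the last comment (every article, nested or not, with only its own links) is to compute how deeply each article is nested and compare that with the number of article ancestors of each link. The sketch below is an assumption layered on top of the XPath from this thread, not something the answerer proposed. The helper name own_links is hypothetical, and it assumes the element exposes find_elements(by, value) as a Selenium WebElement does, with "xpath" being the string value of By.XPATH.

```python
def own_links(article):
    """Return the <a> elements of `article` that are not inside a nested <article>.

    Assumption: `article` behaves like a Selenium WebElement, i.e. it has
    find_elements(by, value) and "xpath" is accepted as the locator strategy.
    """
    # Nesting depth of this article: itself plus every enclosing <article>.
    depth = len(article.find_elements("xpath", "ancestor-or-self::article"))
    # The ancestor axis is evaluated up to the document root, so a link that
    # belongs directly to this article has exactly `depth` article ancestors;
    # links inside deeper nested articles have more and are filtered out.
    return article.find_elements(
        "xpath", f".//a[count(ancestor::article) = {depth}]"
    )


# Hypothetical usage with a live driver named `browser`:
# for article in browser.find_elements("xpath", "//article"):
#     print(own_links(article))
```

For a top-level article the depth is 1, so this reduces to the './/a[count(ancestor::article) = 1]' expression from the comments above. The cost is one extra XPath query per article.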




Parse the article element with BeautifulSoup and get all the anchor tags with ease.




from bs4 import BeautifulSoup
articles = browser.find_elements_by_tag_name("article")
links = []
for i in articles:
    soup = BeautifulSoup(i.get_attribute('outerHTML'), 'html5lib')
    a_tags = soup.findAll('a')
    links.extend(a_tags)



Hope this helps! Cheers!





My question does not ask for 'a' tags inside nested articles; those are exactly what I want to exclude.
– Nimish Bansal
Sep 4 '18 at 8:28






Oops! My bad! Thanks for pointing out my mistake!
– SmashGuy
Sep 4 '18 at 8:46



Using BeautifulSoup, first find all <a> tags under an <article>, for example with the CSS selector 'article a'.

Then use the find_parents() method of BeautifulSoup on each of them.

If an <a> has more than one <article> parent, i.e. len(a.find_parents('article')) > 1, it is nested like this:


<article>
..
<article>
..
<a>



So if you remove those, you are left with the <a> tags that have only one <article> parent:


all_a = soup.select('article a')

direct_a = [a for a in all_a if len(a.find_parents('article')) == 1]
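
The filter above keeps only the links of top-level articles. If the links of every article are wanted, each one attributed to its nearest enclosing <article>, the same idea can be sketched with only the standard library. This swaps BeautifulSoup for Python's html.parser purely so the example runs without third-party packages. The class name ArticleLinkCollector and the sample markup are hypothetical, and the sketch assumes every <article> is closed explicitly.

```python
from html.parser import HTMLParser


class ArticleLinkCollector(HTMLParser):
    """Group each <a> href under its nearest enclosing <article>."""

    def __init__(self):
        super().__init__()
        self.open_articles = []  # indices of currently open articles, innermost last
        self.links = []          # links[i] holds hrefs owned directly by article i

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.links.append([])
            self.open_articles.append(len(self.links) - 1)
        elif tag == "a" and self.open_articles:
            # Attribute the link to the innermost open article only.
            self.links[self.open_articles[-1]].append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "article" and self.open_articles:
            self.open_articles.pop()


# Hypothetical sample page: one article nested inside another.
sample = """
<article>
  <h1>Outer</h1><a href="/outer">outer link</a>
  <article>
    <h1>Inner</h1><a href="/inner">inner link</a>
  </article>
  <a href="/outer-2">another outer link</a>
</article>
"""

collector = ArticleLinkCollector()
collector.feed(sample)
print(collector.links)  # [['/outer', '/outer-2'], ['/inner']]
```

With Selenium, the string passed to feed() could be browser.page_source. Articles are numbered in document order, so the outer article is index 0 and the nested one is index 1.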


