Scraping only a particular tag, excluding details from the same tag nested inside it

I have a page where the structure is something like

    <body>
      <article>
        <h3> <h2> <h1> <a>
        <article>
          <h1> <h2> <a>
        </article>
      </article>
    </body>

I want to extract all 'a' tags inside an article, but without any 'a' tag that comes from a nested article.

That is, given:

    articles = browser.find_elements_by_tag_name("article")
    for i in articles:
        print(i.find_elements_by_tag_name("a"))

For the first article, i.find_elements_by_tag_name returns all 'a' tags inside it, including the 'a' tags inside the article tag that is nested in the current one, which I don't want.

When I call find_elements on article no. 1, no 'a' tags from article no. 2, or from any other nested article, should be returned.

Do you mean you don't want to extract anything from the nested (parent) <article> tag?
– DebanjanB
Sep 4 '18 at 7:24

Yes, I don't want details from any nested article.
– Nimish Bansal
Sep 4 '18 at 7:24

Can you update the HTML with the parent tag of the <article> tag?
– DebanjanB
Sep 4 '18 at 7:27

I didn't get you?
– Nimish Bansal
Sep 4 '18 at 7:29

The parent tag/node of our desired tag, i.e. <article>, would have been helpful.
– DebanjanB
Sep 4 '18 at 7:31

3 Answers

If you want links that do not come from nested articles, try:

    articles = browser.find_elements_by_tag_name('article')
    for article in articles:
        print(article.find_elements_by_xpath('./*[not(descendant-or-self::article)]/descendant-or-self::a'))

What if I have an article object, article = articles[0]? Will article.find_elements_by_xpath('a[count(ancestor::article) = 1]') work, returning only the 'a' tags within this article tag and none from the article tags nested in it?
– Nimish Bansal
Sep 4 '18 at 7:37

Yep, like this: article.find_elements_by_xpath('.//a[count(ancestor::article) = 1]')
– Andersson
Sep 4 '18 at 7:37

Yep, that worked for the parent article, but what if I have an instance of a nested article, i.e. article = articles[1], where articles[1] is nested inside articles[0], and I don't want details from any other article tag nested inside articles[1]?
– Nimish Bansal
Sep 4 '18 at 7:44

It should work for nested articles too. But IMHO you should skip those nested articles from the start: articles = browser.find_elements_by_xpath('//article[not(ancestor::article)]')
– Andersson
Sep 4 '18 at 7:45

I didn't get you. Actually, I want details of all article tags, whether nested or not. The point is that, for one particular article, I don't want any details from any other article nested inside it.
– Nimish Bansal
Sep 4 '18 at 7:50
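
A possible way to generalize the count(ancestor::article) = 1 idea from the comments above to an article at any nesting level is to first measure the article's own depth and then compare against that number instead of 1. This is only a sketch: own_links_xpath and own_links are hypothetical helper names, article is assumed to be a Selenium WebElement for an <article> tag, and find_elements("xpath", ...) is used as the generic form of the find_elements_by_xpath calls shown in this thread.

```python
def own_links_xpath(depth):
    """Build an XPath for the 'a' tags whose nearest <article> ancestor is
    the context article, given that article's own nesting depth
    (1 = top level, 2 = nested once, ...)."""
    return ".//a[count(ancestor::article) = %d]" % depth


def own_links(article):
    """Return the 'a' elements that belong to `article` itself.

    `article` is assumed to be a Selenium WebElement for an <article> tag.
    """
    # Depth of this article: itself plus every <article> above it.
    depth = len(article.find_elements("xpath", "ancestor-or-self::article"))
    # An 'a' below this article with exactly `depth` article ancestors has
    # no further <article> between itself and this article.
    return article.find_elements("xpath", own_links_xpath(depth))
```

With these helpers, looping over browser.find_elements_by_tag_name('article') and calling own_links(article) on each element should give every article, nested or not, only its own links.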

Parse the article element with BeautifulSoup and get all the anchor tags with ease.

    from bs4 import BeautifulSoup

    articles = browser.find_elements_by_tag_name("article")
    links = []
    for i in articles:
        # Parse each article's HTML separately and collect its anchor tags
        soup = BeautifulSoup(i.get_attribute('outerHTML'), 'html5lib')
        a_tags = soup.findAll('a')
        links.extend(a_tags)

Hope this helps! Cheers!

My question does not ask for 'a' tags inside nested articles.
– Nimish Bansal
Sep 4 '18 at 8:28

Oops! My bad! Thanks for pointing out my mistake!
– SmashGuy
Sep 4 '18 at 8:46

Using BeautifulSoup:

First find all <a> tags under an <article>, e.g. with the selector ('article a'),

then use BeautifulSoup's find_parents() method.

If the length of find_parents('article') for one of those <a> tags is 2 or more, the tag is nested like this:

<article> .. <article> .. <a>

So if you remove those, you are left with the <a> tags that have only one <article> parent.

    all_a = soup.select('article a')
    direct_a = [i for i in all_a if len(i.find_parents('article')) == 1]
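
The "nearest <article> parent" idea can also be sketched without BeautifulSoup, using only the standard library's html.parser, so that every article (nested or not) gets just its own links. This is an alternative technique, not the answer's method: ArticleLinkCollector, links_per_article and the sample markup are hypothetical names made up for illustration, and the HTML is assumed to have properly closed <article> tags.

```python
from html.parser import HTMLParser


class ArticleLinkCollector(HTMLParser):
    """Assign each <a> to its nearest enclosing <article> only."""

    def __init__(self):
        super().__init__()
        self.links = []   # one list of hrefs per <article>, in document order
        self._open = []   # indexes into self.links for currently open articles

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.links.append([])
            self._open.append(len(self.links) - 1)
        elif tag == "a" and self._open:
            # Credit the link to the innermost open article only.
            self.links[self._open[-1]].append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "article" and self._open:
            self._open.pop()


def links_per_article(html):
    """Return a list with one list of hrefs for each <article> in `html`."""
    parser = ArticleLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup mirroring the structure in the question.
sample = (
    '<body><article><h1>Outer</h1><a href="/outer">o</a>'
    '<article><h2>Inner</h2><a href="/inner">i</a></article>'
    '<a href="/outer-2">o2</a></article></body>'
)
print(links_per_article(sample))
```

For the sample above this prints [['/outer', '/outer-2'], ['/inner']]: the outer article keeps its two links and the nested one keeps its single link. With Selenium, the string passed in could be browser.page_source.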
