Scraping only a particular tag, without details from tags nested inside it
I have a page whose structure is something like this:
<body>
<article> <!--article no 1-->
<h3>
<h2>
<h1>
<a> <!--first 'a' tag-->
<article> <!--article no 2-->
<h1>
<h2>
<a> <!--second 'a' tag-->
</article>
</article>
</body>
I want to extract all 'a' tags inside an article, but without any 'a' tag that comes from a nested article. That is, given:
articles = browser.find_elements_by_tag_name("article")
for i in articles:
    print(i.find_elements_by_tag_name("a"))
For the first article, i.find_elements_by_tag_name("a") returns every 'a' tag inside it, including the 'a' tags of any article tag nested in the current one, and I don't want that.
When I call find_elements on article no 1, 'a' tags from article no 2, or from any other nested article, should not be returned.
Yes, I don't want details from any nested article.
– Nimish Bansal
Sep 4 '18 at 7:24
Can you update the HTML with the parent tag of the <article> <!--article no 1--> tag?
– DebanjanB
Sep 4 '18 at 7:27
I didn't get you?
– Nimish Bansal
Sep 4 '18 at 7:29
The parent tag/node of our desired tag, i.e. <article> <!--article no 1-->, would have been helpful.
– DebanjanB
Sep 4 '18 at 7:31
3 Answers
If you want links that are not inside nested articles, try:
articles = browser.find_elements_by_tag_name('article')
for article in articles:
    print(article.find_elements_by_xpath('./*[not(descendant-or-self::article)]/descendant-or-self::a'))
What if I have an article object, article = articles[0]? Will article.find_elements_by_xpath('a[count(ancestor::article) = 1]') work, and will it return only the 'a' tags within this article tag and none from any other article tags nested in it?
– Nimish Bansal
Sep 4 '18 at 7:37
Yep. In this way: article.find_elements_by_xpath('.//a[count(ancestor::article) = 1]')
– Andersson
Sep 4 '18 at 7:37
Yep, that worked for the parent article, but what if I have an instance of a nested article, i.e. article = articles[1], where articles[1] is nested inside articles[0], and I don't want details from any other article tag nested inside articles[1]?
– Nimish Bansal
Sep 4 '18 at 7:44
It should work for nested articles also. But IMHO you should skip those nested articles from the start, with articles = browser.find_elements_by_xpath('//article[not(ancestor::article)]')
– Andersson
Sep 4 '18 at 7:45
I didn't get you. Actually, I want details of all article tags, whether nested or not. The point is that, for one particular article, I don't want any details from any other article nested inside that particular article.
– Nimish Bansal
Sep 4 '18 at 7:50
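A note on the XPath discussed in these comments: ancestor::article counts every enclosing article in the document, not only those below the element the query starts from, so a predicate fixed at "= 1" only matches links whose sole enclosing article is a top-level one. A minimal sketch of one way around that, assuming the count is made to match the nesting depth of the article being queried. own_links_xpath is a hypothetical helper name, not part of Selenium, and the Selenium calls in the comment use the same legacy API as the question:

```python
def own_links_xpath(article_depth):
    """Build a relative XPath for the <a> descendants that belong to an
    article itself rather than to an article nested inside it.

    article_depth is the number of <article> elements on the path from the
    document root down to the article being queried, that article included
    (1 for a top-level article, 2 for one nested inside it, and so on).
    """
    if article_depth < 1:
        raise ValueError("article_depth must be at least 1")
    # A link owned by the article has exactly article_depth article
    # ancestors; a link inside a more deeply nested article has more.
    return ".//a[count(ancestor::article) = {}]".format(article_depth)


# Hypothetical usage with the question's `browser` (needs a live session,
# so it is left commented out):
#
# for article in browser.find_elements_by_tag_name("article"):
#     depth = len(article.find_elements_by_xpath("ancestor-or-self::article"))
#     print(article.find_elements_by_xpath(own_links_xpath(depth)))

print(own_links_xpath(1))
print(own_links_xpath(2))
```

This costs one extra lookup per article to find its depth, but the same expression then works for top-level and nested articles alike.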
Parse the article element with BeautifulSoup and get all the anchor tags with ease.
from bs4 import BeautifulSoup
articles = browser.find_elements_by_tag_name("article")
links = []
for i in articles:
    # parse the markup of this article element
    soup = BeautifulSoup(i.get_attribute('outerHTML'), 'html5lib')
    # collects every anchor in the markup
    a_tags = soup.find_all('a')
    links.extend(a_tags)
Hope this helps! Cheers!
My question does not ask for 'a' tags inside nested articles.
– Nimish Bansal
Sep 4 '18 at 8:28
Oops! My bad! Thanks for pointing out my mistake!
– SmashGuy
Sep 4 '18 at 8:46
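One way to keep the parse-the-markup approach while leaving out nested articles is to track which article is currently open as the markup is read, so each link is credited to its nearest enclosing article. The sketch below uses only the standard library's html.parser rather than BeautifulSoup. The class name ArticleLinkCollector, the function links_per_article and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class ArticleLinkCollector(HTMLParser):
    """Credit each <a> to its nearest enclosing <article>."""

    def __init__(self):
        super().__init__()
        self.open_articles = []  # stack of indices of articles still open
        self.links = []          # links[i] = hrefs owned by the i-th article

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            # Articles are numbered in document order.
            self.links.append([])
            self.open_articles.append(len(self.links) - 1)
        elif tag == "a" and self.open_articles:
            # The innermost open article owns this link.
            owner = self.open_articles[-1]
            self.links[owner].append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "article" and self.open_articles:
            self.open_articles.pop()


def links_per_article(html):
    """Return one list of hrefs per <article>, in document order."""
    parser = ArticleLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative markup shaped like the structure in the question.
SAMPLE = """
<body>
  <article>
    <h1>Outer</h1>
    <a href="/first">first</a>
    <article>
      <h1>Inner</h1>
      <a href="/second">second</a>
    </article>
    <a href="/third">third</a>
  </article>
</body>
"""

print(links_per_article(SAMPLE))  # [['/first', '/third'], ['/second']]
```

Fed the outerHTML of a single article, the first list in the result holds that article's own links and the remaining lists belong to the articles nested inside it.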
Using BeautifulSoup, find every <a> under an <article> with a selector like ('article a'), then use BeautifulSoup's find_parents() method.
If the length of a.find_parents('article') is 2 or more, that <a> is nested like this:
<article>
..
<article>
..
<a>
So if you remove those, you are left with the <a> tags that have only one <article> parent.
all_a = soup.select('article a')
direct_a = [a for a in all_a if len(a.find_parents('article')) == 1]
Do you mean you don't want to extract anything from the nested (parent) <article> <!--article no 2--> tag?
– DebanjanB
Sep 4 '18 at 7:24