Scrapy extracting rows from table without headers
Scrapy extracting rows from table without headers
So I'm trying to extract a table from a website. It's a two column table as follows:
Name Foo
Number Foo123
Address 10
First Drive
London
AB34 5FG
Region United Kingdom
The table doesn't have headers and the "Address" row contains blank cells in the first column for second, city, postcode, etc.
I've managed to get the table, just fine.
table = response.xpath('//table[@id="MemberDetails"]/tr/td//text()')
table = response.xpath('//table[@id="MemberDetails"]/tr/td//text()')
This is the output:
[<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Name:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnFooxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Number:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnFoo123xa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Address:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rn(10)xa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnFirst Drivexa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnLondonxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnAB34 5FGxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnUnited Kingdomxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Region:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnUnited Kingdomxa0rn'>]
However, I'm stumped as to how I can parse the table into a proper structure.
1st Question: Not sure how I can deal with the address field.
2nd Question: This is a two column table. When saving this, I'd like to transpose such that, the "Name, Number, Address, Region" are column headings.
There are 1000's of pages like this that contain similar data.
Appreciate if someone can point me in the right direction.
@Chillie Yep. The website is poorly formatted. See link: abri.une.edu.au/online/cgi-bin/…
– Kvothe
Sep 11 '18 at 14:47
3 Answers
3
You can generate a dictionary for all rows in your table:
def parse(self, response):
table_data =
current_key = None
for tr in response.xpath('//table[@id="MemberDetails"]//tr'):
key = tr.xpath('string(./td[1])').extract_first()
value = tr.xpath('string(./td[2])').extract_first()
if key:
key = key.strip()
key = key.replace(":", "")
if value:
value = value.strip()
if key:
current_key = key
if current_key in table_data:
table_data[current_key] += 'n' + value
else:
table_data[current_key] = value
print(table_data["Address"])
thanks! this works like a charm. The other responses are good too.
– Kvothe
Sep 16 '18 at 17:50
You can do something like this:
data =
rows = response.css('table#MemberDetails tr')
for row in rows:
label = row.css('td:nth-child(1) strong::text').extract_first().strip()
value = row.css('td+td::text').extract_first().strip()
if label:
label = label.replace(':', '')
data[label] = value
else:
data['Address'] = data['Address'] + ', ' + value
print(data)
It does not work on every situation (for example, in your link Herd Completeness of Performance Rating:
label is in a <a>
tag and the value is an image), but you have a beginning of solution :)
Herd Completeness of Performance Rating:
<a>
Other solution:
name = response.css('td:contains("Membership Name:") + td::text').extract_first().strip()
...
address = response.css('td:contains("Address:") + td::text').extract_first().strip()
second = response.css('tr:contains("Address:") + tr td + td::text').extract_first().strip()
city = response.css('tr:contains("Address:") + tr + tr td + td::text').extract_first().strip()
Let's work with the sample you have provided us with. (Working with this link )
Let's look at how a table line is formatted.
<tr>
<td bgcolor="#7EADAD">
<strong>Membership Name: </strong>
</td>
<td>
MESSRS R J & L A ACTON
</td>
</tr>
This is great for us, we have different attributes for headers and values: headers are within the strong
tags within td
s with bgcolor
values are directly inside their td
s.
strong
td
bgcolor
td
Let's check out how the empty lines for the address info look:
<tr>
<td bgcolor="#7EADAD">
<strong> </strong>
</td>
<td>
NORTHUMBERLAND
</td>
</tr>
Great, same structure.
This means we can loop through all tr
s and grab their data depending on the attributes if we need to.
Here's a minimal example that does not deal with links:
tr
for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
header = tr.xpath('td/strong/text()').extract()[0].strip()
value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
print(u' -- '.format(header, value))
You next step is to collect the data (make a dictionary? separate lists? write directly to files? It's your choice.).
For dealing with the address: you could make an address variable in the loop for tr
s that is concatenated with the value if the header is empty (assuming only the address headers are empty). Something like this (again< minimal example):
tr
for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
address = ''
header = tr.xpath('td/strong/text()').extract()[0].strip()
value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
if not header: # empty strings evaluate to False
address += '' + value
You will also have to work out how to deal with headers/values that are stored within a
tags.
a
Thanks for contributing an answer to Stack Overflow!
But avoid …
To learn more, see our tips on writing great answers.
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
First I'd take a step back and ask myself if that xpath really is the best choice. Could you show us a sample of the table's html code? Do none of the table cells have any class/name/id attributes?
– Chillie
Sep 11 '18 at 14:02