Scrapy extracting rows from table without headers

Scrapy extracting rows from table without headers



So I'm trying to extract a table from a website. It's a two column table as follows:


Name Foo
Number Foo123
Address 10
First Drive
London
AB34 5FG
Region United Kingdom



The table doesn't have headers and the "Address" row contains blank cells in the first column for second, city, postcode, etc.



I've managed to get the table, just fine.



table = response.xpath('//table[@id="MemberDetails"]/tr/td//text()')


table = response.xpath('//table[@id="MemberDetails"]/tr/td//text()')



This is the output:


[<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Name:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnFooxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Number:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnFoo123xa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Address:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rn(10)xa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnFirst Drivexa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnLondonxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnAB34 5FGxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnUnited Kingdomxa0rn'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Region:xa0'>,
<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'rnUnited Kingdomxa0rn'>]



However, I'm stumped as to how I can parse the table into a proper structure.



1st Question: Not sure how I can deal with the address field.
2nd Question: This is a two column table. When saving this, I'd like to transpose such that, the "Name, Number, Address, Region" are column headings.



There are 1000's of pages like this that contain similar data.



Appreciate if someone can point me in the right direction.






First I'd take a step back and ask myself if that xpath really is the best choice. Could you show us a sample of the table's html code? Do none of the table cells have any class/name/id attributes?

– Chillie
Sep 11 '18 at 14:02







@Chillie Yep. The website is poorly formatted. See link: abri.une.edu.au/online/cgi-bin/…

– Kvothe
Sep 11 '18 at 14:47





3 Answers
3



You can generate a dictionary for all rows in your table:


def parse(self, response):

table_data =
current_key = None

for tr in response.xpath('//table[@id="MemberDetails"]//tr'):

key = tr.xpath('string(./td[1])').extract_first()
value = tr.xpath('string(./td[2])').extract_first()

if key:
key = key.strip()
key = key.replace(":", "")
if value:
value = value.strip()

if key:
current_key = key

if current_key in table_data:
table_data[current_key] += 'n' + value
else:
table_data[current_key] = value

print(table_data["Address"])






thanks! this works like a charm. The other responses are good too.

– Kvothe
Sep 16 '18 at 17:50



You can do something like this:


data =
rows = response.css('table#MemberDetails tr')
for row in rows:
label = row.css('td:nth-child(1) strong::text').extract_first().strip()
value = row.css('td+td::text').extract_first().strip()
if label:
label = label.replace(':', '')
data[label] = value
else:
data['Address'] = data['Address'] + ', ' + value
print(data)



It does not work on every situation (for example, in your link Herd Completeness of Performance Rating: label is in a <a> tag and the value is an image), but you have a beginning of solution :)


Herd Completeness of Performance Rating:


<a>



Other solution:


name = response.css('td:contains("Membership Name:") + td::text').extract_first().strip()
...
address = response.css('td:contains("Address:") + td::text').extract_first().strip()
second = response.css('tr:contains("Address:") + tr td + td::text').extract_first().strip()
city = response.css('tr:contains("Address:") + tr + tr td + td::text').extract_first().strip()



Let's work with the sample you have provided us with. (Working with this link )



Let's look at how a table line is formatted.


<tr>
<td bgcolor="#7EADAD">
<strong>Membership Name:&nbsp;</strong>
</td>
<td>
MESSRS R J &amp; L A ACTON&nbsp;
</td>
</tr>



This is great for us, we have different attributes for headers and values: headers are within the strong tags within tds with bgcolor values are directly inside their tds.


strong


td


bgcolor


td



Let's check out how the empty lines for the address info look:


<tr>
<td bgcolor="#7EADAD">
<strong>&nbsp;</strong>
</td>
<td>
NORTHUMBERLAND&nbsp;
</td>
</tr>



Great, same structure.



This means we can loop through all trs and grab their data depending on the attributes if we need to.
Here's a minimal example that does not deal with links:


tr


for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
header = tr.xpath('td/strong/text()').extract()[0].strip()
value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
print(u' -- '.format(header, value))



You next step is to collect the data (make a dictionary? separate lists? write directly to files? It's your choice.).



For dealing with the address: you could make an address variable in the loop for trs that is concatenated with the value if the header is empty (assuming only the address headers are empty). Something like this (again< minimal example):


tr


for tr in response.xpath('//table[@id="MemberDetails"]/tr'):
address = ''
header = tr.xpath('td/strong/text()').extract()[0].strip()
value = tr.xpath('td')[1].xpath('text()').extract()[0].strip()
if not header: # empty strings evaluate to False
address += '' + value



You will also have to work out how to deal with headers/values that are stored within a tags.


a



Thanks for contributing an answer to Stack Overflow!



But avoid



To learn more, see our tips on writing great answers.



Required, but never shown



Required, but never shown




By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

Edmonton

Crossroads (UK TV series)