Scrapy extracting rows from table without headers

I'm trying to extract a table from a website. It's a two-column table, as follows:

    Name       Foo
    Number     Foo123
    Address    10
               First Drive
               London
               AB34 5FG
    Region     United Kingdom

The table doesn't have headers, and the "Address" entry spans several rows: the rows for the second line, city, postcode, etc. have a blank cell in the first column.

I've managed to get the table just fine:

table = response.xpath('//table[@id="MemberDetails"]/tr/td//text()')

This is the output:

[<Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Name:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFoo\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Number:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFoo123\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Address:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\n(10)\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nFirst Drive\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nLondon\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nAB34 5FG\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nUnited Kingdom\xa0\r\n'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'Region:\xa0'>,
 <Selector xpath='//table[@id="MemberDetails"]/tr/td//text()' data=u'\r\nUnited Kingdom\xa0\r\n'>]

However, I'm stumped as to how I can parse the table into a proper structure.

1st Question: Not sure how I can deal with the address field.
2nd Question: This is a two-column table. When saving it, I'd like to transpose it so that "Name", "Number", "Address" and "Region" become the column headings.

There are thousands of pages like this that contain similar data.

I'd appreciate it if someone could point me in the right direction.

First I'd take a step back and ask myself if that xpath really is the best choice. Could you show us a sample of the table's html code? Do none of the table cells have any class/name/id attributes?

– Chillie
Sep 11 '18 at 14:02

@Chillie Yep. The website is poorly formatted. See link: abri.une.edu.au/online/cgi-bin/…

– Kvothe
Sep 11 '18 at 14:47

3 Answers

You can generate a dictionary for all rows in your table:

    def parse(self, response):
        table_data = {}
        current_key = None
        for tr in response.xpath('//table[@id="MemberDetails"]//tr'):
            # string() returns the full text of a cell, including nested tags
            key = tr.xpath('string(./td[1])').extract_first()
            value = tr.xpath('string(./td[2])').extract_first()
            if key:
                key = key.strip()
                key = key.replace(":", "")
            if value:
                value = value.strip()
            # A blank first cell means this row continues the previous key
            if key:
                current_key = key
            if current_key in table_data:
                table_data[current_key] += '\n' + value
            else:
                table_data[current_key] = value
        print(table_data["Address"])
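The folding idea in this answer can be tried without running a spider. Below is a minimal sketch in plain Python, assuming the cell texts have already been extracted as (label, value) pairs; `fold_rows` and `sample_rows` are hypothetical names, and the sample values are modelled on the selector output shown in the question.

```python
def fold_rows(rows):
    """Fold (label, value) pairs into a dict.

    A row whose label is blank is treated as a continuation of the
    most recent labelled row (the multi-line "Address" case).
    """
    record = {}
    current_key = None
    for label, value in rows:
        # str.strip() removes the \r\n padding and the non-breaking space (\xa0)
        label = label.strip().rstrip(":")
        value = value.strip()
        if label:
            current_key = label
        if current_key is None:
            continue  # continuation row before any label: nothing to attach to
        if current_key in record:
            record[current_key] += "\n" + value
        else:
            record[current_key] = value
    return record


# Hypothetical sample, modelled on the output in the question
sample_rows = [
    ("Name:\xa0", "\r\nFoo\xa0\r\n"),
    ("Number:\xa0", "\r\nFoo123\xa0\r\n"),
    ("Address:\xa0", "\r\n(10)\xa0\r\n"),
    ("\xa0", "\r\nFirst Drive\xa0\r\n"),
    ("\xa0", "\r\nLondon\xa0\r\n"),
    ("\xa0", "\r\nAB34 5FG\xa0\r\n"),
    ("\xa0", "\r\nUnited Kingdom\xa0\r\n"),
    ("Region:\xa0", "\r\nUnited Kingdom\xa0\r\n"),
]

record = fold_rows(sample_rows)
print(record["Address"])
```

Keeping the folding step separate from the XPath extraction makes it easy to check the address handling against a handful of pages before crawling all of them.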

thanks! this works like a charm. The other responses are good too.

– Kvothe
Sep 16 '18 at 17:50

You can do something like this:

    data = {}
    rows = response.css('table#MemberDetails tr')
    for row in rows:
        # default='' avoids calling .strip() on None when nothing matches
        label = row.css('td:nth-child(1) strong::text').extract_first(default='').strip()
        value = row.css('td+td::text').extract_first(default='').strip()
        if label:
            label = label.replace(':', '')
            data[label] = value
        else:
            # Rows without a label are treated as further address lines
            data['Address'] = data['Address'] + ', ' + value
    print(data)

It doesn't work in every situation (for example, on the page you linked, the "Herd Completeness of Performance Rating:" label is inside an <a> tag and its value is an image), but it should give you a starting point :)
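For the second question (saving with "Name", "Number", "Address" and "Region" as column headings), one option is to treat each page's dictionary as one row. This is only a sketch using the standard library's `csv.DictWriter`; `write_records`, `FIELDS` and the second sample record are made up for illustration.

```python
import csv
import io

# Column order for the output; labels taken from the table in the question
FIELDS = ["Name", "Number", "Address", "Region"]


def write_records(records, out):
    """Write one CSV row per record, using the table labels as headers."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        restval="",             # pages missing a label get an empty cell
        extrasaction="ignore",  # labels not listed in FIELDS are skipped
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)


records = [
    {"Name": "Foo", "Number": "Foo123",
     "Address": "10, First Drive, London, AB34 5FG",
     "Region": "United Kingdom"},
    # Hypothetical second page with no address rows
    {"Name": "Bar", "Number": "Bar456", "Region": "United Kingdom"},
]

buffer = io.StringIO()
write_records(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Inside a Scrapy spider the same result should be reachable without writing the file yourself: yield the dictionary from `parse` instead of printing it, and run the crawl with the `-o` option and a `.csv` file name so that the feed exporter writes the rows.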
