How to check if a website supports http, https and the www prefix with scrapy
I am using scrapy to check whether a website works fine when I use http://example.com, https://example.com or http://www.example.com. When I create a scrapy request, it works fine; for example, on my page1.com, it is always redirected to https://. I need to get this information as a return value. Or is there a better way to get this information using scrapy?
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'
    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')
        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)
The output of this spider is this:
https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200
This is nice, but I would like to get this information as a return value, to know that e.g. on http the response code is 200, and then save it to a dictionary for later processing, or save it as JSON to a file (using items in scrapy).
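For illustration, a minimal scrapy Item that such a record could use (the field names url, urlprefix and urlstatus are assumptions; they match the pipeline sketch in the first answer below):

import scrapy

# hypothetical item holding the result of one prefix check
class UrlCheckItem(scrapy.Item):
    url = scrapy.Field()        # bare domain, e.g. 'page1.com'
    urlprefix = scrapy.Field()  # one of 'http://', 'https://www.', ...
    urlstatus = scrapy.Field()  # HTTP status code of the response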
DESIRED OUTPUT:
I would like to have a dictionary named a with all the information:

print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}

Later I would like to scrape more information, so I will need to store all the information under one object/json/...
Would you be able to update a desired output? – Joseph Seung Jae Dollar, Aug 30 at 20:32
I inserted desired output. Thank you. – dorinand, Aug 30 at 20:39
2 Answers
Instead of using the meta possibility that eLRuLL pointed out, you can parse request.url:
scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'
In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
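The same pair is available inside a spider callback, so a minimal sketch of a callback that records the requested versus final URL could look like this (the output field names are assumptions):

def parse(self, response):
    # response.request.url is the URL that was requested,
    # response.url is the URL after any redirects
    yield {
        'requested': response.request.url,  # e.g. 'http://stackoverflow.com'
        'final': response.url,              # e.g. 'https://stackoverflow.com/'
        'status': response.status,
    }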
To store the values from the different runs together in one dict/JSON, you can use an additional pipeline, as mentioned in https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
So you have something like:
Class WriteAllRequests(object):
def __init__(self):
self.urldic=
def process_item(self, item, spider):
urldic[item.url]=item.urlprefix=item.urlstatus
if len(urldic[item.url])==4:
# think this can be passed to a standard pipeline with a higher number
writedata (urldic[item.url])
del urldic[item.url]
You must additionally activate the pipeline.
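A minimal sketch of the activation in settings.py (the myproject.pipelines module path is an assumption; the number orders this pipeline among any others):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.WriteAllRequests': 300,
}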
As I mentioned above, I know how to check and print it; the question is how to get a return value from the function and store it, e.g. as a dictionary. When I get to the callback function, code can be processed there, but it does not return anything. One option is to store it somewhere, e.g. in redis, and read and save it after returning from the callback. It could work, but in my opinion it is not the best way to solve it. – dorinand, Sep 1 at 13:13
Ok, your issue is that you want to store the values of 4 different runs in one dict/json ... so I changed my answer. – Thomas Strub, Sep 3 at 7:28
You are doing one extra request at the beginning of the spider; you could deal with all those domains directly in the start_requests method:
class myspider(scrapy.Spider):
    name = 'superspider'

    def start_requests(self):
        url = 'https://page1.com/'  # the site to check (response.url is not available here)
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')
        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(
                url='{}{}'.format(prefix, url),
                callback=self.parse,
                dont_filter=True,
                meta={'prefix': prefix},
            )

    def parse(self, response):
        yield {response.meta['prefix']: True}
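If you run the spider with scrapy's built-in feed exports, the yielded dicts are all collected into one output file, e.g.:

scrapy crawl superspider -o prefixes.json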
Note that I am using the meta request parameter to pass along, to the next callback method, which prefix was used.
And how do I store response.meta['prefix']: True in some variable, let's say a variable myprefixes, in the start_requests function? Where is your response.meta yielded when you call the parse function? – dorinand, Aug 31 at 19:36
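One way to do that is to collect the results on the spider instance itself and dump them once the crawl ends. A minimal sketch (myprefixes is the assumed attribute name from the comment above; closed is scrapy's documented shortcut for the spider_closed signal):

import json
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.myprefixes = {}  # assumed name; collects prefix -> reachable?

    def start_requests(self):
        url = 'page1.com'  # assumed domain, as in the answer above
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request('{}{}'.format(prefix, url), callback=self.parse,
                                 dont_filter=True, meta={'prefix': prefix})

    def parse(self, response):
        # record the result instead of (or in addition to) yielding it
        self.myprefixes[response.meta['prefix']] = response.status == 200

    def closed(self, reason):
        # runs once when the spider finishes; dump the collected dict
        with open('myprefixes.json', 'w') as f:
            json.dump(self.myprefixes, f)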