How to check if a website supports http, https and the www prefix with scrapy



I am using scrapy to check whether a website works correctly when accessed as http://example.com, https://example.com, or http://www.example.com. When I create a scrapy request, it works fine. For example, my page1.com is always redirected to https://. I need to get this information as a return value. Or is there a better way to get this information using scrapy?




import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'

    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # strip the scheme and any www. prefix to get the bare domain
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')

        # try again with every scheme/prefix combination
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)



The output of this spider is this:


https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200



This is nice, but I would like to get this information as a return value, so that I know, for example, that http returned response code 200, and then save it to a dictionary for later processing or save it as JSON to a file (using items in scrapy).





DESIRED OUTPUT:
I would like to have a dictionary named a with all the information:

print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}



Later I would like to scrape more information, so I will need to store all information under one object/json/...
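One way to get there is to yield one plain dict per prefix and let scrapy's feed exports write the JSON file. A minimal sketch (the spider name and the field names prefix/works are illustrative, not from the original post):

import scrapy

class prefixspider(scrapy.Spider):
    name = 'prefixspider'

    def start_requests(self):
        domain = 'page1.com'  # bare domain to test (illustrative placeholder)
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(
                url='{}{}'.format(prefix, domain),
                callback=self.parse,
                dont_filter=True,
                meta={'prefix': prefix},
            )

    def parse(self, response):
        # note: by default only 2xx responses reach this callback;
        # dns failures and the like would need an errback
        yield {'prefix': response.meta['prefix'], 'works': response.status == 200}

Running it with scrapy crawl prefixspider -o a.json then collects all yielded dicts into one JSON file.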





Would you be able to update the desired output?
– Joseph Seung Jae Dollar
Aug 30 at 20:32





I inserted the desired output. Thank you.
– dorinand
Aug 30 at 20:39




2 Answers



Instead of using the meta approach which was pointed out by eLRuLL, you can parse request.url:


scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'

In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
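The same check works inside a spider callback: scrapy's RedirectMiddleware records the redirect chain in the redirect_urls meta key, whose first entry is the originally requested url. A minimal sketch (the spider name is illustrative; this assumes redirects are enabled, which is the default):

import scrapy

class redirectcheck(scrapy.Spider):
    name = 'redirectcheck'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        # first entry of the redirect chain, or the final url if no redirect happened
        original = response.meta.get('redirect_urls', [response.url])[0]
        if original != response.url:
            self.logger.info('%s was redirected to %s', original, response.url)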



To store the values from the different requests together in one dict/json you can use an additional item pipeline, as described in https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter.
So you would have something like:


class WriteAllRequests(object):
    def __init__(self):
        self.urldic = {}

    def process_item(self, item, spider):
        # collect the status per prefix, grouped by the bare url
        self.urldic.setdefault(item['url'], {})[item['urlprefix']] = item['urlstatus']
        if len(self.urldic[item['url']]) == 4:
            # all four prefix variants answered; this could be passed on to a
            # standard pipeline with a higher order number
            # (writedata is a placeholder for whatever writes the combined result out)
            writedata(self.urldic[item['url']])
            del self.urldic[item['url']]
        return item



You must additionally activate the pipeline in your project settings.
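For example in settings.py (the module path myproject.pipelines is a placeholder for wherever the class actually lives; the number sets the pipeline order):

ITEM_PIPELINES = {
    'myproject.pipelines.WriteAllRequests': 100,
}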





As I mentioned above, I know how to check and print it. The question is how to get a return value from the function and store it, e.g. as a dictionary. When I get to the callback function, code can be processed there, but it does not return anything. One option is to store it somewhere else, e.g. in redis, and read and save it after returning from the callback. That could work, but in my opinion it is not the best way to solve it.
– dorinand
Sep 1 at 13:13







Ok, your issue is that you want to store the values of 4 different runs in one dict/json ... so I changed my answer
– Thomas Strub
Sep 3 at 7:28




You are doing one extra request at the beginning of the spider; you could deal with all those urls directly in the start_requests method:




import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'

    def start_requests(self):
        # start_requests has no response object, so the url is given directly
        url = 'https://page1.com/'

        # strip the scheme and any www. prefix to get the bare domain
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')

        # try every scheme/prefix combination
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(
                url='{}{}'.format(prefix, url),
                callback=self.parse,
                dont_filter=True,
                meta={'prefix': prefix},
            )

    def parse(self, response):
        yield {response.meta['prefix']: True}



Note that I am using the meta request parameter to pass the information about which prefix was used on to the next callback method.
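To gather the four results into one dictionary instead of yielding them separately, the spider can keep the dict on the instance and write it out when it finishes. A sketch (the results attribute, the output filename, and the use of the closed hook are additions for illustration, not part of the original answer):

import json
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.results = {}  # e.g. {'https://': True, ...}

    def start_requests(self):
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(
                url='{}{}'.format(prefix, 'page1.com'),
                callback=self.parse,
                dont_filter=True,
                meta={'prefix': prefix},
            )

    def parse(self, response):
        # record the result instead of yielding an item
        self.results[response.meta['prefix']] = response.status == 200

    def closed(self, reason):
        # called once when the spider finishes; dump everything in one go
        with open('results.json', 'w') as f:
            json.dump(self.results, f)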







And how do I store response.meta['prefix']: True in some variable, let's say myprefixes, in the start_requests function? Where is your response.meta yielded when you call the parse function?
– dorinand
Aug 31 at 19:36




