How to check if a website supports http, https and the www prefix with scrapy
I am using scrapy to check whether a website works fine when I use http://example.com, https://example.com or http://www.example.com. When I create a scrapy request, it works fine; for example, on my page1.com, it is always redirected to https://. I need to get this information as a return value. Or is there a better way to get this information using scrapy?
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'
    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')
        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)
The output of this spider is this:
https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200
This is nice, but I would like to get this information as a return value, to know that e.g. on http the response code is 200, and then save it to a dictionary for later processing, or save it as JSON to a file (using items in scrapy).
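For illustration, a minimal scrapy Item that such a record could use (the field names url, urlprefix and urlstatus are assumptions; they match the pipeline sketch in the first answer below):

import scrapy

# hypothetical item holding the result of one prefix check
class UrlCheckItem(scrapy.Item):
    url = scrapy.Field()        # bare domain, e.g. 'page1.com'
    urlprefix = scrapy.Field()  # one of 'http://', 'https://www.', ...
    urlstatus = scrapy.Field()  # HTTP status code of the response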
DESIRED OUTPUT:
I would like to have a dictionary named a with all the information:

print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}

Later I would like to scrape more information, so I will need to store all the information under one object/json/...
Would you be able to update a desired output? – Joseph Seung Jae Dollar, Aug 30 at 20:32
I inserted desired output. Thank you. – dorinand, Aug 30 at 20:39
2 Answers
Instead of using the meta possibility that eLRuLL pointed out, you can parse request.url:
scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'
In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
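The same pair is available inside a spider callback, so a minimal sketch of a callback that records the requested versus final URL could look like this (the output field names are assumptions):

def parse(self, response):
    # response.request.url is the URL that was requested,
    # response.url is the URL after any redirects
    yield {
        'requested': response.request.url,  # e.g. 'http://stackoverflow.com'
        'final': response.url,              # e.g. 'https://stackoverflow.com/'
        'status': response.status,
    }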
To store the values from the different runs together in one dict/JSON, you can use an additional pipeline, as mentioned in https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
So you have something like:
Class WriteAllRequests(object):
def __init__(self):
self.urldic=
def process_item(self, item, spider):
urldic[item.url]=item.urlprefix=item.urlstatus
if len(urldic[item.url])==4:
# think this can be passed to a standard pipeline with a higher number
writedata (urldic[item.url])
del urldic[item.url]
You must additionally activate the pipeline.
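A minimal sketch of the activation in settings.py (the myproject.pipelines module path is an assumption; the number orders this pipeline among any others):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.WriteAllRequests': 300,
}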
As I mentioned above, I know how to check and print it; the question is how to get a return value from the function and store it, e.g. as a dictionary. When I get to the callback function, code can be processed there, but it does not return anything. One option is to store it somewhere, e.g. in redis, and read and save it after returning from the callback. It could work, but in my opinion it is not the best way to solve it. – dorinand, Sep 1 at 13:13
Ok, your issue is that you want to store the values of 4 different runs in one dict/json ... so I changed my answer. – Thomas Strub, Sep 3 at 7:28
You are doing one extra request at the beginning of the spider; you could deal with all those domains directly in the start_requests method:
class myspider(scrapy.Spider):
    name = 'superspider'

    def start_requests(self):
        url = 'https://page1.com/'  # the site to check (response.url is not available here)
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')
        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(
                url='{}{}'.format(prefix, url),
                callback=self.parse,
                dont_filter=True,
                meta={'prefix': prefix},
            )

    def parse(self, response):
        yield {response.meta['prefix']: True}
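If you run the spider with scrapy's built-in feed exports, the yielded dicts are all collected into one output file, e.g.:

scrapy crawl superspider -o prefixes.json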
Note that I am using the meta request parameter to pass along, to the next callback method, which prefix was used.
And how do I store response.meta['prefix']: True in some variable, let's say a variable myprefixes, in the start_requests function? Where is your response.meta yielded when you call the parse function? – dorinand, Aug 31 at 19:36
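One way to do that is to collect the results on the spider instance itself and dump them once the crawl ends. A minimal sketch (myprefixes is the assumed attribute name from the comment above; closed is scrapy's documented shortcut for the spider_closed signal):

import json
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.myprefixes = {}  # assumed name; collects prefix -> reachable?

    def start_requests(self):
        url = 'page1.com'  # assumed domain, as in the answer above
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request('{}{}'.format(prefix, url), callback=self.parse,
                                 dont_filter=True, meta={'prefix': prefix})

    def parse(self, response):
        # record the result instead of (or in addition to) yielding it
        self.myprefixes[response.meta['prefix']] = response.status == 200

    def closed(self, reason):
        # runs once when the spider finishes; dump the collected dict
        with open('myprefixes.json', 'w') as f:
            json.dump(self.myprefixes, f)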