Downloading xls/csv files using urlretrieve from Python stops
I'm trying to download a bunch of xls files from this ASPX site and its folders, using urlretrieve from the urllib.request module in Python 3.7. First, I build a txt file with the urls scraped from the site. Then, I loop over that list and ask the server to retrieve each xls file, following this solution here.
The script starts downloading the xls files into the working directory, but after 3 or 4 iterations it breaks. The downloaded files (3 or 4 of them) all have the wrong size (7351 KB each, instead of, say, 99 KB or 83 KB). Surprisingly, that is the size of the last url in the txt file.
Sometimes the log also reports a 500 error.
For the last issue, my hypotheses/questions are:
The error is raised by a firewall that blocks repeated calls to the server.
Maybe the calls break some synchronous/asynchronous rule I'm not aware of. I added a time.sleep to avoid the error, but it didn't help.
The first issue is the really strange one, and it seems tied to the second.
Here is my code:
import os
import time
from random import randint
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve, quote

url = "http://informacioninteligente10.xm.com.co/transacciones/Paginas/HistoricoTransacciones.aspx"
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

direcciones = []  # to be populated with urls
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    #try:
    #    urlretrieve(href, filename)
    #except:
    #    print('Downloading Error')
    if any(href.endswith(x) for x in ['.xls', '.xlsx', '.csv']):
        direcciones.append(href)

# "\n" adds a new line
direcciones = '\n'.join(direcciones)

# Save every element in a txt file
with open("file.txt", "w") as output:
    output.write(direcciones)

DOWNLOADS_DIR = os.getcwd()

# For every line in the file
for url in open("file.txt"):
    time.sleep(randint(0, 5))
    # Split on the rightmost / and take everything on the right side of that
    name = url.rsplit('/', 1)[-1]
    # Combine the name and the downloads directory to get the local filename
    filename = os.path.join(DOWNLOADS_DIR, name)
    filename = filename[:-1]  # strip the trailing newline
    # Download the file if it does not exist
    if not os.path.isfile(filename):
        urlretrieve(href, filename)
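On the throttling hypothesis above, a minimal sketch of a retry-with-backoff wrapper around urlretrieve (the helper retrieve_with_retries and its parameters are purely illustrative and not part of the original script):

import time
from urllib.error import HTTPError
from urllib.request import urlretrieve

def retrieve_with_retries(file_url, destination, attempts=3):
    # Try the download a few times, backing off between failed attempts
    for attempt in range(attempts):
        try:
            return urlretrieve(file_url, destination)
        except HTTPError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...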
Am I not using the correct URL parser?
Any ideas? Thanks!
python-3.x beautifulsoup web-crawler http-status-code-500 urlretrieve
asked Nov 8 at 20:17
jairochoa
1 Answer
The site has anti-bot protection; you need to set a browser user agent instead of the default Python user agent:

......
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)

url = ....

You also have to replace href with url in

if not os.path.isfile(filename):
    urlretrieve(href, filename)  # must be: url
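Putting the two fixes together, a minimal sketch of the corrected download loop (the opener setup is from this answer; the rest simply reuses the variables from the question's script and is only illustrative):

import os
import urllib.request
from urllib.request import urlretrieve

# Install an opener with a browser User-Agent so every urlretrieve call uses it
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)

DOWNLOADS_DIR = os.getcwd()
with open("file.txt") as urls:
    for line in urls:
        url = line.strip()                     # drop the trailing newline
        name = url.rsplit('/', 1)[-1]          # file name = last path segment
        filename = os.path.join(DOWNLOADS_DIR, name)
        if not os.path.isfile(filename):
            urlretrieve(url, filename)         # note: url, not href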
edited Nov 9 at 9:33
answered Nov 9 at 8:59
ewwink
I made the changes, and it worked smoothly for the urls I have. Thanks a lot!!! Kudos @ewwink
– jairochoa
Nov 9 at 16:18
You're welcome, and you can accept this answer.
– ewwink
Nov 9 at 16:24