Downloading xls/csv files using urlretrieve from Python stops
I'm trying to download a bunch of xls files from this ASPX site and its folders, using urlretrieve from the urllib.request module in Python 3.7. First, I build a txt file with the urls scraped from the site. Then, I loop over that list and ask the server to retrieve each xls file, following this solution here.
The script starts downloading the xls files into the working directory, but after 3 or 4 iterations it breaks. The downloaded files (3 or 4 of them) all have the wrong size (7351 KB each, instead of, say, 99 KB or 83 KB). Surprisingly, that is the size of the last url in the txt file.
Sometimes the log also reports a 500 error.
For the last issue, my hypotheses/questions are:
The error is raised by a firewall that blocks repeated calls to the server.
Maybe the calls break some synchronous/asynchronous rule I'm not aware of. I added a time.sleep to avoid the error, but it didn't help.
The first issue is the really strange one, and it seems tied to the second.
Here is my code:
import os
import time
from random import randint
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve, quote

url = "http://informacioninteligente10.xm.com.co/transacciones/Paginas/HistoricoTransacciones.aspx"
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

direcciones = []  # to be populated with urls
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    #try:
    #    urlretrieve(href, filename)
    #except:
    #    print('Downloading Error')
    if any(href.endswith(x) for x in ['.xls', '.xlsx', '.csv']):
        direcciones.append(href)

# "\n" adds a new line
direcciones = '\n'.join(direcciones)

# Save every element in a txt file
with open("file.txt", "w") as output:
    output.write(direcciones)

DOWNLOADS_DIR = os.getcwd()

# For every line in the file
for url in open("file.txt"):
    time.sleep(randint(0, 5))
    # Split on the rightmost / and take everything on the right side of that
    name = url.rsplit('/', 1)[-1]
    # Combine the name and the downloads directory to get the local filename
    filename = os.path.join(DOWNLOADS_DIR, name)
    filename = filename[:-1]  # strip the trailing newline
    # Download the file if it does not exist
    if not os.path.isfile(filename):
        urlretrieve(href, filename)
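On the throttling hypothesis above, a minimal sketch of a retry-with-backoff wrapper around urlretrieve (the helper retrieve_with_retries and its parameters are purely illustrative and not part of the original script):

import time
from urllib.error import HTTPError
from urllib.request import urlretrieve

def retrieve_with_retries(file_url, destination, attempts=3):
    # Try the download a few times, backing off between failed attempts
    for attempt in range(attempts):
        try:
            return urlretrieve(file_url, destination)
        except HTTPError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...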
Am I not using the correct URL parser?
Any ideas? Thanks!
python-3.x beautifulsoup web-crawler http-status-code-500 urlretrieve
asked Nov 8 at 20:17
jairochoa
1 Answer
The site has anti-bot protection; you need to set a browser user agent instead of the default Python user agent:

......
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)

url = ....

You also have to replace href with url in

if not os.path.isfile(filename):
    urlretrieve(href, filename)  # must be: url
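Putting the two fixes together, a minimal sketch of the corrected download loop (the opener setup is from this answer; the rest simply reuses the variables from the question's script and is only illustrative):

import os
import urllib.request
from urllib.request import urlretrieve

# Install an opener with a browser User-Agent so every urlretrieve call uses it
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)

DOWNLOADS_DIR = os.getcwd()
with open("file.txt") as urls:
    for line in urls:
        url = line.strip()                     # drop the trailing newline
        name = url.rsplit('/', 1)[-1]          # file name = last path segment
        filename = os.path.join(DOWNLOADS_DIR, name)
        if not os.path.isfile(filename):
            urlretrieve(url, filename)         # note: url, not href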
edited Nov 9 at 9:33
answered Nov 9 at 8:59
ewwink
I made the changes, and it worked smoothly for the urls I have. Thanks a lot!!! Kudos @ewwink
– jairochoa
Nov 9 at 16:18
You're welcome, and you can accept this answer.
– ewwink
Nov 9 at 16:24