Downloading xls/csv files using urlretrieve from Python stops

I'm trying to download a bunch of xls files from this ASPX site and its folders using urlretrieve from the urllib.request module in Python 3.7. First, I build a txt file with the urls from the site. Then, I loop over the list and ask the server to retrieve the xls file, following this solution here.



The script starts downloading the xls files into the working directory, but after 3 or 4 iterations it breaks. The files that were downloaded (3 or 4 of them) have the wrong size: all of them are 7351 KB instead of, for example, 99 KB or 83 KB. Surprisingly, 7351 KB is the size of the file behind the last url in the txt file.



Sometimes the log also shows an HTTP 500 error.



My hypotheses/questions about the error are:



  1. The error is raised by a firewall that blocks repeated calls to the server.


  2. Maybe the calls break some synchronous/asynchronous rule unknown to me. I added a time.sleep to prevent the error, but it did not help.


The first issue (the wrong file sizes) is the strangest one, and it seems tied to the second.



Here is my code:



import os
import time
from random import randint
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve, quote


url = "http://informacioninteligente10.xm.com.co/transacciones/Paginas/HistoricoTransacciones.aspx"
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

direcciones = []  # to be populated with urls

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]

    href = urljoin(url, quote(href))
    #try:
    #    urlretrieve(href, filename)
    #except:
    #    print('Downloading Error')

    if any(href.endswith(x) for x in ['.xls', '.xlsx', '.csv']):
        direcciones.append(href)

# "\n" adds a new line
direcciones = '\n'.join(direcciones)

# Save every element in a txt file
with open("file.txt", "w") as output:
    output.write(direcciones)


DOWNLOADS_DIR = os.getcwd()

# For every line in the file
for url in open("file.txt"):
    time.sleep(randint(0, 5))

    # Split on the rightmost / and take everything on the right side of that
    name = url.rsplit('/', 1)[-1]

    # Combine the name and the downloads directory to get the local filename
    filename = os.path.join(DOWNLOADS_DIR, name)
    filename = filename[:-1]  # strip the trailing whitespace

    # Download the file if it does not exist
    if not os.path.isfile(filename):
        urlretrieve(href, filename)


Am I not using the correct url parser?



Any ideas? Thanks!

python-3.x beautifulsoup web-crawler http-status-code-500 urlretrieve

asked Nov 8 at 20:17 – jairochoa


1 Answer

The site has anti-bot protection; you need to set a browser User-Agent instead of the default Python user agent:



......
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)

url=....


and you have to replace href with url in:



if not os.path.isfile(filename):
    urlretrieve(href, filename)  # must be: url
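
Because install_opener replaces the global opener, later urlretrieve calls also send the custom User-Agent. Putting the two changes together, a minimal sketch of the corrected download loop could look like the following (assuming file.txt already holds one URL per line; the Firefox User-Agent string and the try/except around urlretrieve are just examples, not the only way to do it):

import os
import time
from random import randint
import urllib.request
from urllib.request import urlretrieve

# Pretend to be a regular browser so the server stops answering with HTTP 500
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)

DOWNLOADS_DIR = os.getcwd()

with open("file.txt") as urls:
    for url in urls:
        url = url.strip()            # drop the trailing newline
        if not url:
            continue
        time.sleep(randint(0, 5))    # small random pause between requests

        # Local filename = everything after the rightmost /
        name = url.rsplit('/', 1)[-1]
        filename = os.path.join(DOWNLOADS_DIR, name)

        # Download the file only if it does not exist yet
        if not os.path.isfile(filename):
            try:
                urlretrieve(url, filename)   # url, not href
            except Exception as e:
                print('Downloading error for', url, ':', e)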





answered Nov 9 at 8:59 (edited Nov 9 at 9:33) – ewwink

• I made the changes, and it worked smoothly for the urls I have. Thanks a lot!!! Kudos @ewwink – jairochoa, Nov 9 at 16:18









• you're welcome, and you can accept this answer. – ewwink, Nov 9 at 16:24















