
How do I get a real file url in python 2.7?

Dmitri
1#
Dmitri Published in 2018-01-12 13:17:47Z

I have a URL, http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip, which "redirects" me to http://images.vbb.de/assets/ftp/file/286316.zip. I put "redirects" in quotes because Python says there is no redirect:

    In [51]: response = requests.get('http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
        ...: if response.history:
        ...:     print "Request was redirected"
        ...:     for resp in response.history:
        ...:         print resp.status_code, resp.url
        ...:     print "Final destination:"
        ...:     print response.status_code, response.url
        ...: else:
        ...:     print "Request was not redirected"
        ...:     
    Request was not redirected

The status code is also 200, response.history is empty, and response.url gives the first URL rather than the real one. The real URL is visible in Firefox under Developer Tools -> Network, though. How do I get it in Python 2.7? Thanks in advance!

Martin Evans
2#
Martin Evans Reply to 2018-01-12 15:15:02Z

You first need to carry out the redirect manually by parsing the new window.location.href out of the first returned HTML. Requesting that URL then produces a 301 reply, and the name of the target file is contained in the Location header that is returned:

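import requests
import re

base_url = 'http://www.vbb.de'
# The first response is an HTML page that redirects via JavaScript
response = requests.get(base_url + '/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
# Pull the relative redirect target out of the script (raw string, dots escaped)
manual_redirect = base_url + re.findall(r'window\.location\.href\s+=\s+"(.*?)"', response.text)[0]
# This request is answered with a 301, which requests follows automatically
response = requests.get(manual_redirect, stream=True)
# The Location header of the 301 holds the real file URL
target_filename = response.history[0].headers['Location'].split('/')[-1]

print "Downloading: '{}'".format(target_filename)
with open(target_filename, 'wb') as f_zip:
    for chunk in response.iter_content(chunk_size=1024):
        f_zip.write(chunk)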

This would display:

Downloading: '286316.zip'

and result in a 29,464,299 byte zip file being created.
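
As a rough sketch of the manual-redirect step above, the extraction can be wrapped in a small helper that returns None when the page contains no script redirect, instead of failing on an empty findall result. The name find_js_redirect and the sample HTML are made up for illustration, and the real markup returned by vbb.de may differ. It uses only the standard library and should run on both Python 2.7 and 3:

```python
import re

try:
    from urlparse import urljoin  # Python 2.7
except ImportError:
    from urllib.parse import urljoin  # Python 3

# Matches: window.location.href = "/some/path" (single or double quotes)
_JS_REDIRECT = re.compile(r"""window\.location\.href\s*=\s*["'](.*?)["']""")


def find_js_redirect(page_url, html):
    """Return the absolute URL a script redirect points to, or None."""
    match = _JS_REDIRECT.search(html)
    if match is None:
        return None
    # urljoin resolves a relative target against the URL of the page itself
    return urljoin(page_url, match.group(1))


# Hypothetical page body, only for demonstration
sample_html = '<script>window.location.href = "/de/download/example.zip";</script>'
print(find_js_redirect('http://www.vbb.de/de/datei/example.zip', sample_html))
```

Using urljoin rather than string concatenation means the helper also copes with a redirect target that is already an absolute URL.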

Jonathon McMurray
3#
Jonathon McMurray Reply to 2018-01-12 14:58:44Z

You can use BeautifulSoup to read the meta refresh tag in the head of the HTML page and get the redirect URL, e.g.:

>>> import shutil
>>> import requests
>>> from bs4 import BeautifulSoup
>>> a = requests.get("http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip")
>>> soup = BeautifulSoup(a.text, 'html.parser')
>>> soup.find_all('meta', attrs={'http-equiv': lambda x: x and x.lower() == 'refresh'})[0]['content'].split('URL=')[1]
'/de/download/GTFS_VBB_Nov2015_Dez2016.zip'

This URL is relative to the original URL's domain, making the new URL http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip. Requesting it downloads the ZIP file for me:

>>> a = requests.get("http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip", stream=True)
>>> with open('test.zip', 'wb') as f:
...     a.raw.decode_content = True
...     shutil.copyfileobj(a.raw, f)
...

 $ unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     5554  2015-11-20 15:17   agency.txt
  2151517  2015-11-20 15:17   calendar_dates.txt
    71731  2015-11-20 15:17   calendar.txt
    65424  2015-11-20 15:17   routes.txt
   816498  2015-11-20 15:17   stops.txt
196020096  2015-11-20 15:17   stop_times.txt
   365499  2015-11-20 15:17   transfers.txt
 11765292  2015-11-20 15:17   trips.txt
      113  2015-11-20 15:17   logging
---------                     -------
211261724                     9 files

It is this second request, to the /de/download/ URL, that returns the 301 status:

>>> a.history
[<Response [301]>]
>>> a
<Response [200]>
>>> a.history[0]
<Response [301]>
>>> a.history[0].url
'http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
>>> a.url
'http://images.vbb.de/assets/ftp/file/286316.zip'
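
If BeautifulSoup is not available, the same meta refresh lookup can be approximated with the standard library's HTMLParser. This is only a sketch of the idea, not what the answer above ran: the names MetaRefreshFinder and find_meta_refresh and the sample HTML are invented for illustration, and it assumes the content attribute has the usual "delay; URL=target" shape. It should work on both Python 2.7 and 3:

```python
try:
    from HTMLParser import HTMLParser  # Python 2.7
    from urlparse import urljoin
except ImportError:
    from html.parser import HTMLParser  # Python 3
    from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get('http-equiv') or '').lower() != 'refresh':
            return
        # content looks like "0; URL=/some/path"; drop the delay part
        content = attrs.get('content') or ''
        _, _, rest = content.partition(';')
        key, _, value = rest.strip().partition('=')
        if key.strip().lower() == 'url':
            self.target = value.strip().strip('\'"')


def find_meta_refresh(page_url, html):
    """Return the absolute meta refresh target, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.target is None:
        return None
    # Resolve a relative target against the URL of the page itself
    return urljoin(page_url, finder.target)


# Hypothetical page, only for demonstration
sample_page = ('<html><head><meta http-equiv="Refresh" '
               'content="0; URL=/de/download/example.zip"></head></html>')
print(find_meta_refresh('http://www.vbb.de/de/datei/example.zip', sample_page))
```

The returned URL can then be passed to requests.get(..., stream=True) as shown above, and response.url will hold the final file location once the 301 has been followed.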