How can I download files with Python in a smarter way?
I need to download several files with Python.
The most obvious way to do it is to use urllib2:
    import urllib2

    u = urllib2.urlopen('http://server.com/file.html')
    localFile = open('file.html', 'w')
    localFile.write(u.read())
    localFile.close()
But I will have to deal with URLs that are nasty in some way, for example: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf
When downloaded through a browser, the file has a human-readable name, e.g. accounts.pdf.
Is there any way to handle this in Python, so that I don't need to know the file names and hardcode them into my script?
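Download scripts like that tend to push a header that tells the user agent what to name the file:
    Content-Disposition: attachment; filename="the filename.ext"
If you can grab that header, you can get the proper file name.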
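As a minimal sketch of reading that header value, assuming Python 3 rather than the Python 2 `urllib2` used elsewhere in this thread: the standard library's `email.message.Message` can parse the `filename` parameter, including the quoting. The helper name `filename_from_header` is hypothetical.

```python
from email.message import Message


def filename_from_header(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    msg = Message()
    # Let the stdlib header parser handle quoting and parameter splitting.
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


print(filename_from_header('attachment; filename="the filename.ext"'))
```

With a real Python 3 response from `urllib.request.urlopen`, the `headers` attribute is itself a message object of this kind, so calling `get_filename()` on it directly should work the same way.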
Another thread has a bit of code for grabbing Content-Disposition:
    import urllib2

    remotefile = urllib2.urlopen('http://example.com/somefile.zip')
    remotefile.info()['Content-Disposition']
Based on the comments and @Oli's answer, I came up with this solution:
    import urllib2
    from os.path import basename
    from urlparse import urlsplit

    def url2name(url):
        return basename(urlsplit(url)[2])

    def download(url, localFileName = None):
        localName = url2name(url)
        req = urllib2.Request(url)
        r = urllib2.urlopen(req)
        if r.info().has_key('Content-Disposition'):
            # If the response has Content-Disposition, we take file name from it
            localName = r.info()['Content-Disposition'].split('filename=')[1]
            if localName[0] == '"' or localName[0] == "'":
                localName = localName[1:-1]
        elif r.url != url:
            # if we were redirected, the real file name we take from the final URL
            localName = url2name(r.url)
        if localFileName:
            # we can force to save the file as specified name
            localName = localFileName
        f = open(localName, 'wb')
        f.write(r.read())
        f.close()
It takes the file name from Content-Disposition; if that header is absent, it uses the file name from the URL (the final URL, if a redirect occurred).
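The priority order described here can be sketched as a pure function that needs no network access. This is an illustration under assumptions, not the answer's own code: it targets Python 3 (`urllib.parse` instead of `urlparse`), the name `choose_name` and its parameters are hypothetical, and dropping the directory part of a server-supplied name is an added precaution that the original does not take.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url2name(url):
    """Last path segment of a URL, ignoring the query string and fragment."""
    return posixpath.basename(unquote(urlsplit(url).path))


def choose_name(final_url, header_name=None, forced_name=None):
    """Pick a local file name: forced name, then header name, then URL."""
    if forced_name:
        # The caller explicitly asked for this name.
        return forced_name
    if header_name:
        # Keep only the last path component of a server-supplied name,
        # so it cannot point outside the current directory.
        safe = posixpath.basename(header_name.replace("\\", "/"))
        if safe:
            return safe
    # Fall back to the URL the response finally came from (after redirects).
    return url2name(final_url)


print(choose_name("http://server.com/files/accounts.pdf?id=121"))
```

Here `final_url` stands for the response's final URL (`r.url` in the code above), and `header_name` for whatever was extracted from Content-Disposition.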
Combining much of the above, here is a more Pythonic solution:
    import urllib2
    import shutil
    import urlparse
    import os

    def download(url, fileName=None):
        def getFileName(url,openUrl):
            if 'Content-Disposition' in openUrl.info():
                # If the response has Content-Disposition, try to get filename from it
                # (split on the first '=' only, so a name containing '=' still parses)
                cd = dict(map(
                    lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(),''),
                    openUrl.info()['Content-Disposition'].split(';')))
                if 'filename' in cd:
                    filename = cd['filename'].strip("\"'")
                    if filename: return filename
            # if no filename was found above, parse it out of the final URL.
            return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

        r = urllib2.urlopen(urllib2.Request(url))
        try:
            fileName = fileName or getFileName(url,r)
            with open(fileName, 'wb') as f:
                shutil.copyfileobj(r,f)
        finally:
            r.close()
To Kender:
    if localName[0] == '"' or localName[0] == "'":
        localName = localName[1:-1]
This is not safe: a web server can send a malformed name such as ["file.ext] or [file.ext'], or even an empty one, in which case localName[0] raises an exception. Correct code could look like this:
    localName = localName.replace('"', '').replace("'", "")
    if localName == '':
        localName = SOME_DEFAULT_FILE_NAME
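To illustrate the malformed cases mentioned above, here is a small sketch that can be run without a server. The helper `clean_name` and the fallback `DEFAULT_FILE_NAME` are hypothetical names. It uses `strip` instead of `replace`, which is a deliberate variation: `strip` removes quotes only at the ends, so an apostrophe inside a name survives.

```python
DEFAULT_FILE_NAME = "download.bin"  # hypothetical fallback name


def clean_name(raw, default=DEFAULT_FILE_NAME):
    """Strip stray quotes and whitespace; fall back when nothing is left."""
    name = (raw or "").strip().strip("\"'")
    return name or default


# Well-formed, unbalanced, empty, and missing names are all handled
# without indexing into the string.
for raw in ['"file.ext"', '"file.ext', "file.ext'", "", None]:
    print(repr(raw), "->", clean_name(raw))
```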
Using wget (the third-party wget package for Python):
    import wget

    custom_file_name = "/custom/path/custom_name.ext"
    wget.download(url, custom_file_name)
Using urlretrieve:
    import urllib

    urllib.urlretrieve(url, custom_file_name)
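Note that urlretrieve opens the target path directly and does not create missing directories, so the directory structure must already exist.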
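To avoid depending on whether the download helper creates directories, the parent directory can be created explicitly before downloading. A minimal sketch, assuming Python 3 (where the function lives at `urllib.request.urlretrieve`); the wrapper name `retrieve_to` is hypothetical.

```python
import os
from urllib.request import urlretrieve


def retrieve_to(url, target_path):
    """Download url to target_path, creating parent directories first."""
    parent = os.path.dirname(target_path)
    if parent:
        # exist_ok avoids an error when the directory is already there.
        os.makedirs(parent, exist_ok=True)
    saved_path, _headers = urlretrieve(url, target_path)
    return saved_path
```

For example, `retrieve_to(url, "/custom/path/custom_name.ext")` would create `/custom/path` if needed and then save the file there.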