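Downloading a file using Selenium

I'm working with Python and Selenium. I want to download a file by triggering a click event with Selenium. I wrote the following code.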

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.keys import Keys

    browser = webdriver.Firefox()
    browser.get("http://www.drugcite.com/?q=ACTIMMUNE")

    browser.close()

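I want to download both files from the links named "Export Data" on the given URL. How can I achieve this, given that the download only works through a click event?

Thanks

Find the links using find_element(s)_by_*, then call the click method.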

    from selenium import webdriver

    # To prevent download dialog
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)  # custom location
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', '/tmp')
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')

    browser = webdriver.Firefox(profile)
    browser.get("http://www.drugcite.com/?q=ACTIMMUNE")

    browser.find_element_by_id('exportpt').click()
    browser.find_element_by_id('exporthlgt').click()

Added profile manipulation code to prevent the download dialog.
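The snippet above uses the older Selenium API. Recent Selenium 4 releases removed the `find_element_by_*` helpers in favour of `find_element(By.ID, ...)`, and Firefox preferences can be set on an `Options` object instead of a profile. A rough equivalent might look like the sketch below. The preferences and element IDs come from the answer; the names `DOWNLOAD_PREFS` and `download_exports` are made up for illustration, and this has not been run against the site.

```python
# Preferences taken from the answer: save text/csv files to a fixed
# folder without showing the download dialog.
DOWNLOAD_PREFS = {
    "browser.download.folderList": 2,  # 2 = use the custom directory below
    "browser.download.manager.showWhenStarting": False,
    "browser.download.dir": "/tmp",
    "browser.helperApps.neverAsk.saveToDisk": "text/csv",
}


def download_exports(url="http://www.drugcite.com/?q=ACTIMMUNE"):
    """Open the page and click both export links (hypothetical helper).

    Returns the browser so the caller can quit it once the downloads
    have finished.
    """
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for name, value in DOWNLOAD_PREFS.items():
        options.set_preference(name, value)

    browser = webdriver.Firefox(options=options)
    browser.get(url)
    browser.find_element(By.ID, "exportpt").click()
    browser.find_element(By.ID, "exporthlgt").click()
    return browser
```

Call `download_exports()` and then `browser.quit()` on the returned object after the files have appeared in the download directory.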

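I'll admit this solution is a bit more "hacky" than the Firefox profile saveToDisk approach, but it works in both Chrome and Firefox and doesn't rely on a browser-specific feature that could change at any time. If nothing else, maybe it gives someone a slightly different perspective on how to solve future challenges.

Prerequisites: make sure you have selenium and pyvirtualdisplay installed...

Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay

The magic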

    import pyvirtualdisplay
    import selenium
    import selenium.webdriver
    import time
    import base64
    import json

    root_url = 'https://www.google.com'
    download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'

    print('Opening virtual display')
    display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
    display.start()
    print('\tDone')

    print('Opening web browser')
    driver = selenium.webdriver.Firefox()
    #driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
    print('\tDone')

    print('Retrieving initial web page')
    driver.get(root_url)
    print('\tDone')

    print('Injecting retrieval code into web page')
    driver.execute_script("""
        window.file_contents = null;
        var xhr = new XMLHttpRequest();
        xhr.responseType = 'blob';
        xhr.onload = function() {
            var reader = new FileReader();
            reader.onloadend = function() {
                window.file_contents = reader.result;
            };
            reader.readAsDataURL(xhr.response);
        };
        xhr.open('GET', %(download_url)s);
        xhr.send();
    """.replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
        'download_url': json.dumps(download_url),
    })

    print('Looping until file is retrieved')
    downloaded_file = None
    while downloaded_file is None:
        # Returns the file retrieved base64 encoded (perfect for downloading binary)
        downloaded_file = driver.execute_script('return (window.file_contents !== null ? window.file_contents.split(\',\')[1] : null);')
        print(downloaded_file)
        if not downloaded_file:
            print('\tNot downloaded, waiting...')
            time.sleep(0.5)
    print('\tDone')

    print('Writing file to disk')
    fp = open('google-logo.png', 'wb')
    fp.write(base64.b64decode(downloaded_file))
    fp.close()
    print('\tDone')

    driver.close()  # close web browser, or it'll persist after python exits.
    display.stop()  # close virtual display, or it'll persist after python exits.

Explanation

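We first load a URL on the domain we're downloading the file from. This lets us perform an AJAX request against that domain without running into cross-origin (same-origin policy) restrictions.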
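Browsers treat two URLs as the same origin only when scheme, host and port all match, so it can help to check that the page you load and the file you want really share an origin before injecting the request. The sketch below does that check in plain Python; the helper names `origin` and `same_origin` are made up for illustration.

```python
from urllib.parse import urlsplit

# Ports implied by the scheme when the URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(page_url, download_url):
    """True when both URLs share scheme, host and port."""
    return origin(page_url) == origin(download_url)
```

For the URLs in the answer, `same_origin(root_url, download_url)` is true because both live on `https://www.google.com`. Note that `www.google.com` and `google.com` count as different origins.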

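Next, we inject some JavaScript into the DOM that fires off an AJAX request. Once the AJAX request returns a response, we load that response into a FileReader object. From there we can extract the base64-encoded content of the file by calling readAsDataURL(). We then attach the base64-encoded content to window, a globally accessible variable.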
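The string produced by readAsDataURL() has the form `data:<mime type>;base64,<payload>`, which is why the script splits on the comma and keeps the second part. As a small illustration, the hypothetical helper below does the same split on the Python side and also returns the MIME type; it assumes the whole data URL is handed back rather than just the payload.

```python
import base64


def decode_data_url(data_url):
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    # Strip the "data:" prefix and the ";base64" suffix to get the MIME type.
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)
```

With this in place, the JavaScript could return `window.file_contents` unchanged and leave the parsing to Python.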

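Finally, because the AJAX request is asynchronous, we enter a Python while loop that waits for the content to be attached to window. Once it is attached, we decode the base64 content retrieved from window and save it to a file.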
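One thing the loop does not handle is a request that never completes, in which case it spins forever. A minimal sketch of the same polling idea with a deadline is shown here (Python 3); `poll_until` is a hypothetical helper, and the timeout and interval values are arbitrary assumptions.

```python
import time


def poll_until(fetch, timeout=30.0, interval=0.5):
    """Call fetch() repeatedly until it returns something other than None.

    Raises TimeoutError if no value arrives within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("file was not retrieved in time")
        time.sleep(interval)
```

With the `driver` from the script, it could be used as `poll_until(lambda: driver.execute_script("return window.file_contents;"))`.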

This solution should work in all modern browsers supported by Selenium, and it works for text or binary files and across all MIME types.

Alternate approach

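Although I haven't tested this, Selenium does give you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM and use the binding of that element as the trigger to retrieve the downloaded file.

In Chrome, what I do is download the file by clicking the link, then open the chrome://downloads page and retrieve the list of downloaded files from the shadow DOM, like this: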
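One way this idea could be sketched, equally untested against a real browser, is to have the injected script append a marker element once the FileReader finishes and let Selenium's `WebDriverWait` block until that element exists. The marker ID `download-ready` and the helper names are assumptions made for this example.

```python
import json

MARKER_ID = "download-ready"  # hypothetical ID, chosen for this sketch


def build_fetch_script(download_url, marker_id=MARKER_ID):
    """Return JavaScript that fetches the file, then adds a marker element."""
    return """
        window.file_contents = null;
        var xhr = new XMLHttpRequest();
        xhr.open('GET', %(url)s);
        xhr.responseType = 'blob';
        xhr.onload = function() {
            var reader = new FileReader();
            reader.onloadend = function() {
                window.file_contents = reader.result;
                var marker = document.createElement('div');
                marker.id = %(marker)s;
                document.body.appendChild(marker);
            };
            reader.readAsDataURL(xhr.response);
        };
        xhr.send();
    """ % {"url": json.dumps(download_url), "marker": json.dumps(marker_id)}


def fetch_via_marker(driver, download_url, timeout=30):
    """Inject the script, wait for the marker, and return the data URL."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.execute_script(build_fetch_script(download_url))
    # Blocks until the marker element is in the DOM, or raises on timeout.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, MARKER_ID))
    )
    return driver.execute_script("return window.file_contents;")
```

The wait replaces the hand-written loop and gives a timeout for free, at the cost of leaving an extra element in the page.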

    docs = document
        .querySelector('downloads-manager')
        .shadowRoot.querySelector('#downloads-list')
        .getElementsByTagName('downloads-item')

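This solution is limited to Chrome; the data also contains information such as the file path and download date. (Note that this code is JavaScript and may not be correct Python syntax.)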
