How do I make urllib2 requests through Tor in Python?
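I am trying to crawl websites using a crawler written in Python. I want to integrate Tor with Python, meaning I want to crawl the site anonymously through Tor.
I tried the following, but it does not seem to work. I checked my IP from Python, and it is still the same as it was before I used Tor.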
    import urllib2

    proxy_handler = urllib2.ProxyHandler({"tcp": "http://127.0.0.1:9050"})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)
You are trying to connect to a SOCKS port, and Tor rejects any non-SOCKS traffic. You can connect through a middleman, Privoxy, which listens on port 8118.
Example:
    import urllib2

    # Send HTTP requests to Privoxy, which forwards them to Tor.
    proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    print opener.open('http://www.google.com').read()
Also note the value passed to ProxyHandler: the key is "http" rather than "tcp", and the ip:port has no http prefix.
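For readers on Python 3, where urllib2 became urllib.request, the Privoxy approach above might look roughly like this. It is only a sketch: the helper names `make_privoxy_opener` and `fetch` are made up for illustration, and it assumes Privoxy is listening on 127.0.0.1:8118 and is already configured to forward to Tor.

```python
import urllib.request


def make_privoxy_opener(host="127.0.0.1", port=8118):
    """Build an opener that sends requests through a local Privoxy.

    Hypothetical helper: assumes Privoxy listens on host:port and
    forwards traffic to Tor's SOCKS port.
    """
    # Bare "ip:port" with no "http://" prefix, as in the example above.
    proxy = "%s:%d" % (host, port)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener


def fetch(url, opener=None):
    """Fetch url through the proxy and return the response body as bytes."""
    opener = opener or make_privoxy_opener()
    with opener.open(url, timeout=30) as response:
        return response.read()
```

Returning the opener instead of calling `install_opener` keeps the proxy setting local to the code that asks for it.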
    pip install PySocks

Then:
    import socket
    import socks
    import urllib2

    ipcheck_url = 'http://checkip.amazonaws.com/'

    # Actual IP.
    print(urllib2.urlopen(ipcheck_url).read())

    # Tor IP.
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
    socket.socket = socks.socksocket
    print(urllib2.urlopen(ipcheck_url).read())
Using only urllib2.ProxyHandler, as in https://stackoverflow.com/a/2015649/895245, fails with:

    Tor is not an HTTP Proxy
Mentioned at: How can I use a SOCKS 4/5 proxy with urllib2?
Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.
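Assigning `socket.socket = socks.socksocket` as above redirects every later connection in the process. If only some requests should go through Tor, one option is to scope the patch with a context manager. The sketch below is written for Python 3; `patched_socket` and `read_ip_via_tor` are hypothetical names, and the second function assumes PySocks is installed and Tor's SOCKS port is 127.0.0.1:9050.

```python
import contextlib
import socket


@contextlib.contextmanager
def patched_socket(socket_class):
    """Temporarily replace socket.socket, restoring the original on exit."""
    original = socket.socket
    socket.socket = socket_class
    try:
        yield
    finally:
        # Restore even if the body raised, so later code connects directly.
        socket.socket = original


def read_ip_via_tor(url="http://checkip.amazonaws.com/"):
    """Return the IP seen by the remote service when going through Tor.

    Assumption: PySocks is installed and Tor listens on 127.0.0.1:9050.
    """
    import socks  # PySocks, imported lazily so patched_socket has no dependency
    from urllib.request import urlopen

    socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
    with patched_socket(socks.socksocket):
        return urlopen(url).read().decode().strip()
```

Outside the `with` block, `socket.socket` is the standard class again, so a direct request can be compared with the proxied one.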
Using privoxy as an HTTP proxy in front of tor works for me. Here is a crawler template:
    import urllib2
    import httplib
    from optparse import OptionParser
    from time import sleep

    from BeautifulSoup import BeautifulSoup


    class Scraper(object):
        def __init__(self, options, args):
            # Default to the local Privoxy instance that forwards to Tor.
            if options.proxy is None:
                options.proxy = "http://localhost:8118/"
            self._open = self._get_opener(options.proxy)

        def _get_opener(self, proxy):
            proxy_handler = urllib2.ProxyHandler({'http': proxy})
            opener = urllib2.build_opener(proxy_handler)
            return opener.open

        def get_soup(self, url):
            # Retry once per second until the page has been fetched and parsed.
            soup = None
            while soup is None:
                try:
                    request = urllib2.Request(url)
                    request.add_header('User-Agent', 'foo bar useragent')
                    soup = BeautifulSoup(self._open(request))
                except (httplib.IncompleteRead, httplib.BadStatusLine,
                        urllib2.HTTPError, ValueError, urllib2.URLError), err:
                    sleep(1)
            return soup


    class PageType(Scraper):
        _URL_TEMPL = "http://foobar.com/baz/%s"

        def items_from_page(self, url):
            nextpage = None
            soup = self.get_soup(url)
            items = []
            for item in soup.findAll("foo"):
                items.append(item["bar"])
                nextpage = item["href"]
            return nextpage, items

        def get_items(self):
            nextpage, items = self.items_from_page(self._URL_TEMPL % "start.html")
            while nextpage is not None:
                nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
                items.extend(newitems)
            return items


    # Assumption: options and args come from optparse; the template
    # itself does not show how they are built.
    parser = OptionParser()
    parser.add_option("--proxy", dest="proxy", default=None)
    options, args = parser.parse_args()

    pt = PageType(options, args)
    print pt.get_items()
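The `get_soup` loop in the template retries forever, so a URL that keeps failing (a permanent 404, for example) would hang the crawler. A bounded retry could be sketched as follows in Python 3. `fetch_with_retries` and its parameters are illustrative and not part of the template; `fetch` stands for any callable that takes a URL and returns the page.

```python
import http.client
import time


def fetch_with_retries(fetch, url, attempts=5, delay=1.0,
                       retry_on=(OSError, http.client.HTTPException, ValueError)):
    """Call fetch(url), retrying on transient errors up to `attempts` times.

    urllib.error.URLError and HTTPError are subclasses of OSError in
    Python 3, so the default tuple covers the errors the template catches.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on as err:
            last_error = err
            # Sleep between attempts, but not after the final failure.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Once the attempts are used up, the last error is re-raised so the caller can decide whether to skip the page or stop.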
Here is code to download a file through the tor proxy in Python (update the url):
    import urllib2

    url = "data/media/17/Donald_Duck2.gif"

    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        # Backspaces move the cursor back so the next status overwrites this one.
        status = status + chr(8) * (len(status) + 1)
        print status,

    f.close()
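The download loop above reads `Content-Length` unconditionally, so it would fail on a response that does not send that header. A Python 3 sketch of the same chunked copy that tolerates an unknown size is shown below; `copy_in_chunks` and `format_progress` are illustrative names, and the source and target can be any objects with `read` and `write` methods.

```python
def format_progress(done, total=None):
    """Return a status string; the percentage is omitted if total is unknown."""
    if total:
        return "%10d [%6.2f%%]" % (done, done * 100.0 / total)
    return "%10d [size unknown]" % done


def copy_in_chunks(source, target, total_size=None, block_size=8192, report=None):
    """Copy source to target block by block and return the bytes copied.

    report, if given, is called as report(bytes_copied, total_size)
    after every block.
    """
    copied = 0
    while True:
        chunk = source.read(block_size)
        if not chunk:
            break
        target.write(chunk)
        copied += len(chunk)
        if report is not None:
            report(copied, total_size)
    return copied
```

With a response object from `urlopen`, `total_size` could come from `response.headers.get("Content-Length")`, converted to `int` when present.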
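The code below works on Python 3.4.
(You need to keep the Tor Browser open to use this code.)
The script connects to Tor over SOCKS5, gets the IP from checkip.dyn.com, changes identity, and sends the request again to get a new IP (looping 10 times).
You need to install the appropriate libraries to make this work. (Enjoy, and do not abuse it.)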
    import socks
    import socket
    import time
    from stem.control import Controller
    from stem import Signal
    import requests
    from bs4 import BeautifulSoup

    err = 0
    counter = 0
    url = "checkip.dyn.com"

    with Controller.from_port(port = 9151) as controller:
        try:
            controller.authenticate()
            socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
            socket.socket = socks.socksocket
            while counter < 10:
                r = requests.get("http://checkip.dyn.com")
                soup = BeautifulSoup(r.content)
                print(soup.find("body").text)
                counter = counter + 1
                # wait till next identity will be available
                controller.signal(Signal.NEWNYM)
                time.sleep(controller.get_newnym_wait())
        except requests.HTTPError:
            print("Could not reach URL")
            err = err + 1

    print("Used " + str(counter) + " IPs and got " + str(err) + " errors")
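The script above counts requests, not distinct addresses, and a NEWNYM signal asks Tor for new circuits without guaranteeing a different exit IP. To see how many identities were actually used, the printed bodies could be collected and parsed. The sketch below assumes each response body contains a dotted IPv4 address somewhere in its text; `extract_ip` and `count_distinct_ips` are illustrative names.

```python
import re

# Loose IPv4 pattern: four dot-separated groups of 1-3 digits.
# It does not check that each group is <= 255.
_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ip(text):
    """Return the first IPv4-looking string in text, or None."""
    match = _IPV4.search(text)
    return match.group(0) if match else None


def count_distinct_ips(bodies):
    """Count the distinct addresses found in a list of response bodies."""
    return len({ip for ip in map(extract_ip, bodies) if ip})
```

In the loop above, `soup.find("body").text` could be appended to a list that is passed to `count_distinct_ips` at the end.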
The following solutions work for Python 3. Adapted from Ciro Santilli's answer:
With urllib (the name of urllib2 in Python 3):
    import socks
    import socket
    from urllib.request import urlopen

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = urlopen(url)
    print(response.read())
With requests:
    import socks
    import socket
    import requests

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = requests.get(url)
    print(response.text)
With Selenium + PhantomJS:
    from selenium import webdriver

    url = 'http://icanhazip.com/'

    service_args = [
        '--proxy=localhost:9150',
        '--proxy-type=socks5',
    ]
    phantomjs_path = '/your/path/to/phantomjs'

    driver = webdriver.PhantomJS(
        executable_path=phantomjs_path,
        service_args=service_args)

    driver.get(url)
    print(driver.page_source)
    driver.close()
Note: if you plan to use Tor often, consider donating to support their outstanding work!
Update: the latest requests library (above v2.10.0) supports SOCKS proxies through the requests[socks] extra.
Installation:

    pip install requests requests[socks]

Basic usage:
    import requests

    session = requests.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    print requests.get("http://httpbin.org/ip").text
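One detail the example above does not cover: with the `socks5://` scheme, requests resolves hostnames locally, so DNS lookups bypass Tor. The `socks5h://` scheme asks the proxy to resolve names instead. A small Python 3 sketch, where `tor_proxies` and `make_tor_session` are illustrative names and the defaults assume Tor's SOCKS port is 127.0.0.1:9050:

```python
def tor_proxies(host="127.0.0.1", port=9050, remote_dns=True):
    """Build a requests-style proxies mapping for a Tor SOCKS port.

    remote_dns=True selects socks5h, so hostnames are resolved by the
    proxy rather than on the local machine.
    """
    scheme = "socks5h" if remote_dns else "socks5"
    url = "%s://%s:%d" % (scheme, host, port)
    return {"http": url, "https": url}


def make_tor_session(**kwargs):
    """Return a requests session routed through Tor.

    Assumption: requests is installed with the socks extra
    (pip install requests[socks]).
    """
    import requests  # imported lazily; tor_proxies itself needs no packages

    session = requests.Session()
    session.proxies = tor_proxies(**kwargs)
    return session
```

For the Tor Browser bundle, `make_tor_session(port=9150)` would point at the port used in the other answers.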
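Old answer: even though this is an old post, I am answering because nobody seems to have mentioned the requesocks library.
It is basically a port of the requests library. Note that it is an old fork (last updated 2013-03-25) and may not have the same functionality as the latest requests library.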
Installation:

    pip install requesocks

Basic usage:
    # Assuming that Tor is up & running
    import requesocks

    session = requesocks.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    import requests
    print requests.get("http://httpbin.org/ip").text
Maybe you have a network connectivity problem? The script above works for me (I substituted a different URL, http://stackoverflow.com/, and got the expected page):

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
    <html>
    <head>
    <title>Stack Overflow</title>
    <link rel="stylesheet" href="/content/all.css?v=3856">

(And so on.)
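Tor is a SOCKS proxy. Connecting to it directly, as in the example you cite, fails with "urlopen error Tunnel connection failed: 501 Tor is not an HTTP Proxy". As others have said, you can work around this with Privoxy.
Alternatively, you can use PycURL or SocksiPy. For examples of using both with tor, see…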
https://stem.torproject.org/tutorials/to_russia_with_love.html
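The answers in this thread use several local ports: 9050 (Tor's default SOCKS port), 9150 (Tor Browser's SOCKS port), and 8118 (Privoxy's HTTP port). Before debugging proxy errors, it can help to check which of them is actually listening. The Python 3 sketch below only tests whether a TCP connection succeeds and does not verify what kind of proxy answers; the function names and the `CANDIDATE_PORTS` table are illustrative.

```python
import socket

# Ports mentioned in this thread; the labels are descriptive only.
CANDIDATE_PORTS = {
    9050: "Tor daemon (SOCKS)",
    9150: "Tor Browser (SOCKS)",
    8118: "Privoxy (HTTP)",
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def listening_proxies(host="127.0.0.1"):
    """Return the candidate ports that accept connections on host."""
    return {port: label for port, label in CANDIDATE_PORTS.items()
            if port_is_open(host, port)}
```

An empty result from `listening_proxies()` suggests that neither Tor nor Privoxy is running on the local machine.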
You can use torify to run your program:

    ~$ torify python your_program.py