403 Forbidden with Java but not with a web browser?

I am writing a small Java program to get the number of results for a given Google search term. For some reason I get a 403 Forbidden in Java, but I get the correct results in a web browser. Code:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class DataGetter {

        public static void main(String[] args) throws IOException {
            getResultAmount("test");
        }

        private static int getResultAmount(String query) throws IOException {
            BufferedReader r = new BufferedReader(new InputStreamReader(
                    new URL("https://www.google.com/search?q=" + query).openConnection()
                            .getInputStream()));
            String line;
            String src = "";
            while ((line = r.readLine()) != null) {
                src += line;
            }
            System.out.println(src);
            return 1;
        }
    }

And the error:

    Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/search?q=test
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
        at DataGetter.getResultAmount(DataGetter.java:15)
        at DataGetter.main(DataGetter.java:10)

Why is it doing this?

You just need to set the User-Agent header for it to work:

    URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
    // Identify as a regular browser; without this header the request is rejected with 403.
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.connect();

    BufferedReader r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));

    // Collect the response body line by line.
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = r.readLine()) != null) {
        sb.append(line);
    }
    System.out.println(sb.toString());

As you can see from your exception stack trace, SSL is handled transparently.

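Getting the result count is not quite that simple, though. After this you have to pretend to be a browser by picking up the cookie and parsing the redirect link out of the response.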

    // `response` is assumed to hold the body of the first request (sb.toString() from above).
    // Keep only the name=value part of the Set-Cookie header.
    String cookie = connection.getHeaderField("Set-Cookie").split(";")[0];

    // Extract the redirect target from the meta-refresh tag.
    Pattern pattern = Pattern.compile("content=\\\"0;url=(.*?)\\\"");
    Matcher m = pattern.matcher(response);
    if (m.find()) {
        String url = m.group(1);

        // Follow the redirect, sending the same User-Agent plus the cookie.
        connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        connection.setRequestProperty("Cookie", cookie);
        connection.connect();

        r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
        sb = new StringBuilder();
        while ((line = r.readLine()) != null) {
            sb.append(line);
        }
        response = sb.toString();

        // Pull the number out of the result statistics element.
        pattern = Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");
        m = pattern.matcher(response);
        if (m.find()) {
            long amount = Long.parseLong(m.group(1).replaceAll(",", ""));
            return amount;
        }
    }

Running the full code, I get a result of 2930000000L.
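The three parsing steps above (cookie, redirect link, result count) can be tried without touching the network. The following is only a sketch: the class name `ResultStatsParser`, its method names, and the sample strings are made up for illustration, and the patterns are copied from the snippet above, so they assume the same markup and will stop matching if Google changes its HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the patterns mirror the ones used in the answer above.
class ResultStatsParser {

    // Assumed markup of the result statistics element.
    private static final Pattern RESULT_STATS =
            Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");

    // Assumed shape of the meta-refresh redirect.
    private static final Pattern META_REFRESH =
            Pattern.compile("content=\"0;url=(.*?)\"");

    /** Returns the result count, or -1 if the statistics element is not found. */
    static long parseResultCount(String html) {
        Matcher m = RESULT_STATS.matcher(html);
        return m.find() ? Long.parseLong(m.group(1).replace(",", "")) : -1L;
    }

    /** Returns the meta-refresh target URL, or null if there is none. */
    static String parseRedirectUrl(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Keeps only the name=value part of a Set-Cookie header value. */
    static String cookiePair(String setCookieHeader) {
        return setCookieHeader == null ? null : setCookieHeader.split(";")[0];
    }

    public static void main(String[] args) {
        // Invented sample inputs, not real Google responses.
        String statsPage = "<html><div id=\"resultStats\">About 2,930,000,000 results</div></html>";
        String redirectPage = "<meta http-equiv=\"refresh\" content=\"0;url=https://example.com/next\">";

        System.out.println(parseResultCount(statsPage));
        System.out.println(parseRedirectUrl(redirectPage));
        System.out.println(cookiePair("NID=abc123; path=/; HttpOnly"));
    }
}
```

Returning -1 or null instead of throwing makes it easier to notice when the page layout no longer matches the patterns, which is the most likely way for this kind of scraping to break.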

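You are probably not setting the right headers. Use LiveHttpHeaders (or an equivalent tool) in your browser to see which headers the browser is sending, then emulate them in your code.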
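As a rough sketch of what emulating the browser's headers could look like with plain `URLConnection`: the class name `BrowserLikeRequest` and the `Accept` and `Accept-Language` values are assumptions chosen for illustration, not values any server is known to require. Replace them with whatever your own browser actually sends.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that copies a set of browser-like headers onto a connection.
class BrowserLikeRequest {

    /** Example header set; the values are placeholders to be replaced with your browser's. */
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 "
                + "(KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        headers.put("Accept", "text/html,application/xhtml+xml");
        headers.put("Accept-Language", "en-US,en;q=0.9");
        return headers;
    }

    /** Creates the connection object and sets the headers; nothing is sent yet. */
    static URLConnection open(String url) {
        try {
            URLConnection connection = new URL(url).openConnection();
            for (Map.Entry<String, String> header : browserHeaders().entrySet()) {
                connection.setRequestProperty(header.getKey(), header.getValue());
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("https://www.google.com/search?q=test");
        // Print what would be sent; the request itself only goes out on connect()
        // or getInputStream().
        connection.getRequestProperties().forEach(
                (name, values) -> System.out.println(name + ": " + values));
    }
}
```

The sketch deliberately leaves out `Accept-Encoding`: if you advertise gzip, the response body arrives compressed and has to be decompressed before it can be read as text.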

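It's because the site uses SSL. Try using the Jersey HTTP client. You will probably also have to learn a little about HTTPS and certificates, but I think Jersey can be set to ignore most of the details relating to actual security.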
