Get the domain name from a given URL

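Given a URL, I want to extract the domain name (it should not include the "www" part). The URL can contain http/https. Here is the Java code I wrote. Though it seems to work fine, is there a better approach, or are there edge cases where it could fail?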

    public static String getDomainName(String url) throws MalformedURLException {
        if (!url.startsWith("http") && !url.startsWith("https")) {
            url = "http://" + url;
        }
        URL netUrl = new URL(url);
        String host = netUrl.getHost();
        if (host.startsWith("www")) {
            host = host.substring("www".length() + 1);
        }
        return host;
    }

Input: http://google.com/blah

Output: google.com

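If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems: its equals method does a DNS lookup, which means that code using it can be vulnerable to denial-of-service attacks when used with untrusted input.

"Mr. Gosling, why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.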

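    import java.net.URI;
    import java.net.URISyntaxException;

    public static String getDomainName(String url) throws URISyntaxException {
        URI uri = new URI(url);
        // Note: getHost() returns null when the input has no authority part,
        // e.g. a scheme-less string such as "google.com/blah".
        String domain = uri.getHost();
        // Strip only a leading "www." label, not any host that merely starts with "www".
        return domain.startsWith("www.") ? domain.substring(4) : domain;
    }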

should do what you want.

"Though it seems to work fine, is there a better approach, or are there edge cases where it could fail?"

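Your code as written fails for these valid URLs:

httpfoo/bar - a relative URL with a path component that starts with http.
HTTP://example.com/ - the protocol is case-insensitive.
//example.com/ - a protocol-relative URL with a host.
www/foo - a relative URL with a path component that starts with www.
wwwexample.com - a domain name that does not start with "www." but does start with "www".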

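Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that is built into the core libraries.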
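To see how the parser built into the core library treats the five inputs listed above, a minimal sketch along the following lines may help. The class name UriEdgeCases and the helper hostOf are hypothetical names introduced only for this illustration; the only library calls used are URI.create and URI.getHost.

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
class UriEdgeCases {

    // Returns the host reported by java.net.URI, or null when the input
    // has no authority component (for example, a relative path).
    // URI.create throws IllegalArgumentException on syntactically invalid input.
    static String hostOf(String input) {
        return URI.create(input).getHost();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "httpfoo/bar", "HTTP://example.com/", "//example.com/", "www/foo", "wwwexample.com"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + hostOf(input));
        }
    }
}
```

Here the two inputs that carry an authority (HTTP://example.com/ and //example.com/) report example.com, while the three relative references report null instead of a guessed host, so the caller has to decide explicitly what a missing host means.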

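If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.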

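The following line is the regular expression for breaking down a well-formed URI reference into its components.

    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
     12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).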
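As a rough sketch of how that expression could be applied in Java, the snippet below compiles it and reads the five components from groups 2, 4, 5, 7 and 9, following the numbering shown above. The class name AppendixBParser and the method split are hypothetical names chosen for this example. Note that group 4 is the whole authority, so it still contains any userinfo and port; isolating the bare host would need a further step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class AppendixBParser {

    // Regular expression from RFC 3986, Appendix B.
    private static final Pattern URI_REFERENCE = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns {scheme, authority, path, query, fragment}.
    // Components that are absent from the input are returned as null.
    static String[] split(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        if (!m.matches()) {
            // Can happen, for example, when the fragment contains a line break.
            throw new IllegalArgumentException("Not a URI reference: " + reference);
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        System.out.println("scheme    = " + parts[0]);
        System.out.println("authority = " + parts[1]);
        System.out.println("path      = " + parts[2]);
        System.out.println("query     = " + parts[3]);
        System.out.println("fragment  = " + parts[4]);
    }
}
```

Because the expression accepts almost any string, it splits input rather than validating it: a scheme-less string such as www/foo simply comes back with a null authority and the whole text as its path.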

    import java.net.*;
    import java.io.*;

    public class ParseURL {
        public static void main(String[] args) throws Exception {
            URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                               + "/index.html?name=networking#DOWNLOADING");

            System.out.println("protocol = " + aURL.getProtocol());   // http
            System.out.println("authority = " + aURL.getAuthority()); // example.com:80
            System.out.println("host = " + aURL.getHost());           // example.com
            System.out.println("port = " + aURL.getPort());           // 80
            System.out.println("path = " + aURL.getPath());           // /docs/books/tutorial/index.html
            System.out.println("query = " + aURL.getQuery());         // name=networking
            System.out.println("filename = " + aURL.getFile());       // /docs/books/tutorial/index.html?name=networking
            System.out.println("ref = " + aURL.getRef());             // DOWNLOADING
        }
    }

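Here is a short and simple one-liner using InternetDomainName.topPrivateDomain() from Guava: InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()

Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.

As Sa Qada commented in another answer on this post, this question has been asked before: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain()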

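public boolean isTopPrivateDomain()

Indicates whether this domain name is composed of exactly one subdomain component followed by a public suffix. For example, returns true for google.com and foo.co.uk, but not for www.google.com or co.uk.

Warning: A true result from this method does not imply that the domain is at the highest level which is addressable as a host, as many public suffixes are also addressable hosts. For example, the domain bar.uk.com has a public suffix of uk.com, so it would return true from this method. But uk.com is itself an addressable host.

This method can be used to determine whether a domain is probably the highest level for which cookies may be set, though even that depends on individual browsers' implementations of cookie controls. See RFC 2109 for details.

Putting that together with URL.getHost(), which the original post already contains, gives you: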

    import com.google.common.net.InternetDomainName;
    import java.net.URL;

    public class DomainNameMain {
        public static void main(final String... args) throws Exception {
            final String urlString = "http://www.google.com/blah";
            final URL url = new URL(urlString);
            final String host = url.getHost();
            final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();
            System.out.println(urlString);
            System.out.println(host);
            System.out.println(name);
        }
    }

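I wrote a method (see below) that extracts a URL's domain name using simple String matching. What it actually does is extract the part between the first "://" (or index 0 if there is no "://") and the first subsequent "/" (or index String.length() if there is no subsequent "/"). Any remaining leading "www(_)*." part is chopped off. I'm sure there will be cases where this is not good enough, but it should be good enough in most cases!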

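Mike Samuel's post above says that the java.net.URI class can do this (and is preferred over the java.net.URL class), but I ran into problems with the URI class. Notably, URI.getHost() returns null if the URL does not include the scheme, i.e. the "http(s)" part.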
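The null result described above can be reproduced, and one possible workaround sketched, as follows. This is only a heuristic under a stated assumption: that a scheme-less input such as www.google.com/blah is meant as "host/path", so it is retried as a network-path reference by prefixing "//". The class name LenientHost and the method hostOf are hypothetical names for this example; inputs where that assumption does not hold (for instance the relative path httpfoo/bar) would be misread as a host.

```java
import java.net.URI;

// Hypothetical helper class, for illustration only.
class LenientHost {

    // Returns the host of the input, retrying scheme-less inputs as "//input".
    // URI.create throws IllegalArgumentException on syntactically invalid input.
    static String hostOf(String input) {
        String host = URI.create(input).getHost();
        if (host == null && !input.contains("://")) {
            // Assumption: without a scheme, the text is "host/path", so parse it
            // as a network-path reference, which does carry an authority.
            host = URI.create("//" + input).getHost();
        }
        return host;
    }

    public static void main(String[] args) {
        // Plain java.net.URI sees a relative path here, hence no host.
        System.out.println(URI.create("www.google.com/blah").getHost());
        // With the retry, the first label run is treated as the host.
        System.out.println(hostOf("www.google.com/blah"));
        System.out.println(hostOf("http://www.google.com/blah"));
    }
}
```

Stripping a leading "www." from the result would be a separate step, the same as in the URI-based answer.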

    /**
     * Extracts the domain name from {@code url}
     * by means of String manipulation
     * rather than using the {@link URI} or {@link URL} class.
     *
     * @param url is non-null.
     * @return the domain name within {@code url}.
     */
    public String getUrlDomainName(String url) {
        String domainName = url;

        int index = domainName.indexOf("://");
        if (index != -1) {
            // keep everything after the "://"
            domainName = domainName.substring(index + 3);
        }

        index = domainName.indexOf('/');
        if (index != -1) {
            // keep everything before the '/'
            domainName = domainName.substring(0, index);
        }

        // check for and remove a preceding 'www'
        // followed by any sequence of characters (non-greedy)
        // followed by a '.'
        // from the beginning of the string
        domainName = domainName.replaceFirst("^www.*?\\.", "");

        return domainName;
    }

I do a little preprocessing on the string before creating the URI object:

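    public static String getDomainName(String url) throws URISyntaxException {
        if (url.startsWith("http:/")) {
            // Repair a single-slash "http:/" prefix.
            if (!url.contains("http://")) {
                url = url.replaceAll("http:/", "http://");
            }
        } else if (!url.startsWith("https://")) {
            // No scheme at all: assume http. (Without the https check,
            // an https:// URL would wrongly get "http://" prepended.)
            url = "http://" + url;
        }
        URI uri = new URI(url);
        String domain = uri.getHost();
        return domain.startsWith("www.") ? domain.substring(4) : domain;
    }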

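There is a similar question: Extract main domain name from a given url. If you take a look at that answer, you will see that it is very easy. You just need to use java.net.URL and the String utility split.

If the input URL is typed in by a user, this method gives the most appropriate host name. If no host is found, it returns the input URL.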

    private String getHostName(String urlInput) {
        urlInput = urlInput.toLowerCase();
        String hostName = urlInput;
        if (!urlInput.equals("")) {
            // startsWith("http") also covers "https"
            if (urlInput.startsWith("http")) {
                try {
                    URL netUrl = new URL(urlInput);
                    String host = netUrl.getHost();
                    // Match "www." including the dot, so that a host such as
                    // "wwwexample.com" is left alone and a bare "www" cannot
                    // cause a StringIndexOutOfBoundsException.
                    if (host.startsWith("www.")) {
                        hostName = host.substring("www.".length());
                    } else {
                        hostName = host;
                    }
                } catch (MalformedURLException e) {
                    hostName = urlInput;
                }
            } else if (urlInput.startsWith("www.")) {
                hostName = urlInput.substring("www.".length());
            }
            return hostName;
        } else {
            return "";
        }
    }

Try this one (it takes a java.net.URL):

    JOptionPane.showMessageDialog(null, getDomainName(new URL("https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains")));

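    import java.net.URL;
    import java.util.Arrays;
    import java.util.regex.Pattern;

    // Assumes the host has at least two labels; a single-label host
    // such as "localhost" would index out of bounds.
    public String getDomainName(URL url) {
        String strDomain;
        String[] strhost = url.getHost().split(Pattern.quote("."));
        String[] strTLD = {"com", "org", "net", "int", "edu", "gov", "mil", "arpa"};

        if (Arrays.asList(strTLD).indexOf(strhost[strhost.length - 1]) >= 0) {
            // Known generic TLD: keep the last two labels.
            strDomain = strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
        } else if (strhost.length > 2) {
            // Otherwise keep the last three labels (e.g. example.co.uk).
            strDomain = strhost[strhost.length - 3] + "." + strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
        } else {
            strDomain = strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
        }
        return strDomain;
    }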

    private static final String hostExtractorRegexString =
            "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)";
    private static final Pattern hostExtractorRegexPattern = Pattern.compile(hostExtractorRegexString);

    public static String getDomainName(String url) {
        if (url == null) return null;
        url = url.trim();
        Matcher m = hostExtractorRegexPattern.matcher(url);
        if (m.find() && m.groupCount() == 2) {
            return m.group(1) + m.group(2);
        } else {
            return null;
        }
    }

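Explanation: The regex has 4 groups. The first two are non-capturing groups and the next two are capturing groups.

The first non-capturing group is "http://" or "https://" or "".

The second non-capturing group is "www." or "".

The second capturing group is the top-level domain.

The first capturing group is everything after the non-capturing groups and before the top-level domain.

Concatenating the two capturing groups gives us the domain/host name.

PS: Note that you can add any number of supported domains to the regex.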
