How can I get href links from HTML using Python?

    import urllib2

    website = "WEBSITE"
    openwebsite = urllib2.urlopen(website)
    # read from the handle that was actually opened above
    html = openwebsite.read()
    print html

So far so good.

But I only want the href links from the plain-text HTML. How can I solve this?

Try BeautifulSoup:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import re

    html_page = urllib2.urlopen("http://www.yourwebsite.com")
    soup = BeautifulSoup(html_page)
    # print the href attribute of every anchor tag
    for link in soup.findAll('a'):
        print link.get('href')

If you only want links starting with http://, you should use:

    soup.findAll('a', attrs={'href': re.compile("^http://")})
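Note that the `^http://` pattern above leaves out `https://` links. Here is a minimal sketch of a broader pattern, written for Python 3 and applied to a plain list of href strings so that it runs without BeautifulSoup. The names `ABSOLUTE_LINK`, `absolute_links` and `sample_hrefs` are made up for illustration; the same compiled pattern could presumably be passed as the `href` filter in the `findAll` call.

```python
import re

# Matches both http:// and https:// at the start of the string.
ABSOLUTE_LINK = re.compile(r"^https?://")


def absolute_links(hrefs):
    """Keep only the hrefs that start with http:// or https://."""
    # Anchors without an href give None, so skip empty values first.
    return [h for h in hrefs if h and ABSOLUTE_LINK.match(h)]


sample_hrefs = [
    "http://example.com/a",
    "https://example.com/b",
    "/relative",
    "mailto:someone@example.com",
    None,
]
print(absolute_links(sample_hrefs))
```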

You can use the HTMLParser module.

The code would probably look like this:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Only parse the 'anchor' tag.
            if tag == "a":
                # Check the list of defined attributes.
                for name, value in attrs:
                    # If href is defined, print it.
                    if name == "href":
                        print name, "=", value

    parser = MyHTMLParser()
    parser.feed(your_html_string)

Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
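For reference, here is a rough sketch of how the same idea might look on Python 3 with `html.parser`. It collects the links into a list instead of printing them. The class name `HrefCollector`, the helper `extract_hrefs` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only look at anchor tags.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the hrefs found in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


sample = (
    '<p><a href="http://example.com/">one</a> '
    '<a name="x">no href</a> '
    '<a href="/two">two</a></p>'
)
print(extract_hrefs(sample))
```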

Take a look at the Beautiful Soup HTML parsing library.

http://www.crummy.com/software/BeautifulSoup/

You would do something like this:

    import BeautifulSoup

    soup = BeautifulSoup.BeautifulSoup(html)
    for link in soup.findAll("a"):
        print link.get("href")

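My answer is probably worse than what the real gurus would write, but using some simple math, string slicing, find and urllib, this little script will create a list containing the link elements. I tested it on Google and my output seems right. Hope it helps!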

    import urllib

    test = urllib.urlopen("http://www.google.com").read()
    sane = 0
    needlestack = []
    while sane == 0:
        curpos = test.find("href")
        if curpos >= 0:
            # cut everything before the next "href"
            testlen = len(test)
            test = test[curpos:testlen]
            # skip past the opening quote of the attribute value
            curpos = test.find('"')
            testlen = len(test)
            test = test[curpos+1:testlen]
            # the value runs up to the closing quote
            curpos = test.find('"')
            needle = test[0:curpos]
            # a tuple checks both prefixes; "http" or "www" would only test "http"
            if needle.startswith(("http", "www")):
                needlestack.append(needle)
        else:
            sane = 1
    for item in needlestack:
        print item

Here's a lazy version of @stephen's answer:

    from urllib.request import urlopen
    from itertools import chain
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        def reset(self):
            HTMLParser.reset(self)
            self.links = iter([])

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        self.links = chain(self.links, [value])

    def gen_links(f, parser):
        encoding = f.headers.get_content_charset() or 'UTF-8'
        for line in f:
            parser.feed(line.decode(encoding))
            yield from parser.links

Use it like so:

    >>> parser = LinkParser()
    >>> f = urlopen('http://stackoverflow.com/questions/3075550')
    >>> links = gen_links(f, parser)
    >>> next(links)
    '//stackoverflow.com'
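The first link returned above is scheme-relative (`//stackoverflow.com`). If absolute URLs are needed, one possible approach is to resolve each href against the page URL with `urllib.parse.urljoin`. This is a small sketch; the helper name `absolutize` and the sample hrefs other than the first one are assumptions for illustration.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url, yielding absolute URLs lazily."""
    for href in hrefs:
        yield urljoin(base_url, href)


base = "http://stackoverflow.com/questions/3075550"
sample_hrefs = ["//stackoverflow.com", "/questions", "http://example.com/x"]
for url in absolutize(base, sample_hrefs):
    print(url)
```

Because `absolutize` is itself a generator, it could wrap the output of `gen_links` without giving up the laziness.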

Using BS4 for this specific task seems like overkill.

Try this instead:

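    import re
    import urllib2

    website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
    html = website.read()
    # capture href values ending in tgz or tar.gz (the dot is escaped so it matches literally)
    files = re.findall(r'href="(.*tgz|.*tar\.gz)"', html)
    print sorted(files)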

I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall and it works quite well for me.

I only tested it on my own scenario of extracting a list of files from a web folder that exposes its files and folders, and I got a sorted list of the files and folders under the URL.
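One caveat with the greedy `.*` in the pattern above: if several links sit on the same line of HTML, a single match can span more than one href. Here is a hedged Python 3 sketch of a stricter variant that stops at the closing quote. The names `ARCHIVE_HREF` and `archive_links` and the sample HTML are made up for illustration.

```python
import re

# [^"]* cannot cross the closing quote, so each match stays inside one href.
ARCHIVE_HREF = re.compile(r'href="([^"]*\.(?:tgz|tar\.gz))"')


def archive_links(html_text):
    """Return the sorted href values that end in .tgz or .tar.gz."""
    return sorted(ARCHIVE_HREF.findall(html_text))


sample = (
    '<a href="b.tar.gz">b</a> '
    '<a href="a.tgz">a</a> '
    '<a href="notes.txt">n</a>'
)
print(archive_links(sample))
```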
