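How do I prevent site scraping?
I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names and then do Google searches for them).
How can I prevent screen scraping? Is it even possible?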
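I will presume that you have set up a robots.txt.
As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.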
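I would consider:
- Set up a page, /jail.html.
- Disallow access to the page in robots.txt (so respectful spiders will never visit it).
- Place a link to it on one of your pages, hidden with CSS (display: none).
- Record the IP addresses of visitors to /jail.html.
This might help you quickly identify requests from scrapers that flagrantly disregard your robots.txt.
You might also want to make /jail.html a whole website that has exactly the same markup as your normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.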
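Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as web scraping, screen scraping, web data mining, web harvesting, or web data extraction), it helps to know how these scrapers work and what prevents them from working well.
There are various types of scraper, and each works differently: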
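- Spiders, such as Google's bot, or website copiers like HTtrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
- Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (regex) to extract the data.
- HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to the shell-script regex based ones, these work by extracting data from pages based on patterns in the HTML, usually ignoring everything else.
  For example: If your website has a search feature, such a scraper might submit a search request, and then get all the result links and their titles from the results page HTML, in order to get only the search result links and their titles. These are the most common.
- Screenscrapers, based on e.g. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the web page, usually by:
  - Getting the HTML from the browser after your page has been loaded and the JavaScript has run, and then using an HTML parser to extract the desired data. These are the most common, so many of the methods for breaking HTML parsers / scrapers also work here.
  - Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
- Web scraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use.
  Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these services (and the people who pay them) may not bother with your website.
- Embedding your website in other sites' pages, and embedding your site in mobile apps.
  While not technically scraping, mobile apps (Android and iOS) can embed websites and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
- Human copy-and-paste: People will copy and paste your content in order to use it elsewhere.
There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly, even if they use different technologies and methods.
These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, and bits of information and ideas from various websites.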
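How to stop scraping
You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Monitor your logs and traffic patterns; limit access if you see unusual activity
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas: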
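- Rate limiting:
  Only allow users (and scrapers) to perform a limited number of actions in a certain time, for example only a few searches per second from any specific IP address or user. This will slow down scrapers and make them ineffective. You could also show a captcha if actions are completed too fast, or faster than a real user would complete them.
- Detect unusual activity:
  If you see unusual activity, such as many similar requests from a specific IP address, or someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
- Don't just monitor and rate limit by IP address; use other indicators too:
  If you do block or rate limit, don't just do it on a per-IP-address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
  - How fast users fill out forms, and where on a button they click;
  - Information you can gather with JavaScript, such as screen size / resolution, timezone, installed fonts, etc.;
  - HTTP headers and their order, especially User-Agent.
  As an example, if you get many requests from a single IP address, all using the same User-Agent and screen size (determined with JavaScript), and the user (the scraper, in this case) always clicks the button in the same way and at regular intervals, it's probably a screen scraper. You can temporarily block similar requests (e.g. block all requests with that user agent and screen size coming from that particular IP address); this way you won't inconvenience real users on that IP address, e.g. in the case of a shared internet connection.
  You can also take this further, since you can identify similar requests even if they come from different IP addresses, which is indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests that come from different IP addresses, you can block them. Again, be careful not to inadvertently block real users.
  This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
  Related questions on Security Stack Exchange:
  - How to uniquely identify users with the same external IP address? for more details, and
  - Why do people use IP address bans when IP addresses often change? for information on the limits of these methods.
- Instead of temporarily blocking access, use a captcha:
  The simple way to implement rate limiting would be to temporarily block access for a certain amount of time, but using a captcha may be better; see the section on captchas further down.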
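A minimal sketch of the rate-limiting idea above, in Python, and not part of the original answer. It assumes an in-memory, single-process server; the names `SlidingWindowLimiter` and `client_key` are made up for illustration. The key combines IP address, user agent, and screen size, so that several people behind one shared IP are not throttled as a single client.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_actions` per `window_seconds` for each client key."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._events = defaultdict(deque)

    def allow(self, key):
        """Record one action for `key`; return False if it exceeds the limit."""
        now = self.clock()
        events = self._events[key]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_actions:
            return False  # the caller might show a captcha instead of blocking
        events.append(now)
        return True


def client_key(ip, user_agent, screen_size):
    """Combine several indicators instead of relying on the IP address alone."""
    return (ip, user_agent, screen_size)
```

When `allow()` returns False, showing a captcha for the next request is probably kinder to real users than a hard block.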
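Require registration and login
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but it is also a good deterrent for real users.
- If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.
In order to avoid scripts creating many accounts, you should:
- Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
- Require a captcha to be solved during registration / account creation.
Requiring account creation in order to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.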
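Block access from cloud hosting and scraping service IP addresses
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or from VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid having their many requests detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.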
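As a rough illustration of the IP-range check described above (a Python sketch, not from the original answer): the networks listed here are placeholder documentation ranges, not real provider ranges, and `HOSTING_RANGES`, `is_hosting_ip`, and `action_for` are hypothetical names. In practice you would load the ranges that the hosting or VPN providers themselves publish.

```python
import ipaddress

# Placeholder networks from the reserved documentation blocks.
# Real cloud providers publish their own ranges; those are not reproduced here.
HOSTING_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_hosting_ip(ip_text, ranges=HOSTING_RANGES):
    """Return True if the address falls inside any listed network."""
    address = ipaddress.ip_address(ip_text)
    return any(address in network for network in ranges)


def action_for(ip_text):
    """Prefer a captcha over a hard block, since real users may sit behind VPNs."""
    return "captcha" if is_hosting_ip(ip_text) else "allow"
```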
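Make your error message nondescript if you do block
If you do block / limit access, make sure that you don't tell the scraper what caused the block, since that gives them clues about how to fix their scraper. So a bad idea would be to show error pages with text like:
- Too many requests from your IP address, please try again later.
- Error, User Agent header not present!
Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:
- Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist.
This is also a lot more user-friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, so that a real user who triggers the error by accident can get back in without having to contact you.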
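Use captchas if you suspect that your website is being accessed by a scraper.
Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using captchas:
- Don't roll your own; use something like Google's reCaptcha: It's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site.
- Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution to the captcha in the page itself (although quite well hidden), thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
- Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as it has protections (such as the relatively short time the user has to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.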
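Serve your text content as an image
You can render text into an image server-side and serve that image, which will hinder simple scrapers extracting text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility requirements, e.g. the Americans with Disabilities Act), and it's easy to circumvent with some OCR, so don't do it.
You can do something similar with CSS sprites, but that suffers from the same problems.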
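Don't expose your complete dataset:
If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: You have a news site with lots of individual articles. You could make those articles accessible only by searching for them via the on-site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be accessible only through the search feature. This means that a script wanting to get all the articles off your site will have to search for every phrase that might appear in your articles in order to find them all, which will be time-consuming and horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if:
- The bot / script does not want / need the full dataset anyway.
- Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
- There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
- Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results.)
- You need search engines to find your content.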
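Don't expose your APIs, endpoints, and similar things:
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX, or network requests from within Adobe Flash or Java Applets (God forbid!), to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer those endpoints and use them in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described in the section on obfuscation.
To deter HTML parsers and scrapers:
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers, like spiders and screenscrapers.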
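Frequently change your HTML
Scrapers which process HTML directly do so by extracting content from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script that visits all the article pages on your site and extracts the text of the article-content div on each page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
- You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
- If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after an h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup in your HTML, periodically and randomly, e.g. by adding extra divs or spans. With modern server-side HTML processing, this should not be too hard.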
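One way the automatic renaming could be done is sketched below in Python. This is an assumption-laden illustration, not the answer's own method: `SECRET`, `rotated_name`, and the `"2024-W07"` period format are all hypothetical. The same function would have to be used when rendering both the HTML and the matching CSS / JavaScript, so that they stay in sync.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side secret


def rotated_name(logical_name, period, secret=SECRET):
    """Map a stable logical class name to an opaque one that changes each period.

    `period` is any value that changes on your rotation schedule, for example
    an ISO week string such as "2024-W07". The output length varies, so a
    fixed-length pattern like div.[any-14-characters] cannot match it.
    """
    message = f"{period}:{logical_name}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    length = 8 + int(digest[-1], 16) % 8  # 8 to 15 hex characters
    # CSS identifiers must not start with a digit, so prefix a letter.
    return "c" + digest[:length]
```

A template would then emit `rotated_name("article-content", current_week)` instead of the literal class name.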
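Things to be aware of:
- It will be tedious and difficult to implement, maintain, and debug.
- You will hinder caching. Especially if you change the ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change them once a week, it will not be a big problem.
- Clever scrapers will still be able to get your content by inferring where the actual content is, e.g. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find and extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content on every similar page.
See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.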
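Change your HTML based on the user's location
This is similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country and thus get different HTML, which the embedded scraper was not designed to consume.
Frequently change your HTML, and actively screw with the scrapers by doing so!
An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML: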
<div class="search-result">
  <h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q &amp; A website</h3>
  <p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q &amp; A website, with 10 million questions and many users, which...</p>
  <a class="search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)
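As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed: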
<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q &amp; A website</h3>
  <p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q &amp; A website, with 10 million questions and many users, which...</p>
  <a class="the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>

<div class="search-result" style="display:none">
  <h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
  <p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
  <a class="search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)
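This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, which real users will never see, as they're hidden with CSS.
Screw with the scraper: Insert fake, invisible honeypot data into your page
Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page: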
<div class="search-result" style="display:none">
  <h3 class="search-result-title">This search result is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-) Note that clicking the link below will block access to this site for 24 hours.</p>
  <a class="search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)
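A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (since it is hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either, because you disallowed /scrapertrap/ in your robots.txt.
You can make your scrapertrap.php do something like block access for the IP address that visited it, or force a captcha for all subsequent requests from that IP.
- Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
- You can / should combine this with the previous tip of changing your HTML frequently.
- Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. Also consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try enabling it only sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
- Malicious people can prevent access for real users by sharing a link to your honeypot, or even by embedding that link somewhere (e.g. on a forum). Change the URL frequently, and make any ban times relatively short.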
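The answer's trap is a PHP page (scrapertrap.php); purely as an illustration of the same logic, here is a small Python sketch with short-lived bans, as recommended in the last point. The class name `HoneypotBans`, the one-hour default, and the in-memory dictionary are assumptions made for the example.

```python
import time


class HoneypotBans:
    """Track short-lived bans for clients that request the trap URL."""

    def __init__(self, trap_prefix="/scrapertrap/", ban_seconds=3600, clock=time.time):
        self.trap_prefix = trap_prefix
        self.ban_seconds = ban_seconds
        self.clock = clock  # injectable for testing
        self._banned_until = {}

    def observe(self, ip, path):
        """Call for every request; returns 'captcha' for banned clients, else 'allow'."""
        now = self.clock()
        if path.startswith(self.trap_prefix):
            self._banned_until[ip] = now + self.ban_seconds
        expiry = self._banned_until.get(ip)
        if expiry is not None and now < expiry:
            return "captcha"
        self._banned_until.pop(ip, None)  # ban expired or never existed
        return "allow"
```

Returning a captcha rather than a hard block limits the damage if someone tricks real users into requesting the trap URL.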
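Serve fake and useless data if you detect a scraper
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.
As an example: You have a news website. If you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.
Don't accept requests if the User Agent is empty / missing
Often, lazily written scrapers will not send a User Agent header with their requests, whereas all browsers and search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else.)
The header is trivial to spoof, but as a measure against poorly written scrapers this is worth implementing.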
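Don't accept requests if the User Agent is a common scraper one; blacklist the ones used by scrapers
In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:
- "Mozilla" (Just that, nothing else. I've seen a few questions about scraping here that use it. A real browser will never use only that.)
- "Java 1.7.43_u43" (By default, Java's HttpUrlConnection uses something like this.)
- "BIZCO EasyScraping Studio 2.0"
- "wget", "curl", "libcurl", ... (Wget and cURL are sometimes used for basic scraping.)
If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.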
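The two User Agent checks above could be combined roughly like this. It is a Python sketch with patterns derived from the examples in the list; the names `BLOCKED_AGENT_PATTERNS` and `classify_user_agent` are invented for illustration, and the pattern list is deliberately incomplete.

```python
import re

# Patterns taken from the examples above; extend the list as you discover more.
BLOCKED_AGENT_PATTERNS = [
    re.compile(r"^Mozilla$"),                       # the bare word, nothing else
    re.compile(r"^Java[ /]"),                       # default Java HTTP client style
    re.compile(r"EasyScraping", re.I),
    re.compile(r"\b(wget|curl|libcurl)\b", re.I),
]


def classify_user_agent(user_agent):
    """Return 'missing', 'blocked', or 'ok' for a User-Agent header value."""
    if user_agent is None or not user_agent.strip():
        return "missing"
    value = user_agent.strip()
    if any(pattern.search(value) for pattern in BLOCKED_AGENT_PATTERNS):
        return "blocked"
    return "ok"
```

What to do with 'missing' or 'blocked' (captcha, limit, fake data) is a separate decision, as discussed above.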
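If it doesn't request assets (CSS, images), it's not a real browser.
A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't, as they are only interested in the actual pages and their content.
You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
Beware that search engine bots, ancient mobile devices, screen readers, and misconfigured devices may not request assets either.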
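A possible way to run this check over parsed log entries, sketched in Python. The input format (pairs of client id and request path), the suffix list, and the threshold of 20 pages are assumptions for the example, not something the answer specifies.

```python
from collections import defaultdict

ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".woff2")


def html_only_clients(requests, min_pages=20):
    """Return clients that fetched at least `min_pages` pages but no assets.

    `requests` is an iterable of (client_id, path) pairs, e.g. parsed from
    an access log. Query strings are ignored when classifying a path.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    for client_id, path in requests:
        bare_path = path.split("?", 1)[0].lower()
        if bare_path.endswith(ASSET_SUFFIXES):
            assets[client_id] += 1
        else:
            pages[client_id] += 1
    return sorted(c for c, n in pages.items() if n >= min_pages and assets[c] == 0)
```

Given the caveat about search engine bots and screen readers, the result is better treated as a list of candidates to look at than as a ban list.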
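Use and require cookies; use them to track user and scraper actions.
You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers, but it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.
For example: When the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.
Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.
Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve the cookie and send it with their request.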
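For the "unique identifying cookie" in the example above, one option is a random id plus a signature, so that the server can reject made-up values. The following Python sketch assumes a server-side secret; `COOKIE_SECRET`, `issue_cookie`, and `visitor_id_from_cookie` are hypothetical names.

```python
import hashlib
import hmac
import secrets

COOKIE_SECRET = b"change-me"  # hypothetical server-side secret


def issue_cookie(secret=COOKIE_SECRET):
    """Create a cookie value of the form '<random id>.<signature>'."""
    visitor_id = secrets.token_hex(16)
    signature = hmac.new(secret, visitor_id.encode(), hashlib.sha256).hexdigest()
    return f"{visitor_id}.{signature}"


def visitor_id_from_cookie(cookie_value, secret=COOKIE_SECRET):
    """Return the visitor id if the signature is valid, otherwise None."""
    visitor_id, sep, signature = (cookie_value or "").partition(".")
    if not sep:
        return None
    expected = hmac.new(secret, visitor_id.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(signature.encode(), expected.encode()):
        return visitor_id
    return None
```

The returned visitor id can then serve as the key for per-user counting, for instance of how many search results one visitor has opened.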
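Use JavaScript + Ajax to load your content
You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.
Be aware of:
- Using JavaScript to load the actual content will degrade user experience and performance.
- Search engines may not run JavaScript either, which prevents them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.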
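Obfuscate your markup, network requests from scripts, and everything else.
If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64, or something more complex), and then decode and display it on the client after fetching it via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to request data directly from your endpoints, as they will have to reverse-engineer your descrambling algorithm.
- If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, e.g. by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.
- You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which avoids the extra network requests. Doing this will make it significantly harder to extract the data using an HTML-only parser which does not run JavaScript, as the person writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
- You might want to change your obfuscation methods regularly, to break scrapers that have figured them out.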
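A server-side sketch of the two ideas above, in Python: a session-bound token that the Ajax endpoint requires, and a trivially reversible encoding of the payload. The byte-reversal step, the secret, and all function names are assumptions made for illustration; this is obfuscation, not encryption, and the browser-side JavaScript would need a matching decoder.

```python
import base64
import hashlib
import hmac
import json

ENDPOINT_SECRET = b"change-me"  # hypothetical server-side secret


def endpoint_token(session_id, secret=ENDPOINT_SECRET):
    """Token embedded in the page; the Ajax endpoint rejects requests without it."""
    return hmac.new(secret, session_id.encode(), hashlib.sha256).hexdigest()


def token_is_valid(session_id, token, secret=ENDPOINT_SECRET):
    """Check that the token was issued for this session."""
    expected = endpoint_token(session_id, secret)
    return hmac.compare_digest(expected.encode(), token.encode())


def obfuscate(payload):
    """Serialize to JSON, reverse the bytes, then base64-encode."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(raw[::-1]).decode("ascii")


def deobfuscate(text):
    """Inverse of obfuscate(); the client-side script would mirror this."""
    raw = base64.b64decode(text)[::-1]
    return json.loads(raw.decode("utf-8"))
```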
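There are several disadvantages to doing something like this, though:
- It will be tedious and difficult to implement, maintain, and debug.
- It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript, though.)
- It will make your site nonfunctional for real users if they have JavaScript disabled.
- Performance and page load times will suffer.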
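Non-technical:
- Tell people not to scrape, and some will respect it.
- Find a lawyer.
- Make your data available, provide an API:
  You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.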
Miscellaneous:
- There are also commercial scraping protection services, such as the anti-scraping offered by Cloudflare or Distil Networks (details on how it works here), which do these things, and more, for you.
- Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, so find compromises.
- Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
- Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.
Further reading:
- Wikipedia's article on Web scraping. Many details on the technologies involved and the different types of web scraper.
- Stopping scripters from slamming your website hundreds of times a second. Q & A on a very similar problem – bots checking a website and buying things as soon as they go on sale. A lot of relevant info, esp. on Captchas and rate-limiting.
Sue 'em.
Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You could really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease & desist or its equivalent in your country. You may be able to at least scare the bastards.
Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.
It would be a shame if this would drive you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out.)
Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money with is something that you should be able to fight against.
There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc. and appear as a normal user. The only thing you can do is make the text not available at the time the page is loaded – make it with image, flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.
If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.
Provide an XML API to access your data; in a manner that is simple to use. If people want your data, they'll get it, you might as well go all out.
This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
Then all you have to do is convince the people who want your data to use the API. ;)
Sorry, it's really quite hard to do this…
I would suggest that you politely ask them to not use your content (if your content is copyrighted).
If it is and they don't take it down, then you can take further action and send them a cease and desist letter.
Generally, whatever you do to prevent scraping will probably end up with a more negative effect, eg accessibility, bots/spiders, etc.
Okay, as all posts say, if you want to make your site search engine-friendly, then bots can scrape it for sure.
But you can still do a few things, and they may be effective against 60-70% of scraping bots.
Make a checker script like the one below.
If a particular IP address is visiting very fast, then after a few visits (5-10) put its IP address and browser information in a file or database.
The next step
(This would be a background process, running all the time or scheduled every few minutes.) Make another script that keeps checking those suspicious IP addresses.
Case 1. The user agent is that of a known search engine like Google, Bing, or Yahoo (you can find more information on user agents by googling it). In that case, check http://www.iplists.com/ and try to match the IP address against the patterns in that list. If it seems like a faked user agent, ask the visitor to fill in a CAPTCHA on the next visit. (You need to research bot IP addresses a bit more. I know this is achievable; also try a whois lookup of the IP address. It can be helpful.)
Case 2. The user agent is not that of a search bot: Simply ask the visitor to fill in a CAPTCHA on the next visit.
I have done a lot of web scraping and summarized some techniques to stop web scrapers on my blog based on what I find annoying.
It is a tradeoff between your users and scrapers. If you limit IPs, use CAPTCHAs, require login, etc., you make life difficult for the scrapers. But this may also drive away your genuine users.
Your best option is unfortunately fairly manual: Look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.
Since you're talking about a public site, making the site search-engine friendly will also make the site scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.
From a tech perspective: Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.
From a legal perspective: It sounds like the data you're publishing is not proprietary. Meaning you're publishing names and stats and other information that cannot be copyrighted.
If this is the case, the scrapers are not violating copyright by redistributing your information about artist name etc. However, they may be violating copyright when they load your site into memory because your site contains elements that are copyrightable (like layout etc).
I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.
But, assuming you're publishing public domain information that's not copyrightable like names and basic stats… you should just let it go in the name of free speech and open data. That is, what the web's all about.
Things that might work against beginner scrapers:
- IP blocking
- use lots of ajax
- check referer request header
- require login
Things that will help in general:
- change your layout every week
- robots.txt
Things that will help but will make your users hate you:
- CAPTCHA
Late answer – and also this answer probably isn't the one you want to hear…
I myself have already written many (many tens) of different specialized data-mining scrapers (just because I like the "open data" philosophy).
There is already a lot of advice in the other answers – now I will play the devil's advocate role and extend and/or correct their effectiveness.
First:
- if someone really wants your data
- you can't effectively (technically) hide your data
- if the data should be publicly accessible to your "regular users"
Trying to use technical barriers isn't worth the trouble they cause:
- to your regular users, by worsening their user experience
- to regular and welcome bots (search engines)
- etc.
Plain HTML – the easiest way is to parse plain HTML pages with a well-defined structure and CSS classes. E.g. it is enough to inspect an element with Firebug and use the right XPaths and/or CSS paths in my scraper.
You could generate the HTML structure dynamically and also, you can generate dynamically the CSS class-names (and the CSS itself too) (eg by using some random class names) – but
- you want to present the informations to your regular users in consistent way
- eg again – it is enough to analyze the page structure once more to setup the scraper.
- and it can be done automatically by analyzing some "already known content"
- once someone already knows (by earlier scrape), eg:
- what contains the informations about "phil collins"
- enough display the "phil collins" page and (automatically) analyze how the page is structured "today" 🙂
You can't change the structure for every response, because your regular users will hate you. Also, this will cause more troubles for you (maintenance) not for the scraper. The XPath or CSS path is determinable by the scraping script automatically from the known content.
Ajax – little bit harder in the start, but many times speeds up the scraping process 🙂 – why?
When analyzing the requests and responses, i just setup my own proxy server (written in perl) and my firefox is using it. Of course, because it is my own proxy – it is completely hidden – the target server see it as regular browser. (So, no X-Forwarded-for and such headers). Based on the proxy logs, mostly is possible to determine the "logic" of the ajax requests, eg i could skip most of the html scraping, and just use the well-structured ajax responses (mostly in JSON format).
So, the ajax doesn't helps much…
Some more complicated are pages which uses much packed javascript functions .
Here is possible to use two basic methods:
- unpack and understand the JS and create a scraper which follows the Javascript logic (the hard way)
- or (preferably using by myself) – just using Mozilla with Mozrepl for scrape. Eg the real scraping is done in full featured javascript enabled browser, which is programmed to clicking to the right elements and just grabbing the "decoded" responses directly from the browser window.
Such scraping is slow (the scraping is done as in regular browser), but it is
- very easy to setup and use
- and it is nearly impossible to counter it 🙂
- and the "slowness" is needed anyway to counter the "blocking the rapid same IP based requests"
The User-Agent based filtering doesn't helps at all. Any serious data-miner will set it to some correct one in his scraper.
Require Login – doesn't helps. The simplest way beat it (without any analyze and/or scripting the login-protocol) is just logging into the site as regular user, using Mozilla and after just run the Mozrepl based scraper…
Remember, the require login helps for anonymous bots, but doesn't helps against someone who want scrape your data. He just register himself to your site as regular user.
Using frames isn't very effective also. This is used by many live movie services and it not very hard to beat. The frames are simply another one HTML/Javascript pages what are needed to analyze… If the data worth the troubles – the data-miner will do the required analyze.
IP-based limiting isn't effective at all – here are too many public proxy servers and also here is the TOR… 🙂 It doesn't slows down the scraping (for someone who really wants your data).
Very hard is scrape data hidden in images. (eg simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times – but honestly – the data must worth the troubles for the scraper. (which many times doesn't worth).
On the other side, your users will hate you for this. Myself, (even when not scraping) hate websites which doesn't allows copy the page content into the clipboard (because the information are in the images, or (the silly ones) trying to bond to the right click some custom Javascript event. 🙂
The hardest are the sites which using java applets or flash , and the applet uses secure https requests itself internally . But think twice – how happy will be your iPhone users… ;). Therefore, currently very few sites using them. Myself, blocking all flash content in my browser (in regular browsing sessions) – and never using sites which depends on Flash.
Your milestones could be…, so you can try this method – just remember – you will probably lose some of your users. Also remember, some SWF files are decompilable. ;)
Captcha (the good ones – like reCaptcha) helps a lot – but your users will hate you… – just imagine, how your users will love you when they need solve some captchas in all pages showing informations about the music artists.
Probably don't need to continue – you already got into the picture.
Now what you should do:
Remember: It is nearly impossible to hide your data, if you on the other side want publish them (in friendly way) to your regular users.
So,
- make your data easily accessible – by some API
- this allows the easy data access
- eg offload your server from scraping – good for you
- setup the right usage rights (eg for example must cite the source)
- remember, many data isn't copyright-able – and hard to protect them
- add some fake data (as you already done) and use legal tools
- as others already said, send an "cease and desist letter"
- other legal actions (sue and like) probably is too costly and hard to win (especially against non US sites)
Think twice before you will try to use some technical barriers.
Rather as trying block the data-miners, just add more efforts to your website usability. Your user will love you. The time (&energy) invested into technical barriers usually aren't worth – better to spend the time to make even better website…
Also, data-thieves aren't like normal thieves.
If you buy an inexpensive home alarm and add an warning "this house is connected to the police" – many thieves will not even try to break into. Because one wrong move by him – and he going to jail…
So, you investing only few bucks, but the thief investing and risk much.
But the data-thief has no such risks. Just the opposite – if you make one wrong move (e.g. if you introduce some bug as a result of technical barriers), you will lose your users. If the scraping bot does not work the first time, nothing happens – the data-miner will just try another approach and/or debug the script.
In this case, you need invest much more – and the scraper investing much less.
Just think where you want invest your time & energy…
Ps: english isn't my native – so forgive my broken english…
Sure it's possible. For 100% success, take your site offline.
In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).
You can do things like require several seconds between the first connection to your site, and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.
I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.
There are a few things you can do to try and prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexes.
However, I assume that if you don't want it scraped that means you don't want search engines to index it either.
Here are some things you can try:
- Show the text in an image. This is quite reliable, and is less of a pain on the user than a CAPTCHA, but means they won't be able to cut and paste and it won't scale prettily or be accessible.
- Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
- Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally – a screen-scraper might set up an account and might cleverly program their script to log in for them.
- If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
- You can set up a black list of known screen scraper user-agent strings as you discover them. Again, this will only help the lazily-coded ones; a programmer who knows what he's doing can set a user-agent string to impersonate a web browser.
- Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.
If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them I guess.
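The "change the URL path often" idea from the list above could be scripted along these lines. This is a Python sketch under stated assumptions: the secret, the six-hour interval, and the function names `path_token` and `path_is_current` are all invented for the example. The previous interval's token is still accepted, so a page that is already open in someone's browser keeps working for a while.

```python
import hashlib
import hmac

PATH_SECRET = b"change-me"   # hypothetical server-side secret
ROTATION_SECONDS = 6 * 3600  # hypothetical rotation interval


def path_token(timestamp, secret=PATH_SECRET, interval=ROTATION_SECONDS):
    """Hard-to-predict path segment that stays constant within one interval."""
    slot = int(timestamp // interval)
    return hmac.new(secret, str(slot).encode(), hashlib.sha256).hexdigest()[:12]


def path_is_current(token, now, secret=PATH_SECRET, interval=ROTATION_SECONDS):
    """Accept the current token and the one from the previous interval."""
    candidates = (
        path_token(now, secret, interval),
        path_token(now - interval, secret, interval),
    )
    return any(hmac.compare_digest(token.encode(), c.encode()) for c in candidates)
```

Pages would then link to something like `/<token>/artists/...`, with the token taken from `path_token(time.time())` at render time.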
- No, it's not possible to stop (in any way)
- Embrace it. Why not publish as RDFa and become super search engine friendly and encourage the re-use of data? People will thank you and provide credit where due (see musicbrainz as an example).
It is not the answer you probably want, but why hide what you're trying to make public?
Method One (Small Sites Only):
Serve encrypted / encoded data.
I scrape the web using Python (urllib, requests, BeautifulSoup, etc.) and have found many websites that serve encrypted / encoded data that is not decryptable in any programming language, simply because the encryption method does not exist.
I achieved this in a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites); the response was always jumbled content.
Example of minimizing output in PHP ( How to minify php page html output? ):
<?php
function sanitize_output($buffer) {
    $search = array(
        '/\>[^\S ]+/s',  // strip whitespaces after tags, except space
        '/[^\S ]+\</s',  // strip whitespaces before tags, except space
        '/(\s)+/s'       // shorten multiple whitespace sequences
    );
    $replace = array('>', '<', '\\1');
    $buffer = preg_replace($search, $replace, $buffer);
    return $buffer;
}

ob_start("sanitize_output");
?>
Method Two:
If you can't stop them, screw them over: serve fake / useless data as a response.
Method Three:
Block common scraping user agents. You'll see this on major / large websites, as it is impossible to scrape them with "python3.4" as your User-Agent.
Method Four:
Make sure all the user headers are valid, I sometimes provide as many headers as possible to make my scraper seem like an authentic user, some of them are not even true or valid like en-FU :).
Here is a list of some of the headers I commonly provide.
headers = {
    "Requested-URI": "/example",
    "Request-Method": "GET",
    "Remote-IP-Address": "656.787.909.121",
    "Remote-IP-Port": "69696",
    "Protocol-version": "HTTP/1.1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-FU,en;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Dnt": "1",
    "Host": "http://example.com",
    "Referer": "http://example.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
}
Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.
Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.
A quick approach to this would be to set a booby trap for bots.
-
Make a page that, if it's opened a certain number of times or even opened at all, will collect certain information like the IP and whatnot (you can also look for irregularities or patterns, but this page shouldn't have to be opened at all).
-
Make a link to this page that is hidden with CSS (display:none; or left:-9999px; position:absolute;). Try to place it where it is less likely to be ignored, such as within your content rather than in your footer, as bots sometimes skip certain parts of a page.
-
In your robots.txt file, set a whole bunch of disallow rules for pages you don't want friendly bots (LOL, like they have happy faces!) to gather information on, and make this page one of them.
-
Now, if a friendly bot comes through, it should ignore that page. Right, but that still isn't good enough. Make a couple more of these pages, or somehow re-route a page to accept different names, and then add more disallow rules for these trap pages to your robots.txt file alongside the pages you want ignored.
-
Collect the IPs of these bots or of anyone who enters these pages. Don't ban them; instead, make a function that displays noodled text in your content, like random numbers, copyright notices, specific text strings, or scary pictures: basically anything to spoil your good content. You can also set links that point to a page that takes forever to load, i.e. in PHP you can use the sleep() function. This will fight the crawler back if it has some sort of detection to bypass pages that take way too long to load, as some well-written bots are set to process X links at a time.
-
If you have made specific text strings/sentences, why not go to your favorite search engine and search for them? It might show you where your content is ending up.
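The trap steps above can be sketched roughly as follows in Python. This is only an illustration under assumptions: the trap paths, the in-memory `flagged_ips` set, and both function names are hypothetical, and a real site would persist the flagged IPs and hook these calls into its request handling.

```python
import random

# Hypothetical trap URLs; each should also be listed under Disallow in robots.txt.
TRAP_PATHS = {"/jail.html", "/trap/listing.html"}

# IPs that have requested a trap page (kept in memory for this sketch only).
flagged_ips = set()


def record_request(ip, path):
    """Flag the client IP if it requested one of the trap pages."""
    if path in TRAP_PATHS:
        flagged_ips.add(ip)


def render_content(ip, text, rng=random):
    """Return the real text for normal visitors and scrambled text for flagged IPs."""
    if ip not in flagged_ips:
        return text
    words = text.split()
    rng.shuffle(words)  # the page still looks like content, but is useless
    # Add a random number so repeated scrapes do not even agree with each other.
    words.insert(rng.randrange(len(words) + 1), str(rng.randrange(10000)))
    return " ".join(words)
```

Serving degraded content instead of an error page means the scraper gets no obvious signal that it has been caught.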
Anyway, if you think tactically and creatively this could be a good starting point. The best thing to do would be to learn how a bot works.
I'd also think about scrambling some IDs, or the order in which the attributes of a page element are written:
<a class="someclass" href="../xyz/abc" rel="nofollow" title="sometitle">
so that it changes its form every time, as some bots might be set to look for specific patterns in your pages or targeted elements.
<a title="sometitle" href="../xyz/abc" rel="nofollow" class="someclass">
id="p-12802" > id="p-00392"
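One way to produce this kind of variation is to shuffle the attributes at render time. The Python sketch below assumes the attributes are held in a dict; `render_anchor` and `random_element_id` are hypothetical helper names, not part of any framework.

```python
import html
import random


def render_anchor(attributes, rng=random):
    """Render an opening <a> tag with its attributes in a random order."""
    items = list(attributes.items())
    rng.shuffle(items)
    rendered = " ".join(
        '{}="{}"'.format(name, html.escape(value, quote=True))
        for name, value in items
    )
    return "<a {}>".format(rendered)


def random_element_id(rng=random):
    """Return an id in the p-NNNNN style with a fresh random number."""
    return "p-{:05d}".format(rng.randrange(100000))
```

Any CSS or JavaScript that refers to a randomized id has to be generated from the same value, otherwise the page itself breaks.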
You can't stop normal screen scraping. For better or worse, it's the nature of the web.
You can make it so that no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache. I assume it wouldn't be too difficult to do in IIS either.
One way would be to serve the content as XML attributes, URL encoded strings, preformatted text with HTML encoded JSON, or data URIs, then transform it to HTML on the client. Here are a few sites which do this:
-
Skechers : XML
<document filename="" height="" width="" title="SKECHERS" linkType="" linkUrl="" imageMap="" href="http://www.bobsfromskechers.com" alt="BOBS from Skechers" title="BOBS from Skechers" />
-
Chrome Web Store : JSON
<script type="text/javascript" src="https://apis.google.com/js/plusone.js">{"lang": "en", "parsetags": "explicit"}</script>
-
Bing News : data URL
<script type="text/javascript"> //<![CDATA[ (function() { var x;x=_ge('emb7'); if(x) { x.src='*...*/'; } }() )
-
Protopage : URL Encoded Strings
unescape('Rolling%20Stone%20%3a%20Rock%20and%20Roll%20Daily')
-
TiddlyWiki : HTML Entities + preformatted JSON
<pre> {"tiddlers": { "GettingStarted": { "title": "GettingStarted", "text": "Welcome to TiddlyWiki, } } } </pre>
-
Amazon : Lazy Loading
amzn.copilot.jQuery=i;amzn.copilot.jQuery(document).ready(function(){d(b);f(c,function() {amzn.copilot.setup({serviceEndPoint:h.vipUrl,isContinuedSession:true})})})},f=function(i,h){var j=document.createElement("script");j.type="text/javascript";j.src=i;j.async=true;j.onload=h;a.appendChild(j)},d=function(h){var i=document.createElement("link");i.type="text/css";i.rel="stylesheet";i.href=h;a.appendChild(i)}})(); amzn.copilot.checkCoPilotSession({jsUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-js/cs-copilot-customer-js-min-1875890922._V1_.js', cssUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-css/cs-copilot-customer-css-min-2367001420._V1_.css', vipUrl : 'https://copilot.amazon.com'
-
XMLCalabash : Namespaced XML + Custom MIME type + Custom File extension
<p:declare-step type="pxp:zip"> <p:input port="source" sequence="true" primary="true"/> <p:input port="manifest"/> <p:output port="result"/> <p:option name="href" required="true" cx:type="xsd:anyURI"/> <p:option name="compression-method" cx:type="stored|deflated"/> <p:option name="compression-level" cx:type="smallest|fastest|default|huffman|none"/> <p:option name="command" select="'update'" cx:type="update|freshen|create|delete"/> </p:declare-step>
If you view source on any of the above, you see that scraping will simply return metadata and navigation.
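As a rough illustration of the URL-encoded-string variant, the Python sketch below percent-encodes the text on the server and emits a script that decodes it in the browser. The function names are hypothetical, and it uses `decodeURIComponent` on the client rather than the `unescape` call shown in the Protopage example.

```python
from urllib.parse import quote, unquote


def encode_for_client(text):
    """Percent-encode text so it does not appear verbatim in the HTML source."""
    # safe="" also encodes "/" and quotes, so the result can sit inside a
    # single-quoted JavaScript string.
    return quote(text, safe="")


def script_snippet(text):
    """Return a script tag that decodes the text in the browser and writes it out."""
    return "<script>document.write(decodeURIComponent('{}'))</script>".format(
        encode_for_client(text)
    )
```

This only deters scrapers that neither run JavaScript nor bother to decode the string themselves.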
I agree with most of the posts above, and I'd like to add that the more search-engine-friendly your site is, the more scrapeable it will be. You could try a couple of fairly unconventional things that make it harder for scrapers, but they might also affect your searchability. It depends on how well you want your site to rank on search engines, of course.
Putting your content behind a captcha would mean that robots would find it difficult to access your content. However, humans would be inconvenienced so that may be undesirable.
If you want to see a great example, check out http://www.bkstr.com/ . They use a JavaScript algorithm to set a cookie, then reload the page so it can use the cookie to validate that the request is being run within a browser. A desktop app built to scrape could definitely get past this, but it would stop most cURL-type scraping.
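The general idea can be sketched in Python as follows. This is not the algorithm that site actually uses; the HMAC scheme, the `js_check` cookie name, `SECRET_KEY`, and all function names are assumptions chosen for illustration.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from configuration and rotate it.
SECRET_KEY = b"change-me"


def expected_token(client_ip, user_agent):
    """Derive the cookie value the challenge script is expected to set."""
    message = (client_ip + "|" + user_agent).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def challenge_page(client_ip, user_agent):
    """Return a tiny page whose script sets the cookie and reloads."""
    token = expected_token(client_ip, user_agent)
    # Split the token so it never appears as one literal string in the source.
    half = len(token) // 2
    return (
        "<script>document.cookie='js_check='+'{}'+'{}'+';path=/';"
        "location.reload();</script>"
    ).format(token[:half], token[half:])


def passes_check(client_ip, user_agent, cookie_value):
    """Return True if the request carries the cookie set by the challenge script."""
    if not cookie_value:
        return False
    expected = expected_token(client_ip, user_agent).encode("utf-8")
    return hmac.compare_digest(expected, cookie_value.encode("utf-8"))
```

A request without a valid cookie gets `challenge_page` instead of the content; a client that does not execute JavaScript never sets the cookie and so never gets past it.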
Most of this has already been said, but have you considered CloudFlare's protection?
Other companies probably do this too; CloudFlare is the only one I know of.
I'm pretty sure that would complicate their work. I also once got automatically IP-banned for 4 months when I tried to scrape data from a site protected by CloudFlare, due to its rate limit (I used a simple AJAX request loop).
Screen scrapers work by processing HTML. If they are determined to get your data, there is not much you can do technically, because anything the human eyeball can process can be scraped. As has already been pointed out, though, you may have some legal recourse, and that would be my recommendation.
However, you can hide the critical part of your data by using non-HTML-based presentation logic:
- Generate a Flash file for each artist/album, etc.
- Generate an image for each artist's content. Maybe just an image for the artist name, etc. would be enough. Do this by rendering the text onto a JPEG / PNG file on the server and linking to that image.
Bear in mind that this would probably affect your search rankings.
Generate the HTML, CSS and JavaScript. It is easier to write generators than parsers, so you could generate each served page differently. You can no longer use a cache or static content then.