
Blocking AI Spiders and Preventing Scraping of Website Articles


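Method 1: Host the domain's DNS on Cloudflare and block AI crawlers with one click

If you cannot reach Cloudflare, you will need to sort out a proxy yourself.
(For domains registered in China this has almost no effect on access speed. Some people assume a domestic DNS is faster, but in practice the speed is about the same.)

Method 2: Block AI crawlers in the BaoTa (宝塔) firewall settings (I use a cracked version of BaoTa; I don't know whether the free version offers this setting)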

    Amazonbot
    ClaudeBot
    PetalBot
    gptbot
    Ahrefs
    Semrush
    Imagesift
    Teoma
    ia_archiver
    twiceler
    MSNBot
    Scrubby
    Robozilla
    Gigabot
    yahoo-mmcrawler
    yahoo-blogs/v3.9
    psbot
    Scrapy
    SemrushBot
    AhrefsBot
    Applebot
    AspiegelBot
    DotBot
    DataForSeoBot
    java
    MJ12bot
    python
    seo
    Censys
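
To see which User-Agents a keyword list like the one above would catch, here is a minimal Python sketch. It assumes the firewall treats each entry as a case-insensitive substring of the User-Agent header; that matching rule is an assumption, not something confirmed for BaoTa. The function name `matched_keyword` and the sample User-Agent strings are hypothetical.

```python
# Keyword list copied from the post (Method 2).
KEYWORDS = [
    "Amazonbot", "ClaudeBot", "PetalBot", "gptbot", "Ahrefs", "Semrush",
    "Imagesift", "Teoma", "ia_archiver", "twiceler", "MSNBot", "Scrubby",
    "Robozilla", "Gigabot", "yahoo-mmcrawler", "yahoo-blogs/v3.9", "psbot",
    "Scrapy", "SemrushBot", "AhrefsBot", "Applebot", "AspiegelBot", "DotBot",
    "DataForSeoBot", "java", "MJ12bot", "python", "seo", "Censys",
]


def matched_keyword(user_agent, keywords=KEYWORDS):
    """Return the first keyword found in the User-Agent, or None.

    Assumption: matching is a case-insensitive substring test.
    """
    ua = user_agent.lower()
    for keyword in keywords:
        if keyword.lower() in ua:
            return keyword
    return None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
        "python-requests/2.31.0",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ]
    for ua in samples:
        print(matched_keyword(ua), "<-", ua)
```

Short entries such as `java`, `python`, and `seo` match any User-Agent that merely contains those letters, so under this assumption they are the ones most likely to catch unintended clients.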



Method 3: Copy the code below, save it as robots.txt, and upload it to the site's root directory

    User-agent: Ahrefs
    Disallow: /

    User-agent: Semrush
    Disallow: /

    User-agent: Imagesift
    Disallow: /

    User-agent: Amazonbot
    Disallow: /

    User-agent: gptbot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PetalBot
    Disallow: /

    User-agent: Baiduspider
    Disallow:

    User-agent: Sosospider
    Disallow:

    User-agent: sogou spider
    Disallow:

    User-agent: YodaoBot
    Disallow:

    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: Teoma
    Disallow: /

    User-agent: ia_archiver
    Disallow: /

    User-agent: twiceler
    Disallow: /

    User-agent: MSNBot
    Disallow: /

    User-agent: Scrubby
    Disallow: /

    User-agent: Robozilla
    Disallow: /

    User-agent: Gigabot
    Disallow: /

    User-agent: googlebot-image
    Disallow:

    User-agent: googlebot-mobile
    Disallow:

    User-agent: yahoo-mmcrawler
    Disallow: /

    User-agent: yahoo-blogs/v3.9
    Disallow: /

    User-agent: psbot
    Disallow:

    User-agent: dotbot
    Disallow: /
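
Before uploading, the rules can be sanity-checked locally. The sketch below uses Python's standard `urllib.robotparser` on a short excerpt of the file above (not the full list). The helper name `allowed` and the test path are hypothetical. Keep in mind that robots.txt only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt from the post: two blocked AI crawlers,
# two search engines that stay allowed (empty Disallow = allow everything).
ROBOTS_TXT = """\
User-agent: gptbot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Baiduspider
Disallow:

User-agent: Googlebot
Disallow:
"""


def allowed(user_agent, path="/thread-1.html"):
    """Return True if robots.txt lets this user agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ["GPTBot", "ClaudeBot", "Baiduspider", "Googlebot"]:
        print(agent, "allowed" if allowed(agent) else "blocked")
```

The same helper can be pointed at the full file by reading it from disk and passing its lines to `parse`.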



Method 4: Prevent the site from being scraped (save the following code in the BaoTa configuration file)

    # Block scraping tools such as Scrapy
    if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
        return 403;
    }

    # Block the listed UAs and requests with an empty UA
    if ($http_user_agent ~* "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|LinkpadBot|Ezooms|^$" ) {
        return 403;
    }

    # Block request methods other than GET, HEAD, and POST
    if ($request_method !~ ^(GET|HEAD|POST)$) {
        return 403;
    }


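After adding the rules, save the file and restart nginx. These spiders and tools will then receive a 403 Forbidden response when they scan the site.
Note: if your site publishes content through LocoySpider (火车头), the code above returns a 403 error and publishing fails. To keep LocoySpider publishing working, use the code below instead; it is the same except that the empty-User-Agent pattern `^$` is removed: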

    # Block scraping tools such as Scrapy
    if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
        return 403;
    }

    # Block the listed UAs (requests with an empty UA are allowed in this variant)
    if ($http_user_agent ~* "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|LinkpadBot|Ezooms" ) {
        return 403;
    }

    # Block request methods other than GET, HEAD, and POST
    if ($request_method !~ ^(GET|HEAD|POST)$) {
        return 403;
    }
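
The two nginx variants above are meant to differ only in the `^$` alternative, which matches an empty User-Agent. The Python sketch below illustrates that effect with an abridged blocklist (a handful of entries from the post, not the full list). It uses Python's `re` module as a stand-in for nginx's case-insensitive `~*` match, which is an approximation, and the names `STRICT`, `RELAXED`, and `status` are hypothetical.

```python
import re

# Abridged excerpt of the User-Agent alternation from the post.
BASE = "Scrapy|HttpClient|Python-urllib|Java|MJ12bot|AhrefsBot"

# First variant: listed UAs plus the empty UA (`^$`).
STRICT = re.compile(BASE + "|^$", re.IGNORECASE)
# Second variant: listed UAs only, so an empty UA gets through.
RELAXED = re.compile(BASE, re.IGNORECASE)


def status(pattern, user_agent):
    """Mimic the `if (...) { return 403; }` rule: 403 on match, else 200."""
    return 403 if pattern.search(user_agent) else 200


if __name__ == "__main__":
    for ua in ["", "Python-urllib/3.11", "Mozilla/5.0 (compatible; Googlebot/2.1)"]:
        print(repr(ua), "strict:", status(STRICT, ua), "relaxed:", status(RELAXED, ua))
```

Only the empty User-Agent is treated differently by the two patterns, which suggests the publishing tool sends requests without a User-Agent header; the post does not state this explicitly.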

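Once everything is set up, you can simulate a crawl to check whether any legitimate spiders were blocked by mistake. Note: the spider names blocked above do not include the following six common spiders:

Baidu spider: Baiduspider
Google spider: Googlebot
Bing spider: bingbot
Sogou spider: Sogou web spider
360 spider: 360Spider
Shenma spider: YisouSpider

Common crawler User-Agents are as follows: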

    User-Agent               Purpose
    FeedDemon                content scraping
    BOT/0.1 (BOT for JCE)    SQL injection
    CrawlDaddy               SQL injection
    Java                     content scraping
    Jullo                    content scraping
    Feedly                   content scraping
    UniversalFeedParser      content scraping
    ApacheBench              CC attack tool
    Swiftbot                 useless crawler
    YandexBot                useless crawler
    AhrefsBot                useless crawler
    jikeSpider               useless crawler
    MJ12bot                  useless crawler
    ZmEu                     phpMyAdmin vulnerability scanning
    WinHttp                  scraping, CC attacks
    EasouSpider              useless crawler
    HttpClient               TCP attack
    Microsoft URL Control    scanning
    YYSpider                 useless crawler
    jaunty                   WordPress brute-force scanner
    oBot                     useless crawler
    Python-urllib            content scraping
    Indy Library             scanning
    FlightDeckReports Bot    useless crawler
    Linguee Bot              useless crawler
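
The "simulate a crawl" check mentioned above can be approximated offline before touching the live server. The sketch below replays the three Method 4 rules (the variant that also blocks empty UAs) in Python. It assumes Python's `re` behaves like nginx's regex matching for these simple alternations, and it uses only the spider name tokens listed in the post as test input; real User-Agent strings are longer, so substitute full strings from your own access logs. The function name `simulated_status` is hypothetical.

```python
import re

# Rule 1 from the post: scraping tools (case-insensitive, like nginx `~*`).
TOOLS = re.compile(r"(Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)", re.IGNORECASE)

# Rule 2 from the post: listed UAs plus the empty UA (`^$`).
BLOCKED_UA = re.compile(
    "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|"
    "Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|"
    "researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|"
    "ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|"
    "HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|"
    "MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|"
    "SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|"
    "PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|"
    "Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|"
    "Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|"
    "Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|"
    "Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|"
    "Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|"
    "CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|"
    "Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|"
    "lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|"
    "LinkpadBot|Ezooms|^$",
    re.IGNORECASE,
)

# Rule 3 from the post: only GET, HEAD, and POST (case-sensitive, like nginx `!~`).
ALLOWED_METHODS = re.compile(r"^(GET|HEAD|POST)$")

# The six spider names the post says must not be blocked.
GOOD_SPIDERS = ["Baiduspider", "Googlebot", "bingbot",
                "Sogou web spider", "360Spider", "YisouSpider"]


def simulated_status(user_agent, method="GET"):
    """Return 403 if any of the three rules would fire, else 200."""
    if TOOLS.search(user_agent) or BLOCKED_UA.search(user_agent):
        return 403
    if not ALLOWED_METHODS.search(method):
        return 403
    return 200


if __name__ == "__main__":
    for name in GOOD_SPIDERS + ["Scrapy/2.11", "curl/8.4.0", ""]:
        print(repr(name), simulated_status(name))
```

An offline check like this only covers the regular expressions. Requesting the live site with a custom User-Agent header is still needed to confirm that the firewall and nginx configuration behave the same way.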



