Identifying User Agents to Block Web Crawlers and Prevent Scraping

    Overview · Site-building knowledge · 2023-12-15 21:25:09

      Ever since I started running websites, crawlers that mass-harvest our content have been a constant problem, and fending off scrapers is a long-term task. I covered one approach in a blog post five years ago: 《Apache中设置屏蔽IP地址和URL网址来禁止采集》 (blocking by IP address and URL in Apache to stop scraping). In addition, you can identify and block some scrapers by their User Agent. An example of the Apache configuration:

      RewriteCond %{HTTP_USER_AGENT} ^(.*)(DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)(.*)$

      RewriteRule .* - [F,L]
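      Before deploying a rule like this, it can help to test the pattern outside Apache. A minimal Python sketch of the same substring check (the is_blocked helper and the sample UA strings are illustrative, not part of the Apache configuration):

```python
import re

# Same alternation as the RewriteCond above; Apache applies the regex
# unanchored, so ^(.*)...(.*)$ reduces to a plain substring match.
BAD_UA = re.compile(r"DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot")

def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent would be caught by the RewriteRule [F]."""
    return BAD_UA.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; SemrushBot/7~bl)"))   # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))  # False
```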

      Code to block an empty User Agent:

      RewriteCond %{HTTP_USER_AGENT} ^$

      RewriteRule .* - [F]

      Code to block requests whose Referer and User Agent are both empty:

      RewriteCond %{HTTP_REFERER} ^$ [NC]

      RewriteCond %{HTTP_USER_AGENT} ^$ [NC]

      RewriteRule .* - [F]
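      Consecutive RewriteCond lines are ANDed by default, so the rule above fires only when both headers match ^$. The same check in a hypothetical Python form (reject_anonymous is illustrative):

```python
def reject_anonymous(headers: dict) -> bool:
    """True when Referer and User-Agent are both empty or absent,
    mirroring the two chained RewriteCond ^$ tests (ANDed by default)."""
    referer = headers.get("Referer", "")
    ua = headers.get("User-Agent", "")
    return referer == "" and ua == ""

print(reject_anonymous({}))                            # True
print(reject_anonymous({"User-Agent": "curl/8.5.0"}))  # False
```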

      Below, for reference, are User Agent keywords that identify some common scraping tools and robot crawlers you may want to block:

      User-Agent

      DTS Agent

      HttpClient

      Owlin

      Kazehakase

      Creative AutoUpdate

      HTTrack

      YisouSpider

      baiduboxapp

      Python-urllib

      python-requests

      SemrushBot

      SearchmetricsBot

      MegaIndex

      Scrapy

      EMail Exractor

      007ac9

      ltx71

      Others also worth considering for blocking:

      Mail.RU_Bot (http://go.mail.ru/help/robots)

      Feedly

      ZumBot

      Pcore-HTTP

      Daum

      your-server

      Mobile/12A4345d

      PhantomJS/2.1.1

      archive.org_bot

      AcooBrowser

      Go-http-client

      Jakarta Commons-HttpClient

      Apache-HttpClient

      BDCbot

      ECCP

      Nutch

      cr4nk

      MJ12bot

      MOT-MPx220

      Y!OASIS/TEST

      libwww-perl

      Signatures of mainstream search engines that should generally not be blocked:

      Google

      Baidu

      Yahoo

      Slurp

      yandex

      YandexBot

      MSN

      Some common browser or generic tokens that should also not be blocked lightly:

      FireFox

      Apple

      PC

      Chrome

      Microsoft

      Android

      Mail

      Windows

      Mozilla

      Safari

      Macintosh
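      One way to reconcile the two lists above is to check for known good-crawler tokens before applying any block keywords, so a broad pattern never catches a mainstream search engine. A hypothetical Python sketch (the verdict function and both keyword tuples are illustrative choices, not part of the Apache rules):

```python
# Illustrative keyword sets; a real deployment would use the full lists above.
GOOD_CRAWLERS = ("Googlebot", "Baiduspider", "YandexBot", "Slurp", "bingbot")
BLOCK = ("HTTrack", "SemrushBot", "MJ12bot", "Scrapy", "python-requests")

def verdict(ua: str) -> str:
    """Allowlist known search engines first, then apply the blocklist."""
    low = ua.lower()
    if any(k.lower() in low for k in GOOD_CRAWLERS):
        return "allow"   # mainstream search engines pass through
    if any(k.lower() in low for k in BLOCK):
        return "block"
    return "allow"

print(verdict("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # allow
print(verdict("Mozilla/5.0 (compatible; MJ12bot/v1.4.8)")) # block
```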

      Sometimes a scraper sets its own custom User Agent; after some analysis these can be blocked too, for example:

      RewriteCond %{HTTP_USER_AGENT} ^(.*)('Mozilla/5.0|'Mozilla'|'Moz'|'Mozil'|'(.+)'|Mobile/13G34|Chrome/53.0.2785.143)(.*)$

      RewriteRule .* - [F,L]

      Or combine HTTP_USER_AGENT with other factors in a joint check before blocking, for example:

      RewriteCond %{REQUEST_METHOD} POST

      RewriteCond %{HTTP_USER_AGENT} ^(.*)(Firefox/44.0|Safari/537.36)(.*)$

      RewriteCond %{REQUEST_URI} ^(.*)/comment/reply/(.*)$

      RewriteRule .* - [F,L]

      The rules above address a case of repeated POST comment submissions: identify the signature, then block on it.
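      Since chained RewriteCond directives are ANDed, the rule fires only when the method, User-Agent, and URI all match. The same logic in a hypothetical Python form (should_block is illustrative):

```python
import re

def should_block(method: str, ua: str, uri: str) -> bool:
    """All three conditions must hold, like consecutive RewriteCond lines."""
    return (method == "POST"
            and re.search(r"Firefox/44\.0|Safari/537\.36", ua) is not None
            and re.search(r"/comment/reply/", uri) is not None)

print(should_block("POST", "Mozilla/5.0 Firefox/44.0", "/node/1/comment/reply/2"))  # True
print(should_block("GET",  "Mozilla/5.0 Firefox/44.0", "/node/1/comment/reply/2"))  # False
```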

      Some other code found online, listed here for reference:

      RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms) [NC]

      RewriteRule ^(.*)$ - [F]

      Besides editing the .htaccess file, this can also be done in the httpd.conf configuration file:

      DocumentRoot /home/wwwroot/xxx

      <Directory "/home/wwwroot/xxx">

      SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT

      Order allow,deny

      Allow from all

      Deny from env=BADBOT

      </Directory>

      Apache must be restarted after this change. Blocking signatures listed by others:

      FeedDemon: content scraping

      BOT/0.1 (BOT for JCE): SQL injection

      CrawlDaddy: SQL injection

      Java: content scraping

      Jullo: content scraping

      Feedly: content scraping

      UniversalFeedParser: content scraping

      ApacheBench: CC (HTTP flood) attack tool

      Swiftbot: useless crawler

      YandexBot: useless crawler

      AhrefsBot: useless crawler

      YisouSpider: useless crawler (since acquired by UC's Shenma Search; this spider can be let through!)

      MJ12bot: useless crawler

      ZmEu: phpMyAdmin vulnerability scanning

      WinHttp: scraping / CC attacks

      EasouSpider: useless crawler

      HttpClient: TCP attacks

      Microsoft URL Control: scanning

      YYSpider: useless crawler

      jaunty: WordPress brute-force scanner

      oBot: useless crawler

      Python-urllib: content scraping

      Indy Library: scanning

      FlightDeckReports Bot: useless crawler

      Linguee Bot: useless crawler

      Further additions:

      WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot

      Also:

      Aboundex

      80legs

      ^Java

      ^Cogentbot

      ^Alexibot

      ^asterias

      ^attach

      ^BackDoorBot

      ^BackWeb

      Bandit

      ^BatchFTP

      ^Bigfoot

      ^Black.Hole

      ^BlackWidow

      ^BlowFish

      ^BotALot

      Buddy

      ^BuiltBotTough

      ^Bullseye

      ^BunnySlippers

      ^Cegbfeieh

      ^CheeseBot

      ^CherryPicker

      ^ChinaClaw

      Collector

      Copier

      ^CopyRightCheck

      ^cosmos

      ^Crescent

      ^Custo

      ^AIBOT

      ^DISCo

      ^DIIbot

      ^DittoSpyder

      ^Download Demon

      ^Download Devil

      ^Download Wonder

      ^dragonfly

      ^Drip

      ^eCatch

      ^EasyDL

      ^ebingbong

      ^EirGrabber

      ^EmailCollector

      ^EmailSiphon

      ^EmailWolf

      ^EroCrawler

      ^Exabot

      ^Express WebPictures

      Extractor

      ^EyeNetIE

      ^Foobot

      ^flunky

      ^FrontPage

      ^Go-Ahead-Got-It

      ^gotit

      ^GrabNet

      ^Grafula

      ^Harvest

      ^hloader

      ^HMView

      ^HTTrack

      ^humanlinks

      ^IlseBot

      ^Image Stripper

      ^Image Sucker

      Indy Library

      ^InfoNaviRobot

      ^InfoTekies

      ^Intelliseek

      ^InterGET

      ^Internet Ninja

      ^Iria

      ^Jakarta

      ^JennyBot

      ^JetCar

      ^JOC

      ^JustView

      ^Jyxobot

      ^Kenjin.Spider

      ^Keyword.Density

      ^larbin

      ^LexiBot

      ^lftp

      ^libWeb/clsHTTP

      ^likse

      ^LinkextractorPro

      ^LinkScan/8.1a.Unix

      ^LNSpiderguy

      ^LinkWalker

      ^lwp-trivial

      ^LWP::Simple

      ^Magnet

      ^Mag-Net

      ^MarkWatch

      ^Mass Downloader

      ^Mata.Hari

      ^Memo

      ^Microsoft.URL

      ^Microsoft URL Control

      ^MIDown tool

      ^MIIxpc

      ^Mirror

      ^Missigua Locator

      ^Mister PiX

      ^moget

      ^Mozilla/3.Mozilla/2.01

      ^Mozilla.*NEWT

      ^NAMEPROTECT

      ^Navroad

      ^NearSite

      ^NetAnts

      ^Netcraft

      ^NetMechanic

      ^NetSpider

      ^Net Vampire

      ^NetZIP

      ^NextGenSearchBot

      ^NG

      ^NICErsPRO

      ^niki-bot

      ^NimbleCrawler

      ^Ninja

      ^NPbot

      ^Octopus

      ^Offline Explorer

      ^Offline Navigator

      ^Openfind

      ^OutfoxBot

      ^PageGrabber

      ^Papa Foto

      ^pavuk

      ^pcBrowser

      ^PHP version tracker

      ^Pockey

      ^ProPowerBot/2.14

      ^ProWebWalker

      ^psbot

      ^Pump

      ^QueryN.Metasearch

      ^RealDownload

      Reaper

      Recorder

      ^ReGet

      ^RepoMonkey

      ^RMA

      Siphon

      ^SiteSnagger

      ^SlySearch

      ^SmartDownload

      ^Snake

      ^Snapbot

      ^Snoopy

      ^sogou

      ^SpaceBison

      ^SpankBot

      ^spanner

      ^Sqworm

      Stripper

      Sucker

      ^SuperBot

      ^SuperHTTP

      ^Surfbot

      ^suzuran

      ^Szukacz/1.4

      ^tAkeOut

      ^Teleport

      ^Telesoft

      ^TurnitinBot/1.5

      ^The.Intraformant

      ^TheNomad

      ^TightTwatBot

      ^Titan

      ^True_Robot

      ^turingos

      ^TurnitinBot

      ^URLy.Warning

      ^Vacuum

      ^VCI

      ^VoidEYE

      ^Web Image Collector

      ^Web Sucker

      ^WebAuto

      ^WebBandit

      ^Webclipping.com

      ^WebCopier

      ^WebEMailExtrac.*

      ^WebEnhancer

      ^WebFetch

      ^WebGo IS

      ^Web.Image.Collector

      ^WebLeacher

      ^WebmasterWorldForumBot

      ^WebReaper

      ^WebSauger

      ^Website eXtractor

      ^Website Quester

      ^Webster

      ^WebStripper

      ^WebWhacker

      ^WebZIP

      Whacker

      ^Widow

      ^WISENutbot

      ^WWWOFFLE

      ^WWW-Collector-E

      ^Xaldon

      ^Xenu

      ^Zeus

      ZmEu

      ^Zyborg

      Acunetix

      FHscan

      Code for a temporary block (returning a 503 error) rather than a permanent one:

      RewriteCond %{HTTP_USER_AGENT} ^.*(bot|crawl|spider).*$ [NC]

      RewriteCond %{REQUEST_URI} !^/robots\.txt$

      RewriteRule .* - [R=503,L]
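      Unlike [F] (403 Forbidden), [R=503] signals a temporary condition that well-behaved crawlers will retry later, and exempting robots.txt lets them still read the crawl policy. The condition pair can be sketched in Python (responds_503 is a hypothetical helper):

```python
import re

# Mirrors the (bot|crawl|spider) pattern with the [NC] (case-insensitive) flag.
BOTISH = re.compile(r"bot|crawl|spider", re.IGNORECASE)

def responds_503(ua: str, uri: str) -> bool:
    """True when the UA looks bot-like and the request is not for robots.txt."""
    return bool(BOTISH.search(ua)) and uri != "/robots.txt"

print(responds_503("Mozilla/5.0 (compatible; Googlebot/2.1)", "/page"))        # True
print(responds_503("Mozilla/5.0 (compatible; Googlebot/2.1)", "/robots.txt"))  # False
```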
