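Ever since we started running websites, the large number of crawlers that automatically scrape our content has been a problem, and defending against scraping is a long-term task. I covered part of it in a blog post five years ago, "Blocking IP Addresses and URLs in Apache to Prevent Scraping". In addition, some scrapers can be identified and blocked by their User Agent. An example of the Apache configuration: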
RewriteCond %{HTTP_USER_AGENT} (DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)
RewriteRule .* - [F,L]
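These rules only take effect once the rewrite engine is switched on. A minimal sketch of a complete .htaccess fragment, assuming mod_rewrite is loaded and the server allows .htaccess overrides for it (the shortened keyword list is only for illustration):

```apache
# Sketch only: assumes mod_rewrite is available and overrides are permitted.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Answer 403 Forbidden when the User Agent contains any listed keyword.
    # [NC] makes the match case-insensitive; [F] already stops further rewriting.
    RewriteCond %{HTTP_USER_AGENT} (HTTrack|YisouSpider|SemrushBot) [NC]
    RewriteRule .* - [F]
</IfModule>
```

The pattern is unanchored, so it matches the keyword anywhere in the header without needing a leading or trailing (.*).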
Code to block requests with an empty User Agent:
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]
Code to block requests where both the Referer and the User Agent are empty:
RewriteCond %{HTTP_REFERER} ^$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^$ [NC]
RewriteRule .* - [F]
Below are characteristic User Agent keywords of some common scraping tools and bots that can be blocked, listed for reference:
User-Agent
DTS Agent
HttpClient
Owlin
Kazehakase
Creative AutoUpdate
HTTrack
YisouSpider
baiduboxapp
Python-urllib
python-requests
SemrushBot
SearchmetricsBot
MegaIndex
Scrapy
EMail Exractor
007ac9
ltx71
Others that could also be considered for blocking:
Mail.RU_Bot: http://go.mail.ru/help/robots
Feedly
ZumBot
Pcore-HTTP
Daum
your-server
Mobile/12A4345d
PhantomJS/2.1.1
archive.org_bot
AcooBrowser
Go-http-client
Jakarta Commons-HttpClient
Apache-HttpClient
BDCbot
ECCP
Nutch
cr4nk
MJ12bot
MOT-MPx220
Y!OASIS/TEST
libwww-perl
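Keywords from the lists above can be combined into rule conditions, but entries containing spaces or regex metacharacters need some care. A minimal sketch, assuming the rules live in .htaccess and using an arbitrary selection of the keywords:

```apache
RewriteEngine On
# The pattern is double-quoted so the space in "DTS Agent" stays inside it.
RewriteCond %{HTTP_USER_AGENT} "(DTS Agent|Python-urllib|python-requests|Scrapy|MegaIndex)" [NC,OR]
# Dots are escaped so they match a literal "." rather than any character.
RewriteCond %{HTTP_USER_AGENT} "(Mail\.RU_Bot|PhantomJS/2\.1\.1|libwww-perl)" [NC]
RewriteRule .* - [F]
```

Splitting a long list across several conditions joined with [OR] keeps each line readable. Without [OR], consecutive conditions are combined with AND.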
Signatures of mainstream search engines that generally should not be blocked:
Baidu
Yahoo
Slurp
yandex
YandexBot
MSN
Some common browser or generic tokens should not be blocked lightly either:
Firefox
Apple
PC
Chrome
Microsoft
Android
Windows
Mozilla
Safari
Macintosh
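One way to respect such a do-not-block list is to put a negated condition in front of a broader blocking rule. A rough sketch, assuming the generic keywords crawl and spider are what you want to block and that Baidu, Slurp, and YandexBot are the crawlers you want to let through:

```apache
RewriteEngine On
# Skip the block when the User Agent looks like a mainstream search engine.
RewriteCond %{HTTP_USER_AGENT} !(Baidu|Slurp|YandexBot) [NC]
# Otherwise block anything that identifies itself as a generic crawler.
RewriteCond %{HTTP_USER_AGENT} (crawl|spider|Scrapy) [NC]
RewriteRule .* - [F]
```

Since the User Agent header is trivial to forge, this only exempts clients that claim to be a search engine. It does not verify that they are one.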
Sometimes a scraper sets its own distinctive User Agent, which can also be blocked once you have analyzed it, for example:
RewriteCond %{HTTP_USER_AGENT} ^(.*)('Mozilla/5.0|'Mozilla'|'Moz'|'Mozil'|'(.+)'|Mobile/13G34|Chrome/53.0.2785.143)(.*)$
RewriteRule .* - [F,L]
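Analyzing scraper traffic is easier when the User Agent and Referer are written to a log. A minimal sketch for the server configuration (these directives are not allowed in .htaccess); the format nickname ua_audit and the log path are hypothetical:

```apache
# Hypothetical nickname and path; adjust both to the local setup.
# Logs client address, time, request line, status, Referer, and User Agent.
LogFormat "%h %t \"%r\" %>s \"%{Referer}i\" \"%{User-Agent}i\"" ua_audit
CustomLog "logs/ua_audit.log" ua_audit
```

Unusual strings that recur in the last field, such as a User Agent wrapped in stray quotes, are candidates for a rule like the one above.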
Alternatively, combine HTTP_USER_AGENT with other factors to detect and block a client, for example:
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{HTTP_USER_AGENT} ^(.*)(Firefox/44.0|Safari/537.36)(.*)$
RewriteCond %{REQUEST_URI} ^(.*)/comment/reply/(.*)$
RewriteRule .* - [F,L]
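The rule above handles a case of repeated POST comment submissions, blocking on the characteristics observed.
I also found some other code online, listed here for reference: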
RewriteCond %{HTTP_USER_AGENT} "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" [NC]
RewriteRule ^(.*)$ - [F]
Besides editing the .htaccess file, the same thing can be done in the httpd.conf configuration file:
DocumentRoot /home/wwwroot/xxx
SetEnvIfNoCase User-Agent "(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
<Directory "/home/wwwroot/xxx">
    Order allow,deny
    Allow from all
    Deny from env=BADBOT
</Directory>
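Order, Allow, and Deny are Apache 2.2 syntax; on Apache 2.4 they are only available through mod_access_compat. A minimal sketch of how the same block might look with the 2.4 authorization directives, assuming mod_setenvif and mod_authz_core are loaded and using a shortened keyword list for readability:

```apache
# Mark matching requests with the BADBOT environment variable.
SetEnvIfNoCase User-Agent "(FeedDemon|Indy Library|AhrefsBot|MJ12bot)" BADBOT

<Directory "/home/wwwroot/xxx">
    # Every condition inside RequireAll must hold:
    # allow everyone, except requests marked BADBOT.
    <RequireAll>
        Require all granted
        Require not env BADBOT
    </RequireAll>
</Directory>
```

A negated Require cannot authorize a request on its own, which is why it is paired with Require all granted inside RequireAll.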
Apache must be restarted after this change. Signatures that others have listed as worth blocking:
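FeedDemon: content scraping
BOT/0.1 (BOT for JCE): SQL injection
CrawlDaddy: SQL injection
Java: content scraping
Jullo: content scraping
Feedly: content scraping
UniversalFeedParser: content scraping
ApacheBench: CC (HTTP flood) attack tool
Swiftbot: useless crawler
YandexBot: useless crawler
AhrefsBot: useless crawler
YisouSpider: useless crawler (since acquired by UC's Shenma Search, so this spider can be allowed)
MJ12bot: useless crawler
ZmEu: phpMyAdmin vulnerability scanning
WinHttp: scraping, CC attacks
EasouSpider: useless crawler
HttpClient: TCP attacks
Microsoft URL Control: scanning
YYSpider: useless crawler
jaunty: WordPress brute-force scanner
oBot: useless crawler
Python-urllib: content scraping
Indy Library: scanning
FlightDeckReports Bot: useless crawler
Linguee Bot: useless crawler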
Further additions:
WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot
And more:
Aboundex
80legs
^Java
^Cogentbot
^Alexibot
^asterias
^attach
^BackDoorBot
^BackWeb
Bandit
^BatchFTP
^Bigfoot
^Black.Hole
^BlackWidow
^BlowFish
^BotALot
Buddy
^BuiltBotTough
^Bullseye
^BunnySlippers
^Cegbfeieh
^CheeseBot
^CherryPicker
^ChinaClaw
Collector
Copier
^CopyRightCheck
^cosmos
^Crescent
^Custo
^AIBOT
^DISCo
^DIIbot
^DittoSpyder
^Download Demon
^Download Devil
^Download Wonder
^dragonfly
^Drip
^eCatch
^EasyDL
^ebingbong
^EirGrabber
^EmailCollector
^EmailSiphon
^EmailWolf
^EroCrawler
^Exabot
^Express WebPictures
Extractor
^EyeNetIE
^Foobot
^flunky
^FrontPage
^Go-Ahead-Got-It
^gotit
^GrabNet
^Grafula
^Harvest
^hloader
^HMView
^HTTrack
^humanlinks
^IlseBot
^Image Stripper
^Image Sucker
Indy Library
^InfoNaviRobot
^InfoTekies
^Intelliseek
^InterGET
^Internet Ninja
^Iria
^Jakarta
^JennyBot
^JetCar
^JOC
^JustView
^Jyxobot
^Kenjin.Spider
^Keyword.Density
^larbin
^LexiBot
^lftp
^libWeb/clsHTTP
^likse
^LinkextractorPro
^LinkScan/8.1a.Unix
^LNSpiderguy
^LinkWalker
^lwp-trivial
^LWP::Simple
^Magnet
^Mag-Net
^MarkWatch
^Mass Downloader
^Mata.Hari
^Memo
^Microsoft.URL
^Microsoft URL Control
^MIDown tool
^MIIxpc
^Mirror
^Missigua Locator
^Mister PiX
^moget
^Mozilla/3.Mozilla/2.01
^Mozilla.*NEWT
^NAMEPROTECT
^Navroad
^NearSite
^NetAnts
^Netcraft
^NetMechanic
^NetSpider
^Net Vampire
^NetZIP
^NextGenSearchBot
^NG
^NICErsPRO
^niki-bot
^NimbleCrawler
^Ninja
^NPbot
^Octopus
^Offline Explorer
^Offline Navigator
^Openfind
^OutfoxBot
^PageGrabber
^Papa Foto
^pavuk
^pcBrowser
^PHP version tracker
^Pockey
^ProPowerBot/2.14
^ProWebWalker
^psbot
^Pump
^QueryN.Metasearch
^RealDownload
Reaper
Recorder
^ReGet
^RepoMonkey
^RMA
Siphon
^SiteSnagger
^SlySearch
^SmartDownload
^Snake
^Snapbot
^Snoopy
^sogou
^SpaceBison
^SpankBot
^spanner
^Sqworm
Stripper
Sucker
^SuperBot
^SuperHTTP
^Surfbot
^suzuran
^Szukacz/1.4
^tAkeOut
^Teleport
^Telesoft
^TurnitinBot/1.5
^The.Intraformant
^TheNomad
^TightTwatBot
^Titan
^True_Robot
^turingos
^TurnitinBot
^URLy.Warning
^Vacuum
^VCI
^VoidEYE
^Web Image Collector
^Web Sucker
^WebAuto
^WebBandit
^Webclipping.com
^WebCopier
^WebEMailExtrac.*
^WebEnhancer
^WebFetch
^WebGo IS
^Web.Image.Collector
^WebLeacher
^WebmasterWorldForumBot
^WebReaper
^WebSauger
^Website eXtractor
^Website Quester
^Webster
^WebStripper
^WebWhacker
^WebZIP
Whacker
^Widow
^WISENutbot
^WWWOFFLE
^WWW-Collector-E
^Xaldon
^Xenu
^Zeus
ZmEu
^Zyborg
Acunetix
FHscan
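In the list above, entries starting with ^ are regular expressions anchored to the beginning of the User Agent, while the others match anywhere in it. A minimal sketch of how a few of them might be fed into the BADBOT variable from the earlier httpd.conf example, one directive per entry (the selection is arbitrary):

```apache
# One directive per keyword keeps a long list easy to maintain and comment.
# Anchored entries match only at the start of the User Agent.
SetEnvIfNoCase User-Agent "^BlackWidow" BADBOT
SetEnvIfNoCase User-Agent "^Download Demon" BADBOT
SetEnvIfNoCase User-Agent "^WebZIP" BADBOT
# Unanchored entries match anywhere in the User Agent.
SetEnvIfNoCase User-Agent "Indy Library" BADBOT
SetEnvIfNoCase User-Agent "ZmEu" BADBOT
```

Requests that set BADBOT can then be denied in the same way as in the httpd.conf example. Quoting each pattern keeps entries with spaces, such as Download Demon, in one piece.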
Code for a temporary block (returning a 503 error) rather than a long-term block:
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [R=503,L]