Nginx禁止爬虫访问的方法
if ($http_user_agent ~* "Scrapy|Baiduspider|Curl|HttpClient|Bytespider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser |ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSp ider|Ezooms|^$"){ return 403; }
如需跳转其他页面,配置如下:
if ($http_user_agent ~* "Scrapy|Baiduspider|Curl|HttpClient|Bytespider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser |ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSp ider|Ezooms|^$") { return 301 https://yoursite.com; }
如需禁止特定来源用户,配置如下:
if ($http_referer ~ "baidu\.com|google\.net|bing\.com") { return 403; }
如需仅允许GET,HEAD和POST请求,配置如下:
#fbrbidden not GET|HEAD|POST method access if ($request_method !~ ^(GET|HEAD|POST)$) { return 403; }
Apache禁用爬虫的配置
mod_rewrite模块确定开启的前提下,在.htaccess文件或者相应的.conf文件,添加以下内容:
RewriteEngine On RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms) [NC] RewriteRule . - [R=403,L]