Identify Known Crawlers/Bots in AWS Load Balancer (ALB or ELB) Logs
I figured I'd share this query, since I found it helpful for seeing how many requests come from known crawlers over a period of time. The query is timesliced so the results can be visualized as a stacked Area Chart. It assumes that your user_agent field is already parsed by a Field Extraction Rule. I also have this posted as a Gist here: https://gist.github.com/austinorth/078079924eb467fc0eef64c3efbad2fd
_sourceCategory=your/ALB/logs ("Googlebot" OR "AskJeeves" OR "Digger" OR "Lycos"
OR "msnbot" OR "Inktomi Slurp" OR "Yahoo" OR "Nutch" OR "bingbot" OR
"BingPreview" OR "Mediapartners-Google" OR "proximic" OR "AhrefsBot" OR
"AdsBot-Google" OR "Ezooms" OR "AddThis.com" OR "facebookexternalhit" OR
"MetaURI" OR "Feedfetcher-Google" OR "PaperLiBot" OR "TweetmemeBot" OR
"Sogou web spider" OR "GoogleProducer" OR "RockmeltEmbedder" OR
"ShareThisFetcher" OR "YandexBot" OR "rogerbot-crawler" OR "ShowyouBot" OR
"Baiduspider" OR "Sosospider" OR "Exabot" OR "www.comscore.com/Web-Crawler" OR "mj12bot.com"
OR "GrapeshotCrawler" OR "NTENTbot" OR "BLEXBot" OR "Clickagy Intelligence Bot v2" OR "Applebot"
OR "AmazonAdBot" OR "SemrushBot" OR "Cliqzbot" OR "alexa site audit" OR "MailChimp" OR "Yeti" OR "PiplBot"
OR "DotBot" OR "DuckDuckBot")
| parse regex field=user_agent "(?<bot_name>facebook)externalhit?\W+" nodrop
| parse regex field=user_agent "Feedfetcher-(?<bot_name>Google?)\S+" nodrop
| parse regex field=user_agent "(?<bot_name>PaperLiBot?)/.+" nodrop
| parse regex field=user_agent "(?<bot_name>TweetmemeBot?)/.+" nodrop
| parse regex field=user_agent "(?<bot_name>msn?)bot\W" nodrop
| parse regex field=user_agent "(?<bot_name>Nutch?)-.+" nodrop
| parse regex field=user_agent "(?<bot_name>Google?)bot\W" nodrop
| parse regex field=user_agent "Feedfetcher-(?<bot_name>Google?)\W" nodrop
| parse regex field=user_agent "(?<bot_name>Yahoo?)!\s+Slurp[;/].+" nodrop
| parse regex field=user_agent "(?<bot_name>bing?)bot\W" nodrop
| parse regex field=user_agent "(?<bot_name>Bing?)Preview\W" nodrop
| parse regex field=user_agent "(?<bot_name>Sogou?)\s+web\s" nodrop
| parse regex field=user_agent "(?<bot_name>Yandex?)Bot\W" nodrop
| parse regex field=user_agent "(?<bot_name>rogerbot?)\W" nodrop
| parse regex field=user_agent "(?<bot_name>AddThis\.com?)\s+robot\s+" nodrop
| parse regex field=user_agent "(?<bot_name>ShareThis?)Fetcher/.+" nodrop
| parse regex field=user_agent "(?<bot_name>Ahrefs?)Bot/.+" nodrop
| parse regex field=user_agent "(?<bot_name>DuckDuck?)Bot/.+" nodrop
| parse regex field=user_agent "(?<bot_name>MetaURI?)\s+API/.+" nodrop
| parse regex field=user_agent "(?<bot_name>Showyou?)Bot\s+" nodrop
| parse regex field=user_agent "(?<bot_name>Google?)Producer;" nodrop
| parse regex field=user_agent "(?<bot_name>Ezooms?)\W" nodrop
| parse regex field=user_agent "(?<bot_name>Rockmelt?)Embedder\s+" nodrop
| parse regex field=user_agent "(?<bot_name>Sosospider?)\W" nodrop
| parse regex field=user_agent "(?<bot_name>Baidu?)spider" nodrop
| parse regex field=user_agent "(?<bot_name>Exabot?)\W" nodrop
| parse regex field=user_agent "(?<bot_name>www\.comscore\.com/Web-Crawler)\W" nodrop
| parse regex field=user_agent "(?<bot_name>mj12bot\.com)\W" nodrop
| parse regex field=user_agent "(?<bot_name>GrapeshotCrawler)\W" nodrop
| parse regex field=user_agent "(?<bot_name>NTENTbot)\W" nodrop
| parse regex field=user_agent "(?<bot_name>BLEXBot)\W" nodrop
| parse regex field=user_agent "(?<bot_name>Clickagy?) Intelligence Bot v2" nodrop
| parse regex field=user_agent "(?<bot_name>Applebot)\W" nodrop
| parse regex field=user_agent "(?<bot_name>AmazonAdBot)\W" nodrop
| parse regex field=user_agent "(?<bot_name>SemrushBot)\W" nodrop
| parse regex field=user_agent "(?<bot_name>Cliqzbot)\W" nodrop
| parse regex field=user_agent "(?<bot_name>alexa site audit)\W" nodrop
| parse regex field=user_agent "(?<bot_name>MailChimp)\W" nodrop
| parse regex field=user_agent "(?<bot_name>Yeti)\W" nodrop
| parse regex field=user_agent "(?<bot_name>PiplBot)\W" nodrop
| parse regex field=user_agent "(?<bot_name>DotBot)\W" nodrop
| where bot_name != ""
| if (bot_name="bing","Bing",bot_name) as bot_name
| timeslice 5m
| count as hits by bot_name, _timeslice
| transpose row _timeslice column bot_name
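If you only need totals per bot for the whole time range (for a table or pie chart, say) rather than a time series, one possible variation is to drop the timeslice and transpose steps and use a plain count. This is a rough sketch rather than part of the original query; it assumes the same bot_name field and hits alias used above, and the lines after the parse statements would become:

```
| where bot_name != ""
| if (bot_name="bing","Bing",bot_name) as bot_name
| count as hits by bot_name
| sort by hits
```

The search keywords and parse lines stay unchanged.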