Detecting collector failure
Occasionally, our collectors stop ... collecting. The processes are running, but (afaik) there's nothing incriminating in the logs. Does anyone have any best practices for detecting and alerting on the condition where logs are expected from a given source but not received?
Hi Chad,

We only have a few dozen servers to watch, so this approach has worked for me. I've created a dashboard that lists the last record received from each _sourceHost. I generally know how many servers we're monitoring, so the query could be changed to count the _sourceHosts that a record has been received from in the last hour, and send an email if the count falls below the expected number. The query is:

_sourceHost=Prod* | max(_messagetime) as time by _sourcehost | formatDate(fromMillis(toLong(time)),"MM-dd-yyyy HH:mm:ss:SSS", "America/New_York") as time | sort by +_sourceHost

Regards,
Roy
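
The count-based variant described above might look like the sketch below. It assumes the search is saved as a scheduled search over the last hour that alerts when the number of results is greater than 0. The alias reporting_hosts and the expected count of 24 are placeholders, not values from this thread.

```
_sourceHost=Prod*
// count the distinct hosts seen in the search time range (assumed: last hour)
| count_distinct(_sourceHost) as reporting_hosts
// 24 is a placeholder for the number of hosts expected to report
| where reporting_hosts < 24
```

A row comes back only when fewer hosts than expected have reported, so any result means at least one host has gone quiet. If no Prod* logs arrive at all, the search may return no rows, so that case would likely need its own alert.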
Thanks, Roy. This is great.

We have ephemeral/elastic infrastructure, so the host approach wasn't an exact fit. But we do have a set of critical services that we're monitoring (and a field extraction rule that defines "service" and "environment" fields). So I ended up modifying the query like so:

* | where service in ("foo","bar","bat","baz","baa") | where environment="production" | max(_messagetime) as time by service | formatDate(fromMillis(toLong(time)),"MM-dd-yyyy HH:mm:ss:SSS", "America/Los_Angeles") as time | sort by +service

This approach also provides a deterministic count of services to compare against (5 in this example).

Thanks again,
Chad
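
To see which service has gone quiet, rather than only how many are reporting, a sketch along these lines could work. It assumes a search time range longer than the silence threshold (for example, the last 24 hours), so that a service which recently stopped reporting still appears in the results. The one-hour threshold (3600000 ms) and the alias last_seen are placeholders.

```
* | where service in ("foo","bar","bat","baz","baa")
| where environment="production"
| max(_messagetime) as last_seen by service
// keep only services whose newest message is more than one hour old
| where toLong(last_seen) < now() - 3600000
```

A service that sent nothing during the whole time range will not show up here at all, so this complements the fixed count of 5 and does not replace it.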