Detecting collector failure

Comments

  • Roy Gunter
    Hi Chad,

    We only have a few dozen servers to watch, so this approach has worked for me. I've created a dashboard that lists the last record received from each _sourceHost. I generally know how many servers we're monitoring, so the query could be changed to count the _sourceHosts that have sent a record in the last hour, and to send an email if that count falls below the expected number. The query is:

        _sourceHost=Prod* | max(_messagetime) as time by _sourcehost | formatDate(fromMillis(toLong(time)),"MM-dd-yyyy HH:mm:ss:SSS", "America/New_York") as time | sort by +_sourceHost

    Regards,
    Roy
  • Chad Nicely
    Thanks, Roy. This is great.

    We have ephemeral/elastic infrastructure, so the host approach wasn't an exact fit. But we do have a set of critical services that we're monitoring (and a field extraction rule that defines "service" and "environment" fields). So I ended up modifying the query like so:

        * | where service in ("foo","bar","bat","baz","baa") | where environment="production" | max(_messagetime) as time by service | formatDate(fromMillis(toLong(time)),"MM-dd-yyyy HH:mm:ss:SSS", "America/Los_Angeles") as time | sort by +service

    This approach also provides a deterministic count of services to compare against (5 in this example).

    Thanks again,
    Chad
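For anyone who wants to automate the comparison described in the two comments above, here is a minimal Python sketch of the check itself. It assumes you have already pulled the query results into a mapping of service name to last message time; how you fetch them is left out. The function name find_silent_services, the one-hour threshold, and the sample timestamps are illustrative assumptions, not part of either query.

```python
from datetime import datetime, timedelta, timezone

# Services expected to report (the five names from the query above).
EXPECTED_SERVICES = {"foo", "bar", "bat", "baz", "baa"}


def find_silent_services(last_seen, now, expected=EXPECTED_SERVICES,
                         max_age=timedelta(hours=1)):
    """Return the expected services with no record newer than max_age.

    last_seen maps service name -> datetime of its latest record.
    A service missing from last_seen counts as silent.
    """
    silent = []
    for service in sorted(expected):
        seen = last_seen.get(service)
        if seen is None or now - seen > max_age:
            silent.append(service)
    return silent


# Made-up sample data: "bat" is stale and "baa" never reported.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "foo": now - timedelta(minutes=5),
    "bar": now - timedelta(minutes=59),
    "bat": now - timedelta(hours=3),
    "baz": now - timedelta(minutes=1),
}

silent = find_silent_services(last_seen, now)
if silent:
    print("No recent records from:", ", ".join(silent))
```

Comparing against a fixed set of names, rather than only a count, also tells you which service went quiet, and it still flags a service that returns no rows at all.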
