We are heavy users of Chef at my company, but we've been having some intermittent issues where the Chef client hangs forever on a server. This is extremely bad for us, and I'm hoping to use Sumo Logic to monitor this situation so I can at least get alerted when this happens, and restart the Chef client by hand.
To do this, I've added Chef's output log to Sumo Logic. My thinking is that the most foolproof way to detect the client becoming a zombie is to check whether any of my servers has stopped writing new entries to the log. Seems simple enough. The only problem is that I have no idea how to write that query, or whether it is even possible. The query would ask something along the lines of:
"What are the hosts that have not had anything appended to the chef log in the last hour?"
Once this is written, I can then easily tell Sumo Logic to alert me when this query gives me more than zero results. I originally tried:
_sourceCategory=Grocery/Chef | count by _sourceHost | where _count = 0
...but this logic is wrong: a broken host has no log entries in the time range, so it never forms a group in the count, and no row with _count = 0 can ever appear in the results.
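To make the logic I'm after concrete, here is a rough Python sketch (not Sumo Logic syntax) of what I think the query would have to do: look back over a window longer than an hour, find the most recent entry per host, and report hosts whose latest entry is more than an hour old. The function name, host names, and timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta

def silent_hosts(events, now, threshold=timedelta(hours=1)):
    """Return hosts whose newest log entry is older than `threshold`.

    `events` is an iterable of (host, timestamp) pairs taken from a
    lookback window that is longer than the threshold (e.g. 24 hours).
    """
    last_seen = {}
    for host, ts in events:
        # Keep only the most recent timestamp seen for each host.
        if host not in last_seen or ts > last_seen[host]:
            last_seen[host] = ts
    return sorted(h for h, ts in last_seen.items() if now - ts > threshold)

# Hypothetical sample data: web-2 last logged three hours ago.
now = datetime(2024, 1, 1, 12, 0)
events = [
    ("web-1", now - timedelta(minutes=5)),
    ("web-2", now - timedelta(hours=3)),
    ("web-1", now - timedelta(hours=2)),
]
print(silent_hosts(events, now))  # ['web-2']
```

One limitation I can see with this approach: a host that has been silent for longer than the whole lookback window would still not show up, so the window would need to be comfortably longer than the alert threshold.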
Does anybody know how to do this?
Thanks in advance,