I'm wondering how to do some math on Telegraf-generated GPU (nvidia_smi) metrics in Sumo.
If I wanted to measure utilization over time as a percentage, what is the best way? I have added my own panel to your host app and am measuring memory and activity, but what I really want is to measure total utilization over time.
I can get a count of GPUs, let's say 4, so each hour of use should represent 240 GPU-minutes (4 x 60).
Could be I am barking up the wrong tree and thinking more in terms of log queries.
_sourceCategory=whatever/path/to/bucket host.name=some_host metric=nvidia_smi field=utilization_gpu | count by uuid | sum
Gets me 4.
Polling is every ~5 minutes (though it looks like I get 10 data points per hour, not 12).
But is there any way to do some math and variable assignments here? For example, the number of hours in a day against every active data point for each GPU? (Currently polling every 5 min, though it seems to report every 6, so 10 per hour. Lag?)
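To make the arithmetic concrete, here is a rough sketch in Python (not a Sumo query) of what I'm trying to compute. The function name, the sample layout, and the GPU ids are all made up for illustration. It assumes each data point is a (uuid, utilization_gpu) pair with utilization_gpu in the 0 to 100 range, and it averages whatever samples actually arrived per GPU, so getting 10 points per hour instead of 12 doesn't skew the result.

```python
from collections import defaultdict


def utilization_summary(samples, gpu_count, window_minutes):
    """Summarize GPU utilization over a time window.

    samples: iterable of (uuid, utilization_gpu) pairs, utilization in 0..100
             (hypothetical layout, not a Sumo or Telegraf data structure).
    gpu_count: number of GPUs on the host (e.g. 4).
    window_minutes: length of the window (e.g. 60 for one hour).
    """
    per_gpu = defaultdict(list)
    for uuid, util in samples:
        per_gpu[uuid].append(util)

    # Total capacity, e.g. 4 GPUs * 60 minutes = 240 GPU-minutes.
    capacity = gpu_count * window_minutes

    # Mean utilization per GPU, scaled to the window length.
    # A GPU with no samples contributes 0 busy minutes.
    busy = sum(
        (sum(values) / len(values)) / 100.0 * window_minutes
        for values in per_gpu.values()
    )

    return {
        "capacity_gpu_minutes": capacity,
        "busy_gpu_minutes": busy,
        "utilization_percent": 100.0 * busy / capacity,
    }


# Example: 4 GPUs, one hour, 10 data points each (made-up values).
samples = (
    [("gpu-a", 100)] * 10
    + [("gpu-b", 50)] * 10
    + [("gpu-c", 0)] * 10
    + [("gpu-d", 0)] * 10
)
result = utilization_summary(samples, gpu_count=4, window_minutes=60)
print(result)
```

With those made-up values, gpu-a counts as 60 busy minutes and gpu-b as 30, so 90 of 240 GPU-minutes. Something equivalent on the query side is what I'm after.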
I think the Telegraf doc section could use more extensive query examples, covering both your defined set of plugins from your sample confs and how to get at the drilled-down data, plus query-side examples for other plugins that could be configured, like nvidia_smi.
Cross-platform Sumo collector monitoring is tricky unless you use the right search pattern, the right search method (native, in the conf file), and the process executable name, e.g. "process.executable.name=java* pattern=sumo* ". Having some of that covered would have been great.
It is an incredibly useful plugin, but I feel like the platform gets in the way or doesn't give me enough to go on (the Telegraf git repo is awesome, but it's a you-and-it-together kind of thing).