2024-02-11

Prometheus notes

I’ve been working on $current_job’s monitoring stack for a few months now. It’s the first time I’ve worked with Prometheus, so it has been a learning experience.

Here are my thoughts about the product in general:

The fact that all the config is in plain text makes it easy to build incrementally.
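There is no good way to monitor information that is string-based and changes regularly: sample values are numeric, and a string stored in a label creates a new time series every time it changes. There is not a lot of information of this type that we’d like to monitor, but when we do encounter it, it’s not a good time.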
SNMP was hard to figure out, as it kind of always is, but we got there.
The built-in way of looking at alerts is by alert type, whereas I’m more used to looking at them by host (or instance, in Prometheus terms). I am currently thinking about making a page of all alerts that resembles the way Zabbix does it, since I can wrap my head around that layout a lot more easily.
It is fast, light on CPU, really well optimized, and robust.
You cannot choose different retention times per metric; there is only one dial for the whole database. Maybe there is a way to cheat this with a separate instance dedicated to long-term storage.
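
For the per-host alert page mentioned above, here is a rough sketch of the grouping step in Python. It assumes the JSON comes from Prometheus’s `/api/v1/alerts` HTTP endpoint and that each alert carries an `instance` label. The function name, the sample hosts and the alert names are all made up for illustration, and fetching the JSON and rendering the page are left out.

```python
from collections import defaultdict


def group_alerts_by_instance(payload):
    """Group alerts from an /api/v1/alerts response by their `instance` label.

    Returns a dict mapping instance -> sorted list of (alertname, state) tuples.
    The function name is hypothetical; it is not part of any Prometheus client.
    """
    grouped = defaultdict(list)
    for alert in payload.get("data", {}).get("alerts", []):
        labels = alert.get("labels", {})
        # Alerts built from aggregated expressions may carry no `instance` label.
        instance = labels.get("instance", "(no instance)")
        grouped[instance].append(
            (labels.get("alertname", "?"), alert.get("state", "?"))
        )
    # Sort hosts and their alerts so the page layout is stable between refreshes.
    return {inst: sorted(alerts) for inst, alerts in sorted(grouped.items())}


# Hand-written sample in the shape of an /api/v1/alerts response.
# Hosts and alert names are invented for this example.
SAMPLE = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "DiskFull", "instance": "web1:9100"}, "state": "firing"},
            {"labels": {"alertname": "HighLoad", "instance": "db1:9100"}, "state": "pending"},
            {"labels": {"alertname": "HighLoad", "instance": "web1:9100"}, "state": "firing"},
        ]
    },
}

if __name__ == "__main__":
    for instance, alerts in group_alerts_by_instance(SAMPLE).items():
        print(instance)
        for alertname, state in alerts:
            print(f"  {alertname} [{state}]")
```

Keeping the grouping separate from the fetching and rendering means it can be tried against a saved response before wiring it to a live server.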

Maybe I’ll edit this with more info as I go on.