Prometheus notes

I’ve been working on $current_job’s monitoring stack for a few months now. It’s my first time working with Prometheus, so it has been a learning experience.

Here is a list of thoughts I have about the product in general:

  • The fact that all the config is in plain text makes it easy to build incrementally.
  • There is no good way to monitor info that is string-based and regularly changing, since sample values are always numeric. There isn’t much info of this type that we’d like to monitor, but when we encounter it, it’s not a good time.
  • SNMP was hard to figure out; it kind of always is. But we got there.
  • The built-in way of looking at alerts is by alert type, whereas I’m more used to looking at them by host (or instance, in Prometheus terms). I am currently thinking about making a page of all alerts that resembles the way Zabbix does it, as I can wrap my head around that a lot more easily.
  • It is fast, not CPU-hungry, really well optimized, and robust.
  • You cannot choose different retention times per metric; there is only one dial for the whole database. Maybe there is a way to cheat with a separate instance dedicated to long-term storage, with its own retention setting.
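
For the alerts-by-host idea in the list above, here is a minimal sketch of how such a page could get its data. It reads the active alerts from the Prometheus HTTP API (/api/v1/alerts) and regroups them by their instance label. The URL, the function names and the "(no instance)" bucket are all assumptions made for illustration, not something from our actual setup, and alerts whose rules aggregate away the instance label will end up in that fallback bucket.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Assumption: Prometheus listens here; adjust to your own setup.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_alerts(base_url=PROMETHEUS_URL):
    """Fetch the active alerts from the Prometheus HTTP API as a dict."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=10) as resp:
        return json.load(resp)


def group_alerts_by_instance(payload):
    """Return {instance: [(alertname, state), ...]} from an /api/v1/alerts payload."""
    grouped = defaultdict(list)
    for alert in payload.get("data", {}).get("alerts", []):
        labels = alert.get("labels", {})
        # Not every alert carries an instance label, so keep a fallback bucket.
        instance = labels.get("instance", "(no instance)")
        grouped[instance].append(
            (labels.get("alertname", "?"), alert.get("state", "?"))
        )
    # Sort instances and their alerts so the output is stable between runs.
    return {inst: sorted(alerts) for inst, alerts in sorted(grouped.items())}


def render(grouped):
    """Render the grouping as plain text: one host per block, alerts indented."""
    lines = []
    for instance, alerts in grouped.items():
        lines.append(instance)
        for name, state in alerts:
            lines.append(f"  {name} [{state}]")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(group_alerts_by_instance(fetch_alerts())))
```

Running it against a live server prints one block per host with its pending and firing alerts underneath, which is roughly the per-host view I’m used to. A real page would probably also want the annotations and the activeAt timestamp from the same payload.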

Maybe I’ll edit this with more info as I go on.