RFI: Prometheus, Etc, Services - Optimum Number To Run

duluxoz <duluxoz@xxxxxxxxx> · Sat, 20 Jan 2024 16:42:20 +1100

Hi All,

With regard to the monitoring services on a Ceph cluster (i.e. Prometheus,
Grafana, Alertmanager, Loki, Node-Exporter, Promtail, etc.), how many
instances should/can we run for fault-tolerance purposes? I can't recall
seeing that advice anywhere in the doco (but of course, I probably
missed it).

I'm concerned about HA for those services: will they keep running if
the Ceph node they're on fails?

At the moment we're running only one instance of each in the cluster,
even though several Ceph nodes are capable of running each one, e.g.
three nodes configured in the placement but only count: 1.

This is on the latest version of Reef, deployed with cephadm (in case
that makes a huge difference :-) ).

So any advice would be greatly appreciated, including whether we should
be running any services not mentioned above (other than Mgr, Mon, OSD,
or iSCSI, obviously :-) ).

Cheers

Dulux-Oz