Quoting Denny Fuchs (linuxmail@xxxxxxxx):

> hi,
>
> > On 19.06.2018 at 17:17, Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:
> >
> > # ceph auth get client.icinga
> > exported keyring for client.icinga
> > [client.icinga]
> >         key = <nope>
> >         caps mgr = "allow r"
> >         caps mon = "allow r"
>
> that's the point: it's fine to check whether all processes are up and
> running, and maybe add some checks for the disks. But imagine you check
> the "health" state: it is the same on every OSD node, because ... it's
> a cluster. So if you run "ceph osd set noout" on one node, you get a
> warning for every OSD node (check_ceph_health). The same goes for every
> check that monitors a cluster-wide state, like df, or lost OSDs (70 in
> out of 72 ...). Most checks also produce performance data (which can be
> disabled), which is saved in a database. The same for Telegraf(*):
> every node transmits the same data (because the cluster data is the
> same on all nodes).

Just checking here: are you using the telegraf ceph plugin on the
nodes? In that case you _are_ duplicating data. But the good news is
that you don't need to. There is a Ceph mgr telegraf plugin now (mimic)
which also works on luminous:

http://docs.ceph.com/docs/master/mgr/telegraf/

You configure a listener ([[inputs.socket_listener]]) on the nodes where
you have ceph mgr running (probably the mons) and have the mgr plugin
send its data to the socket. The telegraf daemon picks it up and sends
it to influx (or whatever output you configured). As there is only one
active mgr, you don't have the issue of duplicated data, and the
solution is still HA. A rough sketch of both sides is at the bottom of
this mail.

We have this systemd override snippet for telegraf to make the socket
writable for ceph:

[Service]
ExecStartPre=/bin/sleep 5
ExecStartPost=/bin/chown ceph /tmp/telegraf.sock

> I also took a look at the Ceph mgr dashboard (for a few minutes), but
> it looks like I would have to enable it on all(?) OSD nodes and build
> some construct to reach the dashboard on whichever mgr is active.
>
> I don't believe I'm the first person thinking about a dedicated VM
> that is only used for the monitoring tools (Icinga / Zabbix / Nagios /
> Dashboard / ceph -s) and getting the overall status (and performance
> data) from there. The only checks I need to keep on the OSD nodes
> directly are the OSD (I/O) disk and network checks, but thanks to
> InfluxDB ... I can put them all on one dashboard :-)

On the Icinga host (or a satellite node) you check only the cluster
health. On the nodes you check only node-specific health. That way
there is no overlap in health checks. A keyring sketch for such a
dedicated monitoring host is appended below as well.

Gr. Stefan

--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
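
As promised, a rough sketch of the telegraf side. This is only a
sketch, not a drop-in config: the socket path /tmp/telegraf.sock
matches the override above, but adjust it (and your output section) to
your own setup. In /etc/telegraf/telegraf.conf (or a file under
/etc/telegraf/telegraf.d/) on the mgr/mon nodes:

# listen on a unix datagram socket for the ceph mgr telegraf module
[[inputs.socket_listener]]
  service_address = "unixgram:///tmp/telegraf.sock"
  # the mgr module sends influx line protocol
  data_format = "influx"

The [Service] snippet from above goes into a systemd drop-in, e.g. via
"systemctl edit telegraf" (which writes
/etc/systemd/system/telegraf.service.d/override.conf); after a
daemon-reload and a restart of telegraf, the socket is chowned to ceph
so the mgr can write to it.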
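
On the ceph side, the mgr telegraf module is enabled and pointed at
that socket. From memory (the docs page above has the authoritative
option names), something along these lines on a mon/mgr node:

# enable the telegraf mgr module and point it at the local socket
ceph mgr module enable telegraf
ceph telegraf config-set address unixgram:///tmp/telegraf.sock
# how often (in seconds) the active mgr pushes its stats
ceph telegraf config-set interval 10

Only the active mgr sends, so influx sees each measurement once even
though the listener runs on several nodes.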
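
And for the dedicated monitoring VM: a read-only key like the
client.icinga one quoted at the top is all it needs. Something like
this (the client name and output path are just examples):

# create a read-only key for the monitoring host
ceph auth get-or-create client.icinga mon 'allow r' mgr 'allow r' \
    -o /etc/ceph/ceph.client.icinga.keyring

# copy ceph.conf plus that keyring to the monitoring VM, then run the
# cluster-wide checks there:
ceph --id icinga health
ceph --id icinga status

The per-node checks (disk I/O, network, daemons running) stay on the
OSD nodes, so nothing overlaps.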