Re: separate monitoring node

Denny,

I should have mentioned this as well. Any ceph cluster-wide checks I am doing with Icinga are applied only to my 3 mon/mgr nodes. They would definitely be annoying if they were on all OSD nodes. Having the checks on all of the mons means I don't lose monitoring ability should one go down.

The ceph mgr dashboard is only enabled on the mgr daemons. I'm not familiar with the mimic dashboard yet, but it is much more advanced than the luminous dashboard and may have some alerting abilities built in.

With your PCI DSS restrictions a VM monitoring node may work well. I'd set up the VM with ceph-common, the ceph.conf and a restricted keyring, then have icinga2 run an NRPE check on it that calls check_ceph, ceph -s, or whatever.
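
Something along these lines (the key name, plugin path and NRPE command are just examples, adjust for your setup):

  # on an existing mon/admin node: create a read-only key for the monitoring VM
  ceph auth get-or-create client.icinga mon 'allow r' mgr 'allow r' \
      -o /etc/ceph/ceph.client.icinga.keyring

  # copy /etc/ceph/ceph.conf and that keyring to the VM, then test:
  ceph --id icinga -s

  # on the VM, an NRPE command definition, e.g. in /etc/nagios/nrpe.d/ceph.cfg,
  # pointing at whichever check_ceph plugin you use
  command[check_ceph_health]=/usr/lib/nagios/plugins/check_ceph_health --id icinga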

Kevin

On 06/19/2018 04:13 PM, Denny Fuchs wrote:
hi,

On 19.06.2018 at 17:17, Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:

# ceph auth get client.icinga
exported keyring for client.icinga
[client.icinga]
    key = <nope>
    caps mgr = "allow r"
    caps mon = "allow r"
that's the point: it's OK to check whether all processes are up and running, and maybe add some checks for the disks. But imagine you check the "health" state: the state is the same on all OSD nodes, because ... it's a cluster. So if you run "ceph osd set noout" on one node, you get a warning for every OSD node (check_ceph_health). The same goes for every check that monitors a cluster-wide value, like df or lost OSDs (70 in from 72 ...). Most checks also have performance data (which can be disabled), which is saved in a database.
The same for Telegraf(*): every node transmits the same data (because the cluster data is the same on all nodes).
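
If the Telegraf version is new enough, the ceph input can at least be told to skip the cluster-wide part on the OSD nodes, so only one host reports it; roughly like this in telegraf.conf (option names as in the upstream plugin docs, I haven't tested this on Proxmox):

  [[inputs.ceph]]
    ceph_binary = "/usr/bin/ceph"
    ceph_config = "/etc/ceph/ceph.conf"
    ceph_user = "client.admin"               # or a restricted key
    socket_dir = "/var/run/ceph"
    gather_admin_socket_stats = true         # per-daemon stats, keep on the OSD nodes
    gather_cluster_stats = false             # cluster-wide stats, enable on one node only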

I also took a look at the Ceph mgr dashboard (for a few minutes), which I would have to enable on all(?) OSD nodes, and then build a construct to get to the dashboard on the active mgr.
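
If it turns out it only needs to be enabled as a mgr module, the commands would roughly be (untested on my side, port may differ per release):

  ceph mgr module enable dashboard
  # ask the cluster where the dashboard of the currently active mgr is reachable
  ceph mgr services
  # -> { "dashboard": "http://<active-mgr-host>:7000/" }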

I don't believe that I'm the first person thinking about a dedicated VM which is only used for monitoring tools (Icinga / Zabbix / Nagios / Dashboard / ceph -s) and getting the overall status (and performance data) from it. The only checks I need to keep directly on the OSD nodes are the OSD disk (I/O) and network, but thanks to InfluxDB ... I can put them on one dashboard :-)

@Kevin nice work :-) Because of PCI DSS, the Icinga2 master can't reach the Ceph nodes directly, so we have a satellite / agent construct to get the checks executed.
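
Roughly the usual command_endpoint pattern, not our real config (names below are made up):

  object Host "ceph-monitor-vm" {
    address = "192.0.2.10"                  // hypothetical address
    vars.agent_endpoint = name              // checks run on the agent, not on the master
    vars.role = "ceph-monitoring"
  }

  apply Service "ceph-health" {
    check_command = "ceph_health"           // assumes a CheckCommand wrapping check_ceph_health
    command_endpoint = host.vars.agent_endpoint
    assign where host.vars.role == "ceph-monitoring"
  }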

cu denny

ps. One bad thing: Telegraf can't read the admin sockets under /var/run/ceph/ because of the permissions set after the OSD services start (https://github.com/influxdata/telegraf/issues/1657). This was fixed, but I haven't checked whether the patch is also included in the Proxmox Ceph packages.
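
In the meantime, one possible workaround (my assumption, not what the upstream patch does) is to grant the telegraf user access to the sockets by hand:

  # admin sockets are normally owned by ceph:ceph and not writable by telegraf
  ls -l /var/run/ceph/
  # grant an ACL on the existing sockets (has to be redone after a daemon restart,
  # because the socket file is recreated)
  setfacl -m u:telegraf:rw /var/run/ceph/*.asok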
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
