Re: separate monitoring node

Denny,

I should have mentioned this as well. Any ceph cluster-wide checks I am doing with Icinga are applied only to my 3 mon/mgr nodes. They would definitely be annoying if they were on all OSD nodes. Having the checks on all of the mons means I don't lose monitoring ability should one go down.

The ceph mgr dashboard is only enabled on the mgr daemons. I'm not familiar with the mimic dashboard yet, but it is much more advanced than the luminous dashboard and may have some alerting abilities built in.

With your PCI DSS restrictions a VM monitoring node may work well. I'd set up the VM with ceph-common, the ceph.conf and a restricted keyring, then have icinga2 run an NRPE check on it that calls check_ceph, ceph -s, or whatever.
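
Something along these lines (the key name, plugin path and NRPE command are just examples, adjust for your setup):

  # on an existing mon/admin node: create a read-only key for the monitoring VM
  ceph auth get-or-create client.icinga mon 'allow r' mgr 'allow r' \
      -o /etc/ceph/ceph.client.icinga.keyring

  # copy /etc/ceph/ceph.conf and that keyring to the VM, then test:
  ceph --id icinga -s

  # on the VM, an NRPE command definition, e.g. in /etc/nagios/nrpe.d/ceph.cfg,
  # pointing at whichever check_ceph plugin you use
  command[check_ceph_health]=/usr/lib/nagios/plugins/check_ceph_health --id icinga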

Kevin

On 06/19/2018 04:13 PM, Denny Fuchs wrote:
hi,

On 19.06.2018 at 17:17, Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:

# ceph auth get client.icinga
exported keyring for client.icinga
[client.icinga]
    key = <nope>
    caps mgr = "allow r"
    caps mon = "allow r"
that's the point: it's OK to check whether all processes are up and running, and maybe add some checks for the disks. But imagine you check the "health" state: the state is the same on all OSD nodes, because ... it's a cluster. So if you run "ceph osd set noout" on one node, you get a warning for every OSD node (check_ceph_health). The same goes for every check that monitors a cluster-wide value, like df or lost OSDs (70 in from 72 ...). Most checks also have performance data (which can be disabled), which is saved in a database.
The same for Telegraf(*): every node transmits the same data (because the cluster data is the same on all nodes).
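
If the Telegraf version is new enough, the ceph input can at least be told to skip the cluster-wide part on the OSD nodes, so only one host reports it; roughly like this in telegraf.conf (option names as in the upstream plugin docs, I haven't tested this on Proxmox):

  [[inputs.ceph]]
    ceph_binary = "/usr/bin/ceph"
    ceph_config = "/etc/ceph/ceph.conf"
    ceph_user = "client.admin"               # or a restricted key
    socket_dir = "/var/run/ceph"
    gather_admin_socket_stats = true         # per-daemon stats, keep on the OSD nodes
    gather_cluster_stats = false             # cluster-wide stats, enable on one node only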

I also took a look at the Ceph mgr dashboard (for a few minutes), which I would have to enable on all(?) OSD nodes, and then build a construct to get to the dashboard on the active mgr.
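
If it turns out it only needs to be enabled as a mgr module, the commands would roughly be (untested on my side, port may differ per release):

  ceph mgr module enable dashboard
  # ask the cluster where the dashboard of the currently active mgr is reachable
  ceph mgr services
  # -> { "dashboard": "http://<active-mgr-host>:7000/" }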

I don't believe that I'm the first person thinking about a dedicated VM which is only used for monitoring tools (Icinga / Zabbix / Nagios / Dashboard / ceph -s) and getting the overall status (and performance data) from it. The only checks I need to keep directly on the OSD nodes are the OSD disk (I/O) and network, but thanks to InfluxDB ... I can put them on one dashboard :-)

@Kevin nice work :-) Because of PCI DSS, the Icinga2 master can't reach the Ceph nodes directly, so we have a satellite / agent construct to get the checks executed.
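
Roughly the usual command_endpoint pattern, not our real config (names below are made up):

  object Host "ceph-monitor-vm" {
    address = "192.0.2.10"                  // hypothetical address
    vars.agent_endpoint = name              // checks run on the agent, not on the master
    vars.role = "ceph-monitoring"
  }

  apply Service "ceph-health" {
    check_command = "ceph_health"           // assumes a CheckCommand wrapping check_ceph_health
    command_endpoint = host.vars.agent_endpoint
    assign where host.vars.role == "ceph-monitoring"
  }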

cu denny

ps. One bad thing: Telegraf can't read the admin sockets under /var/run/ceph/ because of the permissions set after the OSD services start (https://github.com/influxdata/telegraf/issues/1657). This was fixed, but I haven't checked whether the patch is also included in the Proxmox Ceph packages.
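
In the meantime, one possible workaround (my assumption, not what the upstream patch does) is to grant the telegraf user access to the sockets by hand:

  # admin sockets are normally owned by ceph:ceph and not writable by telegraf
  ls -l /var/run/ceph/
  # grant an ACL on the existing sockets (has to be redone after a daemon restart,
  # because the socket file is recreated)
  setfacl -m u:telegraf:rw /var/run/ceph/*.asok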
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
