On Fri, Jan 13, 2017 at 12:21 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Thu, 12 Jan 2017 14:35:32 +0000 Matthew Vernon wrote:
>
>> Hi,
>>
>> One of our ceph servers froze this morning (no idea why, alas). Ceph
>> noticed, moved things around, and when I ran ceph -s, it said:
>>
>> root@sto-1-1:~# ceph -s
>>     cluster 049fc780-8998-45a8-be12-d3b8b6f30e69
>>      health HEALTH_OK
>>      monmap e2: 3 mons at {sto-1-1=172.27.6.11:6789/0,sto-2-1=172.27.6.14:6789/0,sto-3-1=172.27.6.17:6789/0}
>>             election epoch 250, quorum 0,1,2 sto-1-1,sto-2-1,sto-3-1
>>      osdmap e9899: 540 osds: 480 up, 480 in
>>             flags sortbitwise
>>       pgmap v4549229: 20480 pgs, 25 pools, 7559 GB data, 1906 kobjects
>>             22920 GB used, 2596 TB / 2618 TB avail
>>                20480 active+clean
>>   client io 5416 kB/s rd, 6598 kB/s wr, 44 op/s rd, 53 op/s wr
>>
>> Is it intentional that it says HEALTH_OK when an entire server's worth
>> of OSDs is dead? You have to look quite hard at the output to notice
>> that 60 OSDs are unaccounted for.
>>
> What Wido said.
>
> That said, there have been several discussions, and nodding of heads, about
> how the current health states of Ceph are pitifully limited and for many
> people simply inaccurate. As in, separating them into something like OK,
> INFO, WARN and ERR, and having configuration options to determine which
> situation equates to which state.

If anyone is interested in working on this, I'd recommend tidying up the
existing health reporting as a first step:
http://tracker.ceph.com/issues/7192

Currently, the health messages are just a string and a severity: the first
step to being able to selectively silence them would be to formalize the
definitions and give each possible health condition a unique ID.

John

>
> Of course you should be monitoring your cluster with other tools like
> Nagios, covering everything from general availability on all network ports,
> disk usage and SMART wear-out levels of SSDs, down to the individual
> processes you'd expect to see running on a node:
> "PROCS OK: 8 processes with command name 'ceph-osd'"
>
> I lost single OSDs a few times and didn't notice either by looking at
> Nagios, as the recovery was so quick.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
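
For anyone wanting an external check along the lines Christian describes, here
is a rough sketch in Python, assuming Nagios plugin exit-code conventions and
that `ceph -s --format json` can be run on the monitoring host. The JSON
nesting of the osdmap counters differs between releases, so treat the field
lookups below as an assumption to verify against your own version rather than
a finished plugin:

    #!/usr/bin/env python
    # Sketch of an external OSD-count check using Nagios exit codes.
    # It compares the up/in OSD counts against the total, so a whole
    # host's worth of down OSDs gets flagged even while the cluster
    # itself still reports HEALTH_OK.
    # NOTE: the layout of `ceph -s --format json` varies by release;
    # the osdmap lookup below is an assumption, adjust as needed.
    import json
    import subprocess
    import sys

    def main():
        try:
            out = subprocess.check_output(["ceph", "-s", "--format", "json"])
        except (OSError, subprocess.CalledProcessError) as exc:
            print("UNKNOWN: could not run ceph -s: %s" % exc)
            sys.exit(3)

        status = json.loads(out.decode("utf-8"))
        osdmap = status["osdmap"]
        if "osdmap" in osdmap:
            # some releases nest the counters one level deeper
            osdmap = osdmap["osdmap"]

        total = osdmap["num_osds"]
        up = osdmap["num_up_osds"]
        num_in = osdmap["num_in_osds"]

        if up < total or num_in < total:
            print("CRITICAL: %d/%d OSDs up, %d/%d in" % (up, total, num_in, total))
            sys.exit(2)

        print("OK: %d/%d OSDs up and in" % (up, total))
        sys.exit(0)

    if __name__ == "__main__":
        main()

Run from NRPE or cron next to the check_procs test Christian quotes, something
like this would have flagged the 480-of-540 situation above immediately, even
though ceph -s itself still showed HEALTH_OK.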