On Fri, Jan 13, 2017 at 12:21 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Thu, 12 Jan 2017 14:35:32 +0000 Matthew Vernon wrote:
>
>> Hi,
>>
>> One of our ceph servers froze this morning (no idea why, alas). Ceph
>> noticed, moved things around, and when I ran ceph -s, it said:
>>
>> root@sto-1-1:~# ceph -s
>>     cluster 049fc780-8998-45a8-be12-d3b8b6f30e69
>>      health HEALTH_OK
>>      monmap e2: 3 mons at {sto-1-1=172.27.6.11:6789/0,sto-2-1=172.27.6.14:6789/0,sto-3-1=172.27.6.17:6789/0}
>>             election epoch 250, quorum 0,1,2 sto-1-1,sto-2-1,sto-3-1
>>      osdmap e9899: 540 osds: 480 up, 480 in
>>             flags sortbitwise
>>       pgmap v4549229: 20480 pgs, 25 pools, 7559 GB data, 1906 kobjects
>>             22920 GB used, 2596 TB / 2618 TB avail
>>                20480 active+clean
>>   client io 5416 kB/s rd, 6598 kB/s wr, 44 op/s rd, 53 op/s wr
>>
>> Is it intentional that it says HEALTH_OK when an entire server's worth
>> of OSDs is dead? You have to look quite hard at the output to notice
>> that 60 OSDs are unaccounted for.
>>
> What Wido said.
>
> That said, there have been several discussions, and nodding of heads, about
> how the current health states of Ceph are pitifully limited and for many
> people simply inaccurate. As in, separating them into something like OK,
> INFO, WARN and ERR, and having configuration options to determine which
> situation equates to which state.

If anyone is interested in working on this, I'd recommend tidying up the
existing health reporting as a first step:
http://tracker.ceph.com/issues/7192

Currently, the health messages are just a string and a severity: the first
step to being able to selectively silence them would be to formalize the
definitions and give each possible health condition a unique ID.

John

>
> Of course you should be monitoring your cluster with other tools like
> Nagios, covering everything from general availability on all network ports,
> disk usage and SMART wear-out levels of SSDs, down to the individual
> processes you'd expect to see running on a node:
> "PROCS OK: 8 processes with command name 'ceph-osd'"
>
> I lost single OSDs a few times and didn't notice either by looking at
> Nagios, as the recovery was so quick.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
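
For anyone wanting an external check along the lines Christian describes, here
is a rough sketch in Python, assuming Nagios plugin exit-code conventions and
that `ceph -s --format json` can be run on the monitoring host. The JSON
nesting of the osdmap counters differs between releases, so treat the field
lookups below as an assumption to verify against your own version rather than
a finished plugin:

    #!/usr/bin/env python
    # Sketch of an external OSD-count check using Nagios exit codes.
    # It compares the up/in OSD counts against the total, so a whole
    # host's worth of down OSDs gets flagged even while the cluster
    # itself still reports HEALTH_OK.
    # NOTE: the layout of `ceph -s --format json` varies by release;
    # the osdmap lookup below is an assumption, adjust as needed.
    import json
    import subprocess
    import sys

    def main():
        try:
            out = subprocess.check_output(["ceph", "-s", "--format", "json"])
        except (OSError, subprocess.CalledProcessError) as exc:
            print("UNKNOWN: could not run ceph -s: %s" % exc)
            sys.exit(3)

        status = json.loads(out.decode("utf-8"))
        osdmap = status["osdmap"]
        if "osdmap" in osdmap:
            # some releases nest the counters one level deeper
            osdmap = osdmap["osdmap"]

        total = osdmap["num_osds"]
        up = osdmap["num_up_osds"]
        num_in = osdmap["num_in_osds"]

        if up < total or num_in < total:
            print("CRITICAL: %d/%d OSDs up, %d/%d in" % (up, total, num_in, total))
            sys.exit(2)

        print("OK: %d/%d OSDs up and in" % (up, total))
        sys.exit(0)

    if __name__ == "__main__":
        main()

Run from NRPE or cron next to the check_procs test Christian quotes, something
like this would have flagged the 480-of-540 situation above immediately, even
though ceph -s itself still showed HEALTH_OK.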