On 17-10-04 08:51 AM, lists wrote:
Hi,
Yesterday I chowned our /var/lib/ceph to ceph, to completely finalize our Jewel
migration, and noticed something interesting.
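
For completeness, the procedure was roughly the following (a minimal sketch
of the usual Jewel ownership change, done host by host; the unit names are
the standard systemd ones and may need adjusting to your setup):

# Keep CRUSH from rebalancing while the OSDs on this host are down.
ceph osd set noout

# Stop the OSDs, hand their data over to the ceph user (Jewel's
# daemons run as ceph:ceph instead of root), and start them again.
systemctl stop ceph-osd.target
chown -R ceph:ceph /var/lib/ceph
systemctl start ceph-osd.target

# After recovery has finished:
ceph osd unset noout
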
After I brought the OSDs I had just chowned back up, the system had some
recovery to do. During that recovery, the system went to HEALTH_ERR for a
short moment.
See below for consecutive ceph -s outputs:
[..]
root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_ERR
            2 pgs are stuck inactive for more than 300 seconds
^^ that.
            761 pgs degraded
            2 pgs recovering
            181 pgs recovery_wait
            2 pgs stuck inactive
            273 pgs stuck unclean
            543 pgs undersized
            recovery 1394085/8384166 objects degraded (16.628%)
            4/24 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 256, quorum 0,1,2 0,1,2
     osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
            flags noout,sortbitwise,require_jewel_osds
      pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
            32724 GB used, 56656 GB / 89380 GB avail
            1394085/8384166 objects degraded (16.628%)
                 543 active+undersized+degraded
                 310 active+clean
                 181 active+recovery_wait+degraded
                  26 active+degraded
                  13 active
                   9 activating+degraded
                   4 activating
                   2 active+recovering+degraded
  recovery io 133 MB/s, 37 objects/s
  client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
[..]
It lasted only a very short time, but it did worry me a bit. Fortunately, we
went back to the expected HEALTH_WARN very quickly, and everything finished
fine, so I guess there is nothing to worry about.
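
In case anyone wants to reproduce the observation: the transition was short
enough that occasional ceph -s snapshots can miss it, so a simple way to
catch it is to log the health status while the OSDs restart (a trivial
sketch, nothing Ceph-specific):

# Log the cluster health once per second; short-lived HEALTH_ERR
# states show up in the log even if you blink.
while true; do
    echo "$(date +%T) $(ceph health)"
    sleep 1
done
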
But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR?
No SMART errors, and apply and commit latencies are all within the expected
ranges; the system is basically healthy.
Curious :-)
Since Jewel (AFAIR), when (re)starting OSDs, PG status is reset to "never
contacted", resulting in "pgs are stuck inactive for more than 300 seconds"
being reported until the OSDs regain connections between themselves.
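
You can see this for yourself while it happens: the 300-second value is the
monitor's mon_pg_stuck_threshold option, and the affected PGs can be listed
directly (a small sketch; mon.0 matches the mon names in your monmap, and
the daemon command must run on that mon's host):

# List the PGs the monitors currently consider stuck inactive.
ceph pg dump_stuck inactive

# Show the threshold behind the "stuck ... for more than 300 seconds"
# message (default: 300 seconds).
ceph daemon mon.0 config get mon_pg_stuck_threshold
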
--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/