why sudden (and brief) HEALTH_ERR

Hi,

Yesterday I chowned our /var/lib/ceph to the ceph user, to completely finalize our Jewel migration, and noticed something interesting.
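For reference, what I did per host was roughly this (just a sketch; I'm assuming the usual systemd units here, and osd.0 is only an example):

    ceph osd set noout                 # keep CRUSH from rebalancing while the OSDs are down
    systemctl stop ceph-osd@0          # repeated for each OSD on the host
    chown -R ceph:ceph /var/lib/ceph   # hand ownership from root to the ceph user
    systemctl start ceph-osd@0         # bring the OSDs back up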

After I brought the OSDs I had just chowned back up, the cluster had some recovery to do. During that recovery, it went to HEALTH_ERR for a short moment.

See below for four consecutive ceph -s outputs:

root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_WARN
            1025 pgs degraded
            1 pgs recovering
            60 pgs recovery_wait
            307 pgs stuck unclean
            964 pgs undersized
            recovery 2477548/8384034 objects degraded (29.551%)
            6/24 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 256, quorum 0,1,2 0,1,2
     osdmap e10222: 24 osds: 18 up, 24 in; 964 remapped pgs
            flags noout,sortbitwise,require_jewel_osds
      pgmap v36531103: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
            32723 GB used, 56657 GB / 89380 GB avail
            2477548/8384034 objects degraded (29.551%)
                 964 active+undersized+degraded
                  63 active+clean
                  60 active+recovery_wait+degraded
                   1 active+recovering+degraded
recovery io 63410 kB/s, 15 objects/s
  client io 4348 kB/s wr, 0 op/s rd, 630 op/s wr
root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_WARN
            942 pgs degraded
            1 pgs recovering
            118 pgs recovery_wait
            297 pgs stuck unclean
            823 pgs undersized
            recovery 2104751/8384079 objects degraded (25.104%)
            6/24 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 256, quorum 0,1,2 0,1,2
     osdmap e10224: 24 osds: 18 up, 24 in; 823 remapped pgs
            flags noout,sortbitwise,require_jewel_osds
      pgmap v36531118: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
            32723 GB used, 56657 GB / 89380 GB avail
            2104751/8384079 objects degraded (25.104%)
                 823 active+undersized+degraded
                 146 active+clean
                 118 active+recovery_wait+degraded
                   1 active+recovering+degraded
recovery io 61945 kB/s, 16 objects/s
  client io 2718 B/s rd, 5997 kB/s wr, 0 op/s rd, 638 op/s wr
root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_ERR
            2 pgs are stuck inactive for more than 300 seconds
            761 pgs degraded
            2 pgs recovering
            181 pgs recovery_wait
            2 pgs stuck inactive
            273 pgs stuck unclean
            543 pgs undersized
            recovery 1394085/8384166 objects degraded (16.628%)
            4/24 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 256, quorum 0,1,2 0,1,2
     osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
            flags noout,sortbitwise,require_jewel_osds
      pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
            32724 GB used, 56656 GB / 89380 GB avail
            1394085/8384166 objects degraded (16.628%)
                 543 active+undersized+degraded
                 310 active+clean
                 181 active+recovery_wait+degraded
                  26 active+degraded
                  13 active
                   9 activating+degraded
                   4 activating
                   2 active+recovering+degraded
recovery io 133 MB/s, 37 objects/s
  client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_WARN
            725 pgs degraded
            27 pgs peering
            2 pgs recovering
            207 pgs recovery_wait
            269 pgs stuck unclean
            516 pgs undersized
            recovery 1325870/8384202 objects degraded (15.814%)
            3/24 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 256, quorum 0,1,2 0,1,2
     osdmap e10233: 24 osds: 21 up, 24 in; 418 remapped pgs
            flags noout,sortbitwise,require_jewel_osds
      pgmap v36531161: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
            32724 GB used, 56656 GB / 89380 GB avail
            1325870/8384202 objects degraded (15.814%)
                 516 active+undersized+degraded
                 336 active+clean
                 207 active+recovery_wait+degraded
                  27 peering
                   2 active+recovering+degraded
recovery io 62886 kB/s, 15 objects/s
  client io 3586 kB/s wr, 0 op/s rd, 251 op/s wr

It lasted only very briefly, but it did worry me a bit. Fortunately, we went back to the expected HEALTH_WARN very quickly and everything finished fine, so I guess there is nothing to worry about.

But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR?
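Comparing the outputs above, the HEALTH_ERR snapshot is the only one that lists "2 pgs are stuck inactive for more than 300 seconds", so I assume that is the check that tipped us from WARN to ERR, presumably while those two pgs were still peering/activating. If it happens again, I'd try to capture which pgs were affected with something like:

    ceph health detail              # shows the checks behind the ERR and the pg ids involved
    ceph pg dump_stuck inactive 300 # pgs that have been inactive for longer than 300 seconds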

There are no SMART errors, apply and commit latencies are all within the expected ranges, and the system is basically healthy.

Curious :-)

MJ
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


