Hello,

during OSD restarts with Jewel (10.2.5 and 10.2.6 at least) I've seen
"stuck inactive for more than 300 seconds" errors like this when
observing things with "watch ceph -s":

---
     health HEALTH_ERR
            59 pgs are stuck inactive for more than 300 seconds
            223 pgs degraded
            74 pgs peering
            84 pgs stale
            59 pgs stuck inactive
            297 pgs stuck unclean
            223 pgs undersized
            recovery 38420/179352 objects degraded (21.422%)
            2/16 in osds are down
---

Now this is neither reflected in any logs, nor true, of course (the
restarts take a few seconds per OSD and the cluster is fully recovered
to HEALTH_OK in 12 seconds or so). But it surely is a good scare for
somebody not doing this on a test cluster.

Anybody else seeing this?

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
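
For anyone who wants to reproduce the observation, a minimal sketch;
this assumes systemd-managed OSDs (the restart command differs on
sysvinit/upstart setups) and uses osd.3 purely as an example ID:

---
# Terminal 1: watch the cluster health while the OSDs restart
watch ceph -s

# Terminal 2: restart a single OSD (osd.3 is an arbitrary example)
systemctl restart ceph-osd@3

# The HEALTH_ERR / "stuck inactive" lines flash up for a few seconds,
# then the cluster settles back to HEALTH_OK.
---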