On Thu, Mar 9, 2017 at 3:04 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> During OSD restarts with Jewel (10.2.5 and 10.2.6 at least) I've seen
> "stuck inactive for more than 300 seconds" errors like this when
> observing things with "watch ceph -s":
> ---
>      health HEALTH_ERR
>             59 pgs are stuck inactive for more than 300 seconds
>             223 pgs degraded
>             74 pgs peering
>             84 pgs stale
>             59 pgs stuck inactive
>             297 pgs stuck unclean
>             223 pgs undersized
>             recovery 38420/179352 objects degraded (21.422%)
>             2/16 in osds are down
> ---
>
> Now this is neither reflected in any logs, nor true of course: the
> restarts take a few seconds per OSD and the cluster is fully recovered
> to HEALTH_OK in 12 seconds or so.
>
> But it surely is a good scare for somebody not doing this on a test
> cluster.
>
> Anybody else seeing this?

Definitely. "ceph -w" shows them as well. They indeed always clear after
a few seconds.

> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

Kind regards,

Ruben Kerkhof
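
P.S. If you want to convince yourself (or a nervous operator) that the
errors really are transient, a timestamped poll of "ceph health" around
the restart makes the brief HEALTH_ERR window and the return to
HEALTH_OK obvious. A rough sketch; the OSD id (0) and the 60-second
window are just examples, and the systemctl unit name assumes a
systemd-based Jewel install. Adjust to taste:

---
#!/bin/sh
# Restart one OSD, then log cluster health once per second for a
# minute so the transient HEALTH_ERR and the recovery both show up
# with timestamps.
systemctl restart ceph-osd@0    # example OSD id

for i in $(seq 1 60); do
    echo "$(date '+%H:%M:%S') $(ceph health)"
    sleep 1
done
---

On my clusters the HEALTH_ERR lines in that log span only a handful of
seconds before flipping back to HEALTH_OK, matching what Christian
describes.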