Re: pg incomplete state

On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson <jpr@xxxxxxx> wrote:
> Hi folks
>
> I've been rebuilding drives in my cluster to add space.  This has gone
> well so far.
>
> After the last batch of rebuilds, I'm left with one placement group in
> an incomplete state.
>
> [sudo] password for jpr:
> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
> pg 3.ea is stuck inactive since forever, current state incomplete, last
> acting [30,11]
> pg 3.ea is stuck unclean since forever, current state incomplete, last
> acting [30,11]
> pg 3.ea is incomplete, acting [30,11]
>
> I've restarted both OSDs a few times, but it hasn't cleared the error.
>
> On the primary I see errors in the log related to slow requests:
>
> 2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
> 3 included below; oldest blocked for > 31.922487 secs
> 2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.00000000a044 [read
> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
> 2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.00000000a044 [read
> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
> 2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
> 3.e4bd50ea) v4 currently reached pg
>
> Notes online suggest this is an issue with the journal and that it may
> be possible to export and rebuild the pg.  I don't have Firefly.
>
> https://ceph.com/community/incomplete-pgs-oh-my/
>
> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
> but missing entirely on osd.30 (the primary).
>
> on osd.30 (primary):
>
> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
> 0       /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>
> on osd.11 (secondary):
>
> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
> 63G     /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>
> This makes some sense, since my disk drive rebuilding activity
> reformatted the primary osd.30.  It also gives me some hope that my
> data is not lost.
>
> I understand incomplete means a problem with the journal, but is there
> a way to dig deeper into this, or is it possible to get the
> secondary's data to take over?
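
Before changing anything, one way to dig deeper is to query the PG
directly; a sketch, assuming the ceph CLI with admin credentials on a
monitor node:

    # Dump the peering state for the stuck PG; the "recovery_state"
    # section shows why peering is blocked and which OSDs the primary
    # still wants to probe.
    ceph pg 3.ea query

The query output should confirm whether the primary knows about the
copy on osd.11.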

If you're running an older version of Ceph (Firefly or earlier,
maybe?), "incomplete" can also mean "not enough replicas". It looks
like that's what you're hitting here, if osd.11 is not reporting any
issues. If so, setting min_size on this pool to 1 until backfilling
finishes should let you get going; for example:
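
A minimal sketch, assuming the pool containing pg 3.ea is named "rbd"
(the name is an assumption; pg 3.ea lives in pool id 3, so confirm it
with the first command):

    # Find the name of pool id 3 and its current size/min_size.
    ceph osd dump | grep 'pool 3 '

    # Record the current min_size so it can be restored later.
    ceph osd pool get rbd min_size

    # Let pg 3.ea go active with its single complete replica (osd.11)
    # while osd.30 backfills.
    ceph osd pool set rbd min_size 1

    # Once "ceph health" is back to HEALTH_OK, restore the original
    # value; min_size 1 means writes are acknowledged with one copy,
    # which risks data loss if that OSD fails.
    ceph osd pool set rbd min_size 2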
-Greg


