Re: pg incomplete state

I don't remember the exact timeline, but min_size is designed to
prevent data loss from under-replicated objects (ie, if you only have
1 copy out of 3 and you lose that copy, you're in trouble, so maybe
you don't want it to go active). Unfortunately it could also prevent
the OSDs from replicating/backfilling the data to new OSDs in the case
where you only had one copy left — that's fixed now, but wasn't
initially. And in that case it reported the PG as incomplete (in later
versions, PGs in this state get reported as undersized).
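
If you want to double-check what that PG thinks is going on, something
like the following should help (I'm going from memory on the exact
output your version prints, so treat it as a sketch):

    ceph health detail      # lists which PGs are incomplete/stuck
    ceph pg 3.ea query      # peering/recovery state, which OSDs it's probing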

So if you drop the min_size to 1, it will allow new writes to the PG
(which might not be great), but it will also let the OSD go into the
backfilling state. (At least, assuming the number of replicas is the
only problem.) Based on your description of the problem I think this
is the state you're in, and decreasing min_size is the solution.
*shrug*
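
Something like this should do it (I'm not sure what pool 3 is called
on your cluster, so the pool name below is a placeholder; "ceph osd
lspools" will tell you the real one):

    ceph osd pool set <poolname> min_size 1
    # ...wait for backfill/recovery to finish, then put it back:
    ceph osd pool set <poolname> min_size 2
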
You could also try and do something like extracting the PG from osd.11
and copying it to osd.30, but that's quite tricky without the modern
objectstore tool stuff, and I don't know if any of that works on
dumpling (which it sounds like you're on — incidentally, you probably
want to upgrade from that).
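
For what it's worth, on releases that ship ceph-objectstore-tool the
export/import would look roughly like this (both OSDs stopped first;
the exact flags may differ by version, so this is only a sketch):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
        --journal-path /var/lib/ceph/osd/ceph-11/journal \
        --pgid 3.ea --op export --file /tmp/pg3.ea.export
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
        --journal-path /var/lib/ceph/osd/ceph-30/journal \
        --pgid 3.ea --op import --file /tmp/pg3.ea.export
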
-Greg

On Wed, Oct 21, 2015 at 12:55 PM, John-Paul Robinson <jpr@xxxxxxx> wrote:
> Greg,
>
> Thanks for the insight.  I suspect things are somewhat sane given that I
> did erase the primary (osd.30) and the secondary (osd.11) still contains
> pg data.
>
> If I may, could you clarify the process of backfill a little?
>
> I understand the min_size allows I/O on the object to resume while there
> are only that many replicas (ie. 1 once changed) and this would let
> things move forward.
>
> I would expect, however, that some backfill would already be on-going
> for pg 3.ea on osd.30.  As far as I can tell, there isn't anything
> happening.  The pg 3.ea directory is just as empty today as it was
> yesterday.
>
> Will changing the min_size actually trigger backfill to begin for an
> object if it has stalled or never got started?
>
> An alternative idea I had was to take osd.30 back out of the cluster so
> that pg 3.ea [30,11] would get mapped to some other osd to maintain
> replication.  This seems a bit heavy handed though, given that only this
> one pg is affected.
>
> Thanks for any follow up.
>
> ~jpr
>
>
> On 10/21/2015 01:21 PM, Gregory Farnum wrote:
>> On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson <jpr@xxxxxxx> wrote:
>>> Hi folks
>>>
>>> I've been rebuilding drives in my cluster to add space.  This has gone
>>> well so far.
>>>
>>> After the last batch of rebuilds, I'm left with one placement group in
>>> an incomplete state.
>>>
>>> [sudo] password for jpr:
>>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
>>> pg 3.ea is stuck inactive since forever, current state incomplete, last
>>> acting [30,11]
>>> pg 3.ea is stuck unclean since forever, current state incomplete, last
>>> acting [30,11]
>>> pg 3.ea is incomplete, acting [30,11]
>>>
>>> I've restarted both OSD a few times but it hasn't cleared the error.
>>>
>>> On the primary I see errors in the log related to slow requests:
>>>
>>> 2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
>>> 3 included below; oldest blocked for > 31.922487 secs
>>> 2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
>>> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
>>> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.00000000a044 [read
>>> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
>>> 2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
>>> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
>>> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.00000000a044 [read
>>> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
>>> 2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
>>> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
>>> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
>>> 3.e4bd50ea) v4 currently reached pg
>>>
>>> Notes online suggest this is an issue with the journal and that it may
>>> be possible to export and rebuild the pg.  I don't have firefly.
>>>
>>> https://ceph.com/community/incomplete-pgs-oh-my/
>>>
>>> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
>>> but missing entirely on osd.30 (the primary).
>>>
>>> on osd.30 (primary):
>>>
>>> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
>>> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>> 0       /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>>
>>> on osd.11 (secondary):
>>>
>>> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
>>> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>> 63G     /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>>
>>> This makes some sense since my disk drive rebuilding activity
>>> reformatted the primary osd.30.  It also gives me some hope that my data
>>> is not lost.
>>>
>>> I understand incomplete means a problem with the journal, but is there
>>> a way to dig deeper into this, or is it possible to get the secondary's
>>> data to take over?
>> If you're running an older version of Ceph (Firefly or earlier,
>> maybe?), "incomplete" can also mean "not enough replicas". It looks
>> like that's what you're hitting here, if osd.11 is not reporting any
>> issues. If so, simply setting the min_size on this pool to 1 until the
>> backfilling is done should let you get going.
>> -Greg
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



