I'd like to not have to null them if possible; there's nothing
outlandishly valuable on them, it's more the time to reprovision (users
have stuff on there, mainly testing, but I have a nasty feeling some
users won't have backed up their test instances).

When you say complicated and fragile, could you expand? Thanks again!

Joel

On Wed, Mar 11, 2015 at 1:21 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Ok, you lost all copies from an interval where the pgs went active.
> The recovery from this is going to be complicated and fragile. Are
> the pools valuable?
> -Sam
>
>
> On 03/11/2015 03:35 AM, joel.merrick@xxxxxxxxx wrote:
>>
>> For clarity too, I've tried to drop the min_size before as
>> suggested; it doesn't make a difference, unfortunately.
>>
>> On Wed, Mar 11, 2015 at 9:50 AM, joel.merrick@xxxxxxxxx
>> <joel.merrick@xxxxxxxxx> wrote:
>>>
>>> Sure thing; n.b. I increased the pg count to see if it would help.
>>> Alas not. :)
>>>
>>> Thanks again!
>>>
>>> health_detail
>>> https://gist.github.com/199bab6d3a9fe30fbcae
>>>
>>> osd_dump
>>> https://gist.github.com/499178c542fa08cc33bb
>>>
>>> osd_tree
>>> https://gist.github.com/02b62b2501cbd684f9b2
>>>
>>> Randomly selected queries:
>>> queries/0.19.query
>>> https://gist.github.com/f45fea7c85d6e665edf8
>>> queries/1.a1.query
>>> https://gist.github.com/dd68fbd5e862f94eb3be
>>> queries/7.100.query
>>> https://gist.github.com/d4fd1fb030c6f2b5e678
>>> queries/7.467.query
>>> https://gist.github.com/05dbcdc9ee089bd52d0c
>>>
>>> On Tue, Mar 10, 2015 at 2:49 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>
>>>> Yeah, get a ceph pg query on one of the stuck ones.
>>>> -Sam
>>>>
>>>> On Tue, 2015-03-10 at 14:41 +0000, joel.merrick@xxxxxxxxx wrote:
>>>>>
>>>>> Stuck unclean and stuck inactive. I can fire up a full query and
>>>>> health dump somewhere useful if you want (full pg query info on
>>>>> the ones listed in health detail, tree, osd dump, etc.). There
>>>>> were blocked_by operations that no longer exist after doing the
>>>>> OSD addition.
>>>>>
>>>>> Side note: I spent some time yesterday writing some bash to do
>>>>> this programmatically (it might be useful to others; I'll throw
>>>>> it on GitHub).
>>>>>
>>>>> On Tue, Mar 10, 2015 at 1:41 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> What do you mean by "unblocked" but still "stuck"?
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, 2015-03-09 at 22:54 +0000, joel.merrick@xxxxxxxxx wrote:
>>>>>>>
>>>>>>> On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> You'll probably have to recreate osds with the same ids (empty
>>>>>>>> ones), let them boot, stop them, and mark them lost. There is
>>>>>>>> a feature in the tracker to improve this behavior:
>>>>>>>> http://tracker.ceph.com/issues/10976
>>>>>>>> -Sam
>>>>>>>
>>>>>>> Thanks Sam, I've readded the OSDs; they became unblocked, but
>>>>>>> the same number of pgs are still stuck. I looked at them in
>>>>>>> some more detail and it seems they all have num_bytes='0'.
>>>>>>> Tried a repair too, for good measure. Still nothing, I'm
>>>>>>> afraid.
>>>>>>>
>>>>>>> Does this mean some underlying catastrophe has happened and
>>>>>>> they are never going to recover? Following on, would that cause
>>>>>>> data loss? There are no missing objects, and I'm hoping there's
>>>>>>> appropriate checksumming / replicas to balance that out, but
>>>>>>> now I'm not so sure.
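>>>>>>>
>>>>>>> (For reference, the per-pg check and the repair were along
>>>>>>> these lines, with 7.100 as one of the stuck pgs:
>>>>>>>
>>>>>>>     ceph pg 7.100 query | grep num_bytes
>>>>>>>     ceph pg repair 7.100
>>>>>>>
>>>>>>> repeated for each pg listed in health detail.)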
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> Joel
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> $ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'
>>
>>
>>
>
--
$ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'
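P.S. For anyone following along: the bash I mention below for pulling
pg queries programmatically is roughly the following (a sketch, not
battle-tested; it assumes the "pg <pgid> is stuck ..." lines that
ceph health detail prints, and it's what produced the queries/*.query
files linked above):

    #!/usr/bin/env bash
    # Capture a full 'ceph pg query' for every pg that
    # 'ceph health detail' reports as stuck, one file per pg.
    set -e
    mkdir -p queries
    ceph health detail \
        | awk '/^pg .* is stuck/ {print $2}' \
        | sort -u \
        | while read -r pg; do
              ceph pg "$pg" query > "queries/${pg}.query"
          done

I'll tidy it up before it goes on GitHub. Also for the archives,
re-adding the lost OSD ids per Sam's suggestion was roughly the
sequence below (sysvinit here, osd.12 as an example id; the usual
auth/crush registration steps are elided):

    ceph osd create                          # reuses the lowest free id, e.g. 12
    ceph-osd -i 12 --mkfs --mkkey            # empty data dir for the recreated id
    service ceph start osd.12                # let it boot and peer
    service ceph stop osd.12
    ceph osd lost 12 --yes-i-really-mean-it

And the earlier min_size drop was just "ceph osd pool set <pool>
min_size 1" on each affected pool.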