Thanks Sam, I'll take a look. Seems sensible enough and worth a shot.
We'll probably call it a day after this and flatten it, but I'm
wondering if it's possible some rbd devices may have missed these pgs
and could be exportable? Will have a tinker!

On Wed, Mar 11, 2015 at 7:06 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> For each of those pgs, you'll need to identify the pg copy you want to
> be the winner and either
> 1) Remove all of the other ones using ceph-objectstore-tool and
> hopefully the winner you left alone will allow the pg to recover and
> go active.
> 2) Export the winner using ceph-objectstore-tool, use
> ceph-objectstore-tool to delete *all* copies of the pg, use
> force_create_pg to recreate the pg empty, then use
> ceph-objectstore-tool to do a rados import on the exported pg copy.
>
> Also, the pgs which are still down still have replicas which need to
> be brought back or marked lost.
> -Sam
>
> On 03/11/2015 07:29 AM, joel.merrick@xxxxxxxxx wrote:
>>
>> I'd like to not have to null them if possible; there's nothing
>> outlandishly valuable, it's more the time to reprovision (users have
>> stuff on there, mainly testing, but I have a nasty feeling some users
>> won't have backed up their test instances). When you say complicated
>> and fragile, could you expand?
>>
>> Thanks again!
>> Joel
>>
>> On Wed, Mar 11, 2015 at 1:21 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>
>>> Ok, you lost all copies from an interval where the pgs went active.
>>> The recovery from this is going to be complicated and fragile. Are
>>> the pools valuable?
>>> -Sam
>>>
>>> On 03/11/2015 03:35 AM, joel.merrick@xxxxxxxxx wrote:
>>>>
>>>> For clarity too, I've tried to drop the min_size before as
>>>> suggested; it doesn't make a difference, unfortunately.
>>>>
>>>> On Wed, Mar 11, 2015 at 9:50 AM, joel.merrick@xxxxxxxxx
>>>> <joel.merrick@xxxxxxxxx> wrote:
>>>>>
>>>>> Sure thing, n.b. I increased the pg count to see if it would help.
>>>>> Alas not. :)
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> health_detail
>>>>> https://gist.github.com/199bab6d3a9fe30fbcae
>>>>>
>>>>> osd_dump
>>>>> https://gist.github.com/499178c542fa08cc33bb
>>>>>
>>>>> osd_tree
>>>>> https://gist.github.com/02b62b2501cbd684f9b2
>>>>>
>>>>> Randomly selected queries:
>>>>> queries/0.19.query
>>>>> https://gist.github.com/f45fea7c85d6e665edf8
>>>>> queries/1.a1.query
>>>>> https://gist.github.com/dd68fbd5e862f94eb3be
>>>>> queries/7.100.query
>>>>> https://gist.github.com/d4fd1fb030c6f2b5e678
>>>>> queries/7.467.query
>>>>> https://gist.github.com/05dbcdc9ee089bd52d0c
>>>>>
>>>>> On Tue, Mar 10, 2015 at 2:49 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> Yeah, get a ceph pg query on one of the stuck ones.
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, 2015-03-10 at 14:41 +0000, joel.merrick@xxxxxxxxx wrote:
>>>>>>>
>>>>>>> Stuck unclean and stuck inactive. I can fire up a full query and
>>>>>>> health dump somewhere useful if you want (full pg query info on
>>>>>>> the ones listed in health detail, tree, osd dump etc). There were
>>>>>>> blocked_by operations that no longer exist after doing the OSD
>>>>>>> addition.
>>>>>>>
>>>>>>> Side note, I spent some time yesterday writing some bash to do
>>>>>>> this programmatically (might be useful to others, will throw it
>>>>>>> on github).
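A minimal sketch of that sort of loop, for anyone who wants to collect
the same data; the stuck pg ids are assumed to be scraped from
"ceph health detail", and the parsing and output paths are illustrative
rather than the actual script mentioned above:

  # Dump osd tree/dump plus a pg query for every pg reported as stuck.
  mkdir -p queries
  ceph osd tree      > osd_tree
  ceph osd dump      > osd_dump
  ceph health detail > health_detail

  # Lines look like "pg 7.100 is stuck unclean for ...", so $2 is the pgid.
  awk '/^pg .* is stuck/ {print $2}' health_detail | sort -u |
  while read pgid; do
      ceph pg "$pgid" query > "queries/${pgid}.query"
  done
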
>>>>>>>
>>>>>>> On Tue, Mar 10, 2015 at 1:41 PM, Samuel Just <sjust@xxxxxxxxxx>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> What do you mean by "unblocked" but still "stuck"?
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Mon, 2015-03-09 at 22:54 +0000, joel.merrick@xxxxxxxxx wrote:
>>>>>>>>>
>>>>>>>>> On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just <sjust@xxxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> You'll probably have to recreate osds with the same ids (empty
>>>>>>>>>> ones), let them boot, stop them, and mark them lost. There is
>>>>>>>>>> a feature in the tracker to improve this behavior:
>>>>>>>>>> http://tracker.ceph.com/issues/10976
>>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> Thanks Sam, I've re-added the OSDs; they became unblocked, but
>>>>>>>>> there are still the same number of pgs stuck. I looked at them
>>>>>>>>> in some more detail and it seems they all have num_bytes='0'.
>>>>>>>>> Tried a repair too, for good measure. Still nothing, I'm afraid.
>>>>>>>>>
>>>>>>>>> Does this mean some underlying catastrophe has happened and
>>>>>>>>> they are never going to recover? Following on, would that cause
>>>>>>>>> data loss? There are no missing objects and I'm hoping there's
>>>>>>>>> appropriate checksumming / replicas to balance that out, but
>>>>>>>>> now I'm not so sure.
>>>>>>>>>
>>>>>>>>> Thanks again,
>>>>>>>>> Joel
>>>>>
>>>>> --
>>>>> $ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'

--
$ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'
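For reference, here is a rough sketch of the second option Sam describes
at the top of the thread: export the surviving copy, delete every copy
of the pg, recreate it empty, then bring the export back in. The pgid,
osd ids and paths below are made up for illustration, the exact
ceph-objectstore-tool options vary between releases (check --help on
your version), and each OSD must be stopped while the tool runs against
its store:

  # Assume pg 7.100, with the copy we want to win on osd.3.

  # 1) Export the chosen copy from the stopped osd.3.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --pgid 7.100 --op export --file /root/7.100.export

  # 2) Remove *all* copies of the pg; repeat on every OSD that holds
  #    7.100, including osd.3.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --pgid 7.100 --op remove

  # 3) With the OSDs back up, recreate the pg empty.
  ceph pg force_create_pg 7.100

  # 4) Import the exported copy (again with the target OSD stopped),
  #    per Sam's last step.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --op import --file /root/7.100.export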