Yep, looks like the same issue:

2016-06-02 00:45:27.977064 7fc11b4e9700 10 osd.17 pg_epoch: 11108 pg[34.4a( v 11104'1080336 lc 11104'1080335 (11069'1077294,11104'1080336] local-les=11108 n=50593 ec=2051 les/c/f 11104/11104/0 11106/11107/11107) [17,13] r=0 lpr=11107 pi=11101-11106/3 crt=11104'1080336 lcod 0'0 mlcod 0'0 inactive m=1 u=1] search_for_missing 34:52a5cefb:::default.3653921.2__shadow_.69E1Tth4Y2Q7m0VKNbQdJe-9BgYks6I_1:head 11104'1080336 also missing on osd.13 (last_backfill MAX but with wrong sort order)

Thanks!
-Sam

On Wed, Jun 1, 2016 at 4:04 PM, Uwe Mesecke <uwe@xxxxxxxxxxx> wrote:
> Hey Sam,
>
> Glad you found the bug. As another data point, I just did the whole round of
> "healthy -> set sortbitwise -> osd restarts -> unfound objects -> unset
> sortbitwise -> healthy" with the debug settings you described earlier.
>
> I uploaded the logfiles:
>
> https://www.dropbox.com/s/f5hhptbtocbxe1k/ceph-osd.13.log.gz
> https://www.dropbox.com/s/kau9cjqfhmtpd89/ceph-osd.17.log.gz
>
> The PG with the unfound object is "34.4a", and it shows log messages similar
> to the ones you noted in the issue.
>
> The cluster runs jewel 10.2.1 and was created a long time ago; I think it
> started on giant.
>
> Thanks again!
>
> Uwe
>
>> On 02.06.2016 at 00:19, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>
>> http://tracker.ceph.com/issues/16113
>>
>> I think I found the bug. Thanks for the report! Turning off
>> sortbitwise should be an OK workaround for the moment.
>> -Sam
>>
>> On Wed, Jun 1, 2016 at 3:00 PM, Diego Castro
>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>> Yes, it was created as Hammer.
>>> I haven't faced any issues during the upgrade (apart from the well-known
>>> systemd one), and after that the cluster didn't show any suspicious
>>> behavior.
>>>
>>> ---
>>> Diego Castro / The CloudFather
>>> GetupCloud.com - Eliminamos a Gravidade
>>>
>>> 2016-06-01 18:57 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>>>>
>>>> Was this cluster upgraded to jewel? If so, at what version did it start?
>>>> -Sam
>>>>
>>>> On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro
>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>> Hello Samuel, I'm a bit afraid of restarting my OSDs again, so I'll
>>>>> wait until the weekend to push the config.
>>>>> BTW, I just unset the sortbitwise flag.
>>>>>
>>>>> ---
>>>>> Diego Castro / The CloudFather
>>>>> GetupCloud.com - Eliminamos a Gravidade
>>>>>
>>>>> 2016-06-01 13:39 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>>>>>>
>>>>>> Can either of you reproduce this with logs? That would make it a lot
>>>>>> easier to track down if it's a bug. I'd want
>>>>>>
>>>>>> debug osd = 20
>>>>>> debug ms = 1
>>>>>> debug filestore = 20
>>>>>>
>>>>>> on all of the osds for a particular pg, from when it is clean until it
>>>>>> develops an unfound object.
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
>>>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>>>> Hello Uwe, I also have the sortbitwise flag enabled, and I see
>>>>>>> exactly the same behavior as you.
>>>>>>> Perhaps this is also the root of my issues; does anybody know if it
>>>>>>> is safe to disable it?
>>>>>>>
>>>>>>> ---
>>>>>>> Diego Castro / The CloudFather
>>>>>>> GetupCloud.com - Eliminamos a Gravidade
>>>>>>>
>>>>>>> 2016-06-01 7:17 GMT-03:00 Uwe Mesecke <uwe@xxxxxxxxxxx>:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 01.06.2016 at 10:25, Diego Castro
>>>>>>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hello, I have a cluster running Jewel 10.2.0, 25 OSDs + 4 mons.
>>>>>>>>> Today my cluster suddenly went unhealthy, with lots of stuck PGs
>>>>>>>>> due to unfound objects; there were no disk failures or node
>>>>>>>>> crashes, it just went bad.
>>>>>>>>>
>>>>>>>>> I managed to bring the cluster back to a healthy state by marking
>>>>>>>>> the lost objects for deletion with
>>>>>>>>> "ceph pg <id> mark_unfound_lost delete".
>>>>>>>>> Aside from having no idea why the cluster went bad in the first
>>>>>>>>> place, I realized that restarting the osd daemons to unlock stuck
>>>>>>>>> clients made the cluster unhealthy again, with PGs stuck once more
>>>>>>>>> due to unfound objects.
>>>>>>>>>
>>>>>>>>> Has anyone else seen this issue?
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I also ran into that problem after upgrading to jewel. In my case I
>>>>>>>> was able to somewhat correlate this behavior with setting the
>>>>>>>> sortbitwise flag after the upgrade. When the flag is set, these
>>>>>>>> unfound objects start popping up after some time. Restarting osds
>>>>>>>> just makes it worse and/or makes these problems appear faster. When
>>>>>>>> looking at the missing objects I can see that sometimes even region
>>>>>>>> or zone configuration objects for radosgw are missing, which I know
>>>>>>>> are there because radosgw was using them just before.
>>>>>>>>
>>>>>>>> After unsetting the sortbitwise flag, the PGs go back to normal, all
>>>>>>>> previously unfound objects are found, and the cluster becomes
>>>>>>>> healthy again.
>>>>>>>>
>>>>>>>> Of course I'm not sure whether this is the real root of the problem
>>>>>>>> or just a coincidence, but I can reproduce this behavior every time.
>>>>>>>>
>>>>>>>> So for now the cluster is running without this flag.
:-/ >>>>>>>> >>>>>>>> Regards, >>>>>>>> Uwe >>>>>>>> >>>>>>>>> >>>>>>>>> --- >>>>>>>>> Diego Castro / The CloudFather >>>>>>>>> GetupCloud.com - Eliminamos a Gravidade >>>>>>>>> _______________________________________________ >>>>>>>>> ceph-users mailing list >>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list >>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>> >>>>> >>>>> >>> >>> > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com