Re: OSD Restart results in "unfound objects"

Uwe Mesecke <uwe@xxxxxxxxxxx> · Thu, 2 Jun 2016 01:04:05 +0200

Hey Sam,

Glad you found the bug. As another data point, I just did the whole round of "healthy -> set sortbitwise -> osd restarts -> unfound objects -> unset sortbitwise -> healthy" with the debug settings you described earlier.
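
For anyone who wants to repeat that round, it can be sketched roughly as below. This is only an illustration under assumptions, not a record of the exact commands that were run: the OSD ids 13 and 17 are taken from the log file names, the systemd unit name ceph-osd@<id> is assumed for this deployment, and the helpers sortbitwise_cycle and run are hypothetical. By default nothing is executed; the commands are only printed.

```python
import subprocess


def sortbitwise_cycle(osd_ids):
    """Return the command sequence for one reproduction round.

    Hypothetical helper: the OSD ids and the systemd unit name
    ceph-osd@<id> are assumptions about the deployment.
    """
    cmds = [
        ["ceph", "health", "detail"],           # start from a healthy cluster
        ["ceph", "osd", "set", "sortbitwise"],  # enable the flag
    ]
    # Restart the OSDs one after another.
    cmds += [["systemctl", "restart", "ceph-osd@%d" % i] for i in osd_ids]
    cmds += [
        ["ceph", "health", "detail"],             # look for unfound objects
        ["ceph", "osd", "unset", "sortbitwise"],  # the workaround
        ["ceph", "health", "detail"],             # check whether it recovers
    ]
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)


if __name__ == "__main__":
    run(sortbitwise_cycle([13, 17]))
```

Keeping the dry run as the default is deliberate, since restarting OSDs is exactly what triggers the problem.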

I uploaded the logfiles...

https://www.dropbox.com/s/f5hhptbtocbxe1k/ceph-osd.13.log.gz
https://www.dropbox.com/s/kau9cjqfhmtpd89/ceph-osd.17.log.gz

The PG with the unfound object is "34.4a", and it seems there are log messages similar to the ones you noted in the issue.
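
In case it helps others collect the same kind of logs, here is a minimal sketch of raising the requested debug levels at runtime on the OSDs of one PG. The levels (osd 20, ms 1, filestore 20) are the ones asked for in this thread; that OSDs 13 and 17 are the ones serving PG 34.4a is an assumption based on the log file names, and the helper debug_commands is hypothetical. The commands are only printed, not executed.

```python
import shlex

# Debug levels requested earlier in this thread.
DEBUG_LEVELS = [("osd", 20), ("ms", 1), ("filestore", 20)]


def debug_commands(pg_id, osd_ids, levels=DEBUG_LEVELS):
    """Build ceph CLI calls that raise the debug levels at runtime.

    Hypothetical helper; osd_ids should be the OSDs serving pg_id,
    which the output of `ceph pg map <pgid>` shows.
    """
    args = " ".join("--debug-%s %d" % (name, lvl) for name, lvl in levels)
    cmds = [["ceph", "pg", "map", pg_id]]  # shows the up/acting OSDs of the PG
    cmds += [["ceph", "tell", "osd.%d" % i, "injectargs", args] for i in osd_ids]
    return cmds


if __name__ == "__main__":
    for cmd in debug_commands("34.4a", [13, 17]):
        # shlex.quote keeps the injectargs string as one shell argument
        print(" ".join(shlex.quote(part) for part in cmd))
```

Logging at these levels is very verbose, so it is worth lowering them again once the unfound object has shown up.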

The cluster runs jewel 10.2.1 and was created a long time ago; I think it was on giant.

Thanks again!

Uwe

> On 02.06.2016 at 00:19, Samuel Just <sjust@xxxxxxxxxx> wrote:
> 
> http://tracker.ceph.com/issues/16113
> 
> I think I found the bug.  Thanks for the report!  Turning off
> sortbitwise should be an ok workaround for the moment.
> -Sam
> 
> On Wed, Jun 1, 2016 at 3:00 PM, Diego Castro
> <diego.castro@xxxxxxxxxxxxxx> wrote:
>> Yes, it was created as Hammer.
>> I haven't faced any issues with the upgrade (apart from the well-known
>> systemd ones), and after that the cluster didn't show any suspicious behavior.
>> 
>> 
>> ---
>> Diego Castro / The CloudFather
>> GetupCloud.com - Eliminamos a Gravidade
>> 
>> 2016-06-01 18:57 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>>> 
>>> Was this cluster upgraded to jewel?  If so, at what version did it start?
>>> -Sam
>>> 
>>> On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro
>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>> Hello Samuel, I'm a bit afraid of restarting my OSDs again, so I'll wait
>>>> until the weekend to push the config.
>>>> BTW, I just unset the sortbitwise flag.
>>>> 
>>>> 
>>>> ---
>>>> Diego Castro / The CloudFather
>>>> GetupCloud.com - Eliminamos a Gravidade
>>>> 
>>>> 2016-06-01 13:39 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>>>>> 
>>>>> Can either of you reproduce with logs?  That would make it a lot
>>>>> easier to track down if it's a bug.  I'd want
>>>>> 
>>>>> debug osd = 20
>>>>> debug ms = 1
>>>>> debug filestore = 20
>>>>> 
>>>>> On all of the osds for a particular pg from when it is clean until it
>>>>> develops an unfound object.
>>>>> -Sam
>>>>> 
>>>>> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
>>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>>> Hello Uwe, I also have the sortbitwise flag enabled and I see exactly
>>>>>> the same behavior as you.
>>>>>> Perhaps this is also the root of my issues. Does anybody know if it is
>>>>>> safe to disable it?
>>>>>> 
>>>>>> 
>>>>>> ---
>>>>>> Diego Castro / The CloudFather
>>>>>> GetupCloud.com - Eliminamos a Gravidade
>>>>>> 
>>>>>> 2016-06-01 7:17 GMT-03:00 Uwe Mesecke <uwe@xxxxxxxxxxx>:
>>>>>>> 
>>>>>>> 
>>>>>>>> On 01.06.2016 at 10:25, Diego Castro
>>>>>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> Hello, I have a cluster running Jewel 10.2.0, 25 OSDs + 4 mons.
>>>>>>>> Today my cluster suddenly went unhealthy with lots of stuck PGs
>>>>>>>> due to unfound objects. There were no disk failures or node crashes,
>>>>>>>> it just went bad.
>>>>>>>>
>>>>>>>> I managed to get the cluster back to a healthy state by marking the
>>>>>>>> lost objects for deletion with "ceph pg <id> mark_unfound_lost delete".
>>>>>>>> I still have no idea why the cluster went bad, but I noticed that
>>>>>>>> restarting the OSD daemons to unblock stuck clients made the cluster
>>>>>>>> unhealthy again, with PGs stuck once more due to unfound objects.
>>>>>>>>
>>>>>>>> Has anyone else had this issue?
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I also ran into that problem after upgrading to jewel. In my case I was
>>>>>>> able to somewhat correlate this behavior with setting the sortbitwise
>>>>>>> flag after the upgrade. When the flag is set, after some time these
>>>>>>> unfound objects are popping up. Restarting osds just makes it worse
>>>>>>> and/or makes these problems appear faster. When looking at the missing
>>>>>>> objects I can see that sometimes even region or zone configuration
>>>>>>> objects for radosgw are missing which I know are there because the
>>>>>>> radosgw was using these just before.
>>>>>>> 
>>>>>>> After unsetting the sortbitwise flag, the PGs go back to normal, all
>>>>>>> previously unfound objects are found and the cluster becomes healthy
>>>>>>> again.
>>>>>>> 
>>>>>>> Of course I’m not sure whether this is the real root of the problem or
>>>>>>> just a coincidence but I can reproduce this behavior every time.
>>>>>>> 
>>>>>>> So for now the cluster is running without this flag. :-/
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Uwe
>>>>>>> 
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> Diego Castro / The CloudFather
>>>>>>>> GetupCloud.com - Eliminamos a Gravidade
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
