Can either of you reproduce with logs? That would make it a lot easier
to track down if it's a bug. I'd want

  debug osd = 20
  debug ms = 1
  debug filestore = 20

on all of the OSDs for a particular PG from when it is clean until it
develops an unfound object.
-Sam

On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro <diego.castro@xxxxxxxxxxxxxx> wrote:
> Hello Uwe, I also have the sortbitwise flag enabled and I see exactly the
> same behavior as you.
> Perhaps this is also the root of my issues. Does anybody know whether it is
> safe to disable it?
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-06-01 7:17 GMT-03:00 Uwe Mesecke <uwe@xxxxxxxxxxx>:
>>
>> > Am 01.06.2016 um 10:25 schrieb Diego Castro <diego.castro@xxxxxxxxxxxxxx>:
>> >
>> > Hello, I have a cluster running Jewel 10.2.0, 25 OSDs + 4 mons.
>> > Today my cluster suddenly went unhealthy with lots of stuck PGs due to
>> > unfound objects; there were no disk failures and no node crashes, it
>> > just went bad.
>> >
>> > I managed to bring the cluster back to a healthy state by marking the
>> > lost objects for deletion with "ceph pg <id> mark_unfound_lost delete".
>> > Apart from having no idea why the cluster went bad in the first place,
>> > I noticed that restarting the OSD daemons to unblock stuck clients made
>> > the cluster unhealthy again, with PGs stuck once more due to unfound
>> > objects.
>> >
>> > Does anyone else have this issue?
>>
>> Hi,
>>
>> I also ran into that problem after upgrading to Jewel. In my case I was
>> able to somewhat correlate this behavior with setting the sortbitwise
>> flag after the upgrade. When the flag is set, these unfound objects start
>> popping up after some time. Restarting OSDs just makes it worse and/or
>> makes the problems appear faster. When looking at the missing objects I
>> can see that sometimes even region or zone configuration objects for
>> radosgw are missing, which I know are there because radosgw was using
>> them just before.
>>
>> After unsetting the sortbitwise flag, the PGs go back to normal, all
>> previously unfound objects are found, and the cluster becomes healthy
>> again.
>>
>> Of course I'm not sure whether this is the real root of the problem or
>> just a coincidence, but I can reproduce this behavior every time.
>>
>> So for now the cluster is running without this flag. :-/
>>
>> Regards,
>> Uwe
>>
>> >
>> > ---
>> > Diego Castro / The CloudFather
>> > GetupCloud.com - Eliminamos a Gravidade
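
For anyone gathering the logs Sam asks for above, a minimal sketch of how
those debug levels could be applied, assuming a standard ceph.conf [osd]
section and the generic osd.* injectargs target (adjust the target to the
specific OSDs that host the affected PG):

  [osd]
      debug osd = 20
      debug ms = 1
      debug filestore = 20

  # or injected at runtime, without restarting the daemons:
  ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'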
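
A sketch of how the unfound objects described in the thread can be inspected
before resorting to mark_unfound_lost; <pgid> is a placeholder for the
affected placement group:

  ceph health detail                       # lists PGs reporting unfound objects
  ceph pg <pgid> list_missing              # shows which objects are unfound
  ceph pg <pgid> query                     # peering/recovery state for that PG
  ceph pg <pgid> mark_unfound_lost delete  # last resort, as used by Diego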
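
And a sketch of the sortbitwise handling Uwe describes; these are only the
flag commands themselves, the thread leaves open whether unsetting the flag
is actually safe on Jewel:

  ceph osd dump | grep flags   # check whether sortbitwise is currently set
  ceph osd unset sortbitwise   # what Uwe did to recover
  ceph osd set sortbitwise     # re-enable the flag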