Re: OSD Restart results in "unfound objects"

http://tracker.ceph.com/issues/16113

I think I found the bug.  Thanks for the report!  Turning off
sortbitwise should be an ok workaround for the moment.
-Sam
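
A minimal sketch of the workaround above, for anyone who wants to script it. It assumes the `ceph` CLI is on the PATH with admin credentials, and that the `flags` field of `ceph osd dump --format json` is a comma-separated string (as it appears on Jewel). The helper names are hypothetical and not part of Ceph.

```python
import json
import subprocess


def flag_is_set(osd_dump_json, flag="sortbitwise"):
    """Return True if `flag` is listed in the OSD map flags."""
    if isinstance(osd_dump_json, bytes):
        osd_dump_json = osd_dump_json.decode("utf-8")
    dump = json.loads(osd_dump_json)
    # Assumption: "flags" is a comma-separated string, e.g. "sortbitwise,noout".
    flags = [f.strip() for f in dump.get("flags", "").split(",")]
    return flag in flags


def unset_flag_command(flag="sortbitwise"):
    """Build the CLI call that clears a cluster-wide OSD flag."""
    return ["ceph", "osd", "unset", flag]


def apply_workaround(run=subprocess.check_output, flag="sortbitwise"):
    """Unset `flag` if it is currently set; return True if a change was made."""
    dump = run(["ceph", "osd", "dump", "--format", "json"])
    if not flag_is_set(dump, flag):
        return False
    run(unset_flag_command(flag))
    return True


if __name__ == "__main__":
    changed = apply_workaround()
    print("sortbitwise unset" if changed else "sortbitwise was not set")
```

The `run` parameter exists only so the logic can be exercised without a live cluster. The flag can be restored later with `ceph osd set sortbitwise`.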

On Wed, Jun 1, 2016 at 3:00 PM, Diego Castro
<diego.castro@xxxxxxxxxxxxxx> wrote:
> Yes, it was created on Hammer.
> I didn't face any issues during the upgrade (apart from the well-known
> systemd one), and after that the cluster didn't show any suspicious behavior.
>
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-06-01 18:57 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>>
>> Was this cluster upgraded to jewel?  If so, at what version did it start?
>> -Sam
>>
>> On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro
>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>> > Hello Samuel, I'm a bit afraid of restarting my OSDs again, so I'll wait
>> > until the weekend to push the config.
>> > BTW, I just unset the sortbitwise flag.
>> >
>> >
>> > ---
>> > Diego Castro / The CloudFather
>> > GetupCloud.com - Eliminamos a Gravidade
>> >
>> > 2016-06-01 13:39 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>> >>
>> >> Can either of you reproduce with logs?  That would make it a lot
>> >> easier to track down if it's a bug.  I'd want
>> >>
>> >> debug osd = 20
>> >> debug ms = 1
>> >> debug filestore = 20
>> >>
>> >> On all of the osds for a particular pg from when it is clean until it
>> >> develops an unfound object.
>> >> -Sam
>> >>
>> >> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
>> >> <diego.castro@xxxxxxxxxxxxxx> wrote:
>> >> > Hello Uwe, I also have the sortbitwise flag enabled and I see exactly
>> >> > the same behavior as you.
>> >> > Perhaps this is also the root of my issues. Does anybody know if it is
>> >> > safe to disable it?
>> >> >
>> >> >
>> >> > ---
>> >> > Diego Castro / The CloudFather
>> >> > GetupCloud.com - Eliminamos a Gravidade
>> >> >
>> >> > 2016-06-01 7:17 GMT-03:00 Uwe Mesecke <uwe@xxxxxxxxxxx>:
>> >> >>
>> >> >>
>> >> >> > On 01.06.2016 at 10:25, Diego Castro
>> >> >> > <diego.castro@xxxxxxxxxxxxxx> wrote:
>> >> >> >
>> >> >> > Hello, I have a cluster running Jewel 10.2.0, with 25 OSDs and 4 MONs.
>> >> >> > Today my cluster suddenly became unhealthy, with lots of stuck PGs
>> >> >> > due to unfound objects. There were no disk failures or node crashes;
>> >> >> > it just went bad.
>> >> >> >
>> >> >> > I managed to bring the cluster back to a healthy state by marking the
>> >> >> > lost objects for deletion with "ceph pg <id> mark_unfound_lost delete".
>> >> >> > I still have no idea why the cluster went bad, but I noticed that
>> >> >> > restarting the OSD daemons to unlock stuck clients made the cluster
>> >> >> > unhealthy again, with PGs stuck once more due to unfound objects.
>> >> >> >
>> >> >> > Has anyone else run into this issue?
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I also ran into that problem after upgrading to jewel. In my case I
>> >> >> was able to somewhat correlate this behavior with setting the
>> >> >> sortbitwise flag after the upgrade. When the flag is set, after some
>> >> >> time these unfound objects are popping up. Restarting osds just makes
>> >> >> it worse and/or makes these problems appear faster. When looking at
>> >> >> the missing objects I can see that sometimes even region or zone
>> >> >> configuration objects for radosgw are missing which I know are there
>> >> >> because the radosgw was using these just before.
>> >> >>
>> >> >> After unsetting the sortbitwise flag, the PGs go back to normal, all
>> >> >> previously unfound objects are found and the cluster becomes healthy
>> >> >> again.
>> >> >>
>> >> >> Of course I’m not sure whether this is the real root of the problem
>> >> >> or just a coincidence but I can reproduce this behavior every time.
>> >> >>
>> >> >> So for now the cluster is running without this flag. :-/
>> >> >>
>> >> >> Regards,
>> >> >> Uwe
>> >> >>
>> >> >> >
>> >> >> > ---
>> >> >> > Diego Castro / The CloudFather
>> >> >> > GetupCloud.com - Eliminamos a Gravidade
>> >> >> > _______________________________________________
>> >> >> > ceph-users mailing list
>> >> >> > ceph-users@xxxxxxxxxxxxxx
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> >
>> >> >
>> >> >
>> >
>> >
>
>
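
For the log capture Sam asks for earlier in the thread (debug osd = 20, debug ms = 1, debug filestore = 20 on every OSD of one PG), something like the following might help. It is only a sketch: it assumes `ceph pg map <pgid> --format json` reports `up` and `acting` lists of OSD ids and that the levels are raised at runtime with `ceph tell osd.N injectargs`. The function names are made up for illustration.

```python
import json
import subprocess
import sys

# Debug levels requested in the thread.
DEBUG_ARGS = "--debug-osd 20 --debug-ms 1 --debug-filestore 20"


def osds_for_pg(pg_map_json):
    """Return the sorted OSD ids found in the up or acting set of a PG."""
    if isinstance(pg_map_json, bytes):
        pg_map_json = pg_map_json.decode("utf-8")
    pg_map = json.loads(pg_map_json)
    # Assumption: the JSON carries "up" and "acting" lists of OSD ids.
    return sorted(set(pg_map.get("up", [])) | set(pg_map.get("acting", [])))


def debug_commands(osd_ids, args=DEBUG_ARGS):
    """Build one `ceph tell osd.N injectargs` call per OSD."""
    return [["ceph", "tell", "osd.%d" % osd, "injectargs", args]
            for osd in osd_ids]


def enable_debug_for_pg(pgid, run=subprocess.check_output):
    """Raise debug levels on every OSD of `pgid`; return the commands run."""
    pg_map = run(["ceph", "pg", "map", pgid, "--format", "json"])
    commands = debug_commands(osds_for_pg(pg_map))
    for cmd in commands:
        run(cmd)
    return commands


if __name__ == "__main__":
    # Usage: python this_script.py <pgid>
    for cmd in enable_debug_for_pg(sys.argv[1]):
        print(" ".join(cmd))
```

Values set through injectargs do not survive an OSD restart. Since the unfound objects show up around restarts, putting the three debug lines in the [osd] section of ceph.conf is probably the safer way to keep logging on across the whole window.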



