Yep, looks like the same issue:

2016-06-02 00:45:27.977064 7fc11b4e9700 10 osd.17 pg_epoch: 11108 pg[34.4a( v 11104'1080336 lc 11104'1080335 (11069'1077294,11104'1080336] local-les=11108 n=50593 ec=2051 les/c/f 11104/11104/0 11106/11107/11107) [17,13] r=0 lpr=11107 pi=11101-11106/3 crt=11104'1080336 lcod 0'0 mlcod 0'0 inactive m=1 u=1] search_for_missing 34:52a5cefb:::default.3653921.2__shadow_.69E1Tth4Y2Q7m0VKNbQdJe-9BgYks6I_1:head 11104'1080336 also missing on osd.13 (last_backfill MAX but with wrong sort order)

Thanks!
-Sam

On Wed, Jun 1, 2016 at 4:04 PM, Uwe Mesecke <uwe@xxxxxxxxxxx> wrote:
> Hey Sam,
>
> Glad you found the bug. As another data point, I just did the whole round of
> "healthy -> set sortbitwise -> osd restarts -> unfound objects -> unset
> sortbitwise -> healthy" with the debug settings you described earlier.
>
> I uploaded the logfiles:
>
> https://www.dropbox.com/s/f5hhptbtocbxe1k/ceph-osd.13.log.gz
> https://www.dropbox.com/s/kau9cjqfhmtpd89/ceph-osd.17.log.gz
>
> The PG with the unfound object is "34.4a", and it shows log messages similar
> to the ones you noted in the issue.
>
> The cluster runs jewel 10.2.1 and was created a long time ago; I think it
> started on giant.
>
> Thanks again!
>
> Uwe
>
>> On 02.06.2016 at 00:19, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>
>> http://tracker.ceph.com/issues/16113
>>
>> I think I found the bug. Thanks for the report! Turning off
>> sortbitwise should be an OK workaround for the moment.
>> -Sam
>>
>> On Wed, Jun 1, 2016 at 3:00 PM, Diego Castro
>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>> Yes, it was created as Hammer.
>>> I haven't faced any issues during the upgrade (apart from the well-known
>>> systemd one), and after that the cluster didn't show any suspicious
>>> behavior.
>>>
>>> ---
>>> Diego Castro / The CloudFather
>>> GetupCloud.com - Eliminamos a Gravidade
>>>
>>> 2016-06-01 18:57 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>>>>
>>>> Was this cluster upgraded to jewel? If so, at what version did it start?
>>>> -Sam
>>>>
>>>> On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro
>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>> Hello Samuel, I'm a bit afraid of restarting my OSDs again, so I'll
>>>>> wait until the weekend to push the config.
>>>>> BTW, I just unset the sortbitwise flag.
>>>>>
>>>>> ---
>>>>> Diego Castro / The CloudFather
>>>>> GetupCloud.com - Eliminamos a Gravidade
>>>>>
>>>>> 2016-06-01 13:39 GMT-03:00 Samuel Just <sjust@xxxxxxxxxx>:
>>>>>>
>>>>>> Can either of you reproduce this with logs? That would make it a lot
>>>>>> easier to track down if it's a bug. I'd want
>>>>>>
>>>>>> debug osd = 20
>>>>>> debug ms = 1
>>>>>> debug filestore = 20
>>>>>>
>>>>>> on all of the osds for a particular pg, from when it is clean until it
>>>>>> develops an unfound object.
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
>>>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>>>> Hello Uwe, I also have the sortbitwise flag enabled, and I see
>>>>>>> exactly the same behavior as you.
>>>>>>> Perhaps this is also the root of my issues; does anybody know if it
>>>>>>> is safe to disable it?
>>>>>>>
>>>>>>> ---
>>>>>>> Diego Castro / The CloudFather
>>>>>>> GetupCloud.com - Eliminamos a Gravidade
>>>>>>>
>>>>>>> 2016-06-01 7:17 GMT-03:00 Uwe Mesecke <uwe@xxxxxxxxxxx>:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 01.06.2016 at 10:25, Diego Castro
>>>>>>>>> <diego.castro@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hello, I have a cluster running Jewel 10.2.0, 25 OSDs + 4 mons.
>>>>>>>>> Today my cluster suddenly went unhealthy, with lots of stuck PGs
>>>>>>>>> due to unfound objects; there were no disk failures or node
>>>>>>>>> crashes, it just went bad.
>>>>>>>>>
>>>>>>>>> I managed to bring the cluster back to a healthy state by marking
>>>>>>>>> the lost objects for deletion with
>>>>>>>>> "ceph pg <id> mark_unfound_lost delete".
>>>>>>>>> Aside from having no idea why the cluster went bad in the first
>>>>>>>>> place, I realized that restarting the osd daemons to unlock stuck
>>>>>>>>> clients made the cluster unhealthy again, with PGs stuck once more
>>>>>>>>> due to unfound objects.
>>>>>>>>>
>>>>>>>>> Has anyone else seen this issue?
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I also ran into that problem after upgrading to jewel. In my case I
>>>>>>>> was able to somewhat correlate this behavior with setting the
>>>>>>>> sortbitwise flag after the upgrade. When the flag is set, these
>>>>>>>> unfound objects start popping up after some time. Restarting osds
>>>>>>>> just makes it worse and/or makes these problems appear faster. When
>>>>>>>> looking at the missing objects I can see that sometimes even region
>>>>>>>> or zone configuration objects for radosgw are missing, which I know
>>>>>>>> are there because radosgw was using them just before.
>>>>>>>>
>>>>>>>> After unsetting the sortbitwise flag, the PGs go back to normal, all
>>>>>>>> previously unfound objects are found, and the cluster becomes
>>>>>>>> healthy again.
>>>>>>>>
>>>>>>>> Of course I'm not sure whether this is the real root of the problem
>>>>>>>> or just a coincidence, but I can reproduce this behavior every time.
>>>>>>>>
>>>>>>>> So for now the cluster is running without this flag.
:-/ >>>>>>>> >>>>>>>> Regards, >>>>>>>> Uwe >>>>>>>> >>>>>>>>> >>>>>>>>> --- >>>>>>>>> Diego Castro / The CloudFather >>>>>>>>> GetupCloud.com - Eliminamos a Gravidade >>>>>>>>> _______________________________________________ >>>>>>>>> ceph-users mailing list >>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list >>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>> >>>>> >>>>> >>> >>> > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com