> On 01.06.2016 at 10:25, Diego Castro <diego.castro@xxxxxxxxxxxxxx> wrote:
>
> Hello, I have a cluster running Jewel 10.2.0, 25 OSDs + 4 MONs.
> Today my cluster suddenly went unhealthy with lots of stuck PGs due to unfound objects. There were no disk failures or node crashes, it just went bad.
>
> I managed to bring the cluster back to a healthy state by marking the unfound objects as lost with "ceph pg <id> mark_unfound_lost delete".
> I still have no idea why the cluster went bad, but I noticed that restarting the OSD daemons to unblock stuck clients made the cluster unhealthy again, with PGs stuck once more due to unfound objects.
>
> Has anyone else seen this issue?

Hi,

I also ran into that problem after upgrading to Jewel. In my case I was able to somewhat correlate this behavior with setting the sortbitwise flag after the upgrade. When the flag is set, these unfound objects start popping up after some time. Restarting OSDs just makes it worse and/or makes the problems appear faster. When looking at the missing objects, I can see that sometimes even region or zone configuration objects for radosgw are reported missing, which I know are there because radosgw was using them just before.

After unsetting the sortbitwise flag, the PGs go back to normal, all previously unfound objects are found, and the cluster becomes healthy again.

Of course I’m not sure whether this is the real root of the problem or just a coincidence, but I can reproduce this behavior every time. So for now the cluster is running without this flag. :-/

Regards,

Uwe

>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
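
For reference, the check and workaround described in the reply could be sketched roughly as follows. This is only a sketch under assumptions: it presumes the Jewel-era ceph CLI, and that "ceph health detail" reports affected PGs on lines of the form "pg <id> is <state>, N unfound". The helper name unfound_pgs is hypothetical and not part of Ceph. The unset command is left commented out, since whether the flag is really the cause is unconfirmed.

```shell
#!/bin/sh
# Sketch only: assumes the Jewel-era "ceph" CLI and health-detail lines
# shaped like "pg 2.4 is active+degraded, 78 unfound".

# Hypothetical helper (not part of Ceph): read "ceph health detail"
# output on stdin and print the IDs of PGs that report unfound objects.
unfound_pgs() {
    awk '$1 == "pg" && $NF == "unfound" { print $2 }'
}

if command -v ceph >/dev/null 2>&1; then
    # Show the cluster-wide OSD flags; look for "sortbitwise" in the list.
    ceph osd dump | grep '^flags'

    # List the PGs that currently report unfound objects.
    ceph health detail | unfound_pgs

    # Workaround from the reply: clear the flag, then re-check health.
    # Uncomment only after reviewing the output above.
    # ceph osd unset sortbitwise
    # ceph health detail | unfound_pgs
fi
```

If the list of PGs empties out after the flag is cleared, that matches the behavior reported in the reply; it does not by itself prove the flag is the root cause.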