I do… In my case, I have colocated the MONs with some OSDs, and just last Saturday, when I lost data again, I found out that one of the MON+OSD nodes ran out of memory and the OOM killer started killing ceph-mon on that node… At the same moment, all the OSDs started complaining that they could not see the OSDs on the other machines. I suspect that when a node runs out of memory, bad things happen to, for instance, the network (no memory, no network buffers?).

But I can't explain the unfound objects: in my case, same as yours, the nodes did not crash and ceph-osd did not crash either, so I'm assuming no data was lost to a sudden disk power-off, for instance, or to any kernel or RAID controller cache…

For now, I'm considering moving the MONs onto dedicated nodes, hoping the out-of-memory condition was my issue.
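For anyone hitting the same thing, a quick sketch of how to confirm OOM-killer activity from the kernel log (standard dmesg/journalctl greps, nothing Ceph-specific; grepping for ceph-mon is just what applied in my case):

    # Kernel ring buffer: did the OOM killer fire, and what did it kill?
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

    # On systemd hosts the same kernel messages are in the journal,
    # which survives reboots if persistent journaling is enabled
    journalctl -k | grep -iE 'oom|killed process'

    # Narrow it down to the Ceph daemons (ceph-mon in my case)
    journalctl -k | grep -i 'killed process' | grep ceph-mon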
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On behalf of Diego Castro

Hello, I have a cluster running Jewel 10.2.0, 25 OSDs + 4 MONs. Today my cluster suddenly went unhealthy, with lots of stuck PGs due to unfound objects; there were no disk failures and no node crashes, it just went bad. I managed to bring the cluster back to a healthy state by marking the lost objects for deletion with "ceph pg <id> mark_unfound_lost delete". Besides having no idea why the cluster went bad in the first place, I realized that restarting the OSD daemons to unlock stuck clients made the cluster unhealthy again, with PGs stuck once more due to unfound objects. Does anyone else have this issue?

---
Diego Castro / The CloudFather
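For reference, the usual diagnostic sequence before giving up on unfound objects, roughly as documented for Jewel (the PG id 2.4 below is only an example, and on newer releases list_missing was renamed to list_unfound):

    # Which PGs are unhealthy, and do any report unfound objects?
    ceph health detail | grep -E 'unfound|stuck'

    # For an affected PG (2.4 as an example): which objects are missing,
    # and which OSDs have been probed for them?
    ceph pg 2.4 list_missing
    ceph pg 2.4 query

    # Last resort, once every OSD that might hold the data has been probed:
    # 'revert' rolls an object back to a prior version where one exists,
    # 'delete' forgets it entirely
    ceph pg 2.4 mark_unfound_lost delete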
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com