I do… In my case, I have colocated the MONs with some OSDs, and just last Saturday, when I lost data again, I found out that one of the MON+OSD nodes ran out of memory and the OOM killer started killing ceph-mon on that node… At the same moment, all the OSDs started complaining that they could not see the OSDs on the other machines. I suspect that when a node runs out of memory, bad things happen to, for instance, the network (no memory, no network buffers?).

But I can't explain the unfound objects: in my case, same as yours, the nodes did not crash and ceph-osd did not crash either, so I'm assuming no data was lost to a sudden disk power-off, for instance, or to any kernel or RAID controller cache…

For now, I'm considering moving the MONs onto dedicated nodes, hoping the out-of-memory condition was my issue.
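For anyone hitting the same thing, a quick sketch of how to confirm OOM-killer activity from the kernel log (standard dmesg/journalctl greps, nothing Ceph-specific; grepping for ceph-mon is just what applied in my case):

    # Kernel ring buffer: did the OOM killer fire, and what did it kill?
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

    # On systemd hosts the same kernel messages are in the journal,
    # which survives reboots if persistent journaling is enabled
    journalctl -k | grep -iE 'oom|killed process'

    # Narrow it down to the Ceph daemons (ceph-mon in my case)
    journalctl -k | grep -i 'killed process' | grep ceph-mon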
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On behalf of Diego Castro

Hello, I have a cluster running Jewel 10.2.0, 25 OSDs + 4 MONs. Today my cluster suddenly went unhealthy, with lots of stuck PGs due to unfound objects; there were no disk failures and no node crashes, it just went bad. I managed to bring the cluster back to a healthy state by marking the lost objects for deletion with "ceph pg <id> mark_unfound_lost delete". Besides having no idea why the cluster went bad in the first place, I realized that restarting the OSD daemons to unlock stuck clients made the cluster unhealthy again, with PGs stuck once more due to unfound objects. Does anyone else have this issue?

---
Diego Castro / The CloudFather
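For reference, the usual diagnostic sequence before giving up on unfound objects, roughly as documented for Jewel (the PG id 2.4 below is only an example, and on newer releases list_missing was renamed to list_unfound):

    # Which PGs are unhealthy, and do any report unfound objects?
    ceph health detail | grep -E 'unfound|stuck'

    # For an affected PG (2.4 as an example): which objects are missing,
    # and which OSDs have been probed for them?
    ceph pg 2.4 list_missing
    ceph pg 2.4 query

    # Last resort, once every OSD that might hold the data has been probed:
    # 'revert' rolls an object back to a prior version where one exists,
    # 'delete' forgets it entirely
    ceph pg 2.4 mark_unfound_lost delete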
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com