So I'm in the middle of trying to triage a problem with my ceph cluster running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster had been running happily for about a year. This last weekend, something caused the box running the MDS to seize up hard, and when we came in on Monday, several OSDs were down or unresponsive. I brought the MDS and the OSDs back online and managed to get things running again with minimal data loss. I had to mark a few objects as lost, but things were apparently running fine at the end of the day on Monday.
This afternoon, I noticed that one of the OSDs was stuck in a crash/restart loop, and the cluster was unhappy. Performance was in the tank, and "ceph status" was reporting all manner of problems, as one would expect when an OSD is misbehaving. I marked the offending OSD out, and the cluster started rebalancing as expected. A short while later, though, I noticed that another OSD had gone into a crash/restart loop, so I repeated the process. Then it happened again. At this point I noticed that there are actually two OSDs at a time in this state.
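For reference, the sequence I've been running each time is roughly the following (osd.12 is just a placeholder for whichever OSD is looping at that moment):

    # see which OSDs the cluster thinks are down
    ceph status
    ceph osd tree | grep down

    # mark the crash-looping OSD out so data rebalances away from it
    ceph osd out 12

    # watch the recovery/rebalance progress
    ceph -w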
It's as if there's some toxic chunk of data being passed around that kills whichever OSD it lands on. That theory doesn't quite hold, though: I also tried simply stopping an OSD while it was in a bad state, and once the cluster began rebalancing with that OSD down (but not previously marked out), another OSD started crash-looping anyway.
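To be clear, by "stopping" I mean shutting down the daemon without marking it out first, along these lines (init-system dependent; the OSD id is again a placeholder):

    # sysvinit-style
    service ceph stop osd.12
    # or, on upstart-based nodes
    stop ceph-osd id=12

The monitors will eventually mark a down OSD out on their own once "mon osd down out interval" expires, which I assume is why rebalancing kicked in anyway.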
I've investigated the disk of the first OSD I found with this problem, and there is no apparent corruption on its file system.
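For what it's worth, that check was just a read-only pass over the backing file system, something like the following (assuming an XFS backing store; the device and mount point are placeholders):

    # stop the OSD and unmount its data directory first
    umount /var/lib/ceph/osd/ceph-0

    # -n runs xfs_repair in no-modify mode, so it only reports problems
    xfs_repair -n /dev/sdX1

It reported nothing of note.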
This is turning into a real cascading failure, and I have no idea how to stop it. Any input would be appreciated. I'll follow up shortly with links to pastes of log snippets.
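The snippets will come from /var/log/ceph/ceph-osd.*.log on the affected hosts. If more verbose output would help, I can turn up logging on one of the looping OSDs, e.g.:

    # inject higher debug levels into a running OSD (lost when it restarts)
    ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 1'

or set "debug osd = 20" under that OSD's section in ceph.conf so it survives the restart loop.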
QH
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com