On Sat, Aug 4, 2012 at 3:37 AM, Vladimir Bashkirtsev <vladimir@xxxxxxxxxxxxxxx> wrote:
> Hello,
>
> Yesterday I finally managed to screw up my installation of ceph! :)
>
> My ceph was at 80% capacity. I rebooted one of the OSDs remotely and
> managed to screw up its fstab. The host failed to come up, and while I was
> driving from home to my office ceph took recovery action. But that meant
> it filled up another OSD completely, and that OSD failed. Ceph continued
> to recover and killed other OSDs in the same fashion. Not quite good.
> Attempts to restart the OSDs were in vain: they were unable to test for
> xattrs because the file system was full, and only growing the file system
> allowed them to restart.
>
> Now this leads me to a question/proposal: is there a feature which allows
> ceph to halt the recovery process if any live OSD exceeds, say, 95%
> capacity? This is quite distinct from what is considered a full or
> near-full OSD, because writes to a full or near-full OSD come from clients,
> and the inability to write leads to client lock-up. But halting recovery
> should allow clients to continue even though ceph is in a degraded state.
> It does not make sense to me to allow ceph to go from a degraded state to
> a crashed state when no client needs it.

There is not yet any such feature, no. Dealing with full systems is
notoriously hard and we haven't come up with a great solution yet. One
thing you can do is experiment with the "mon_osd_min_in_ratio" parameter,
which prevents the monitors from marking out more than a certain
percentage of the OSD cluster (and without something being marked out, no
data will be moved around). If you don't want the cluster to automatically
mark any OSDs out, you can also set "mon_osd_down_out_interval" to zero.
-Greg
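
To illustrate how the two monitor settings mentioned above interact, here is a minimal Python sketch. It is a simplified model under stated assumptions, not Ceph's actual monitor code: the function name `may_mark_out`, its arguments, and the exact comparison are hypothetical. It only shows the idea that an interval of zero disables automatic mark-out, and that otherwise a floor on the fraction of "in" OSDs limits how many can be marked out.

```python
def may_mark_out(total_osds: int, in_osds: int,
                 min_in_ratio: float, down_out_interval: float) -> bool:
    """Hypothetical model: may the monitors mark one more OSD out?

    min_in_ratio stands in for mon_osd_min_in_ratio and
    down_out_interval for mon_osd_down_out_interval (seconds).
    """
    # Assumption from the thread: an interval of zero means OSDs are
    # never marked out automatically.
    if down_out_interval == 0:
        return False
    if total_osds <= 0 or in_osds <= 0:
        return False
    # Simplified rule: refuse if marking one more OSD out would push the
    # fraction of "in" OSDs below the configured floor.
    return (in_osds - 1) / total_osds >= min_in_ratio


if __name__ == "__main__":
    # 10 OSDs, all in, floor of 75%: one more may be marked out.
    print(may_mark_out(10, 10, 0.75, 300))
    # Only 8 of 10 still in: marking another out would drop below 75%.
    print(may_mark_out(10, 8, 0.75, 300))
    # Interval of zero: automatic mark-out is disabled.
    print(may_mark_out(10, 10, 0.75, 0))
```

In this model, no OSD being marked out means no data is moved, so the remaining OSDs are not filled by recovery traffic; the cluster simply stays degraded until an operator intervenes.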