Update a day later: the cluster appears to be recovering, but *very slowly*. We're now at 113 OSDs down (improved from 140 OSDs down when everything broke), but it took a day before anything changed, and we now seem to be recovering at a rate of about 1-2 OSDs per hour...

So that I'm not just talking to myself here: it would be great if anyone could offer suggestions about what might have happened, or how I can stop this from happening again. It's not sustainable for a Ceph cluster to take several days to recover from half its OSDs just "deciding to all go down".

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx