So, the recovery stalled a few more OSDs in. Looking at the disks with OSDs marked down, I noticed that although systemctl reported all of the OSD processes as *up*, several of them had not written anything to their logs since the logs were rotated.

Suspecting that these OSDs had stalled, I've started logging into each OSD host and running:

    ls -lh /var/log/ceph/*.log

to look for logs with a size of 0, and then:

    systemctl restart ceph-osd@xxx

for every xxx with a zero-sized log. (I checked each of these first with systemctl status ceph-osd@xxx, and they all report that the process is up...)

This seems to be helping recovery dramatically... but if I look in the logs of each "frozen" OSD before I restart it [in the rotated log, obviously], there's no sign of why the OSD actually freezes: there's a lot of complaining about not being able to talk to other OSDs, as in previous emails in this thread, and then suddenly nothing.

It would be lovely if anyone could comment on what might be happening here.
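
For anyone wanting to repeat the check without eyeballing ls output on every host, the manual procedure described above could be scripted roughly like this. It is only a sketch: the function name stalled_osd_ids and the LOG_DIR variable are made up for illustration, and it assumes the logs follow the default ceph-osd.<id>.log naming under /var/log/ceph. It only prints the restart commands rather than running them, since an empty log is a hint that an OSD has stalled, not proof.

```shell
#!/usr/bin/env bash
# Sketch: list OSD ids whose current log file is empty.
# Assumption: logs are named ceph-osd.<id>.log (default naming); adjust if yours differ.

stalled_osd_ids() {
    local log_dir="${1:-/var/log/ceph}"
    local f id
    for f in "$log_dir"/ceph-osd.*.log; do
        # Skip the unexpanded glob (no matches) and anything that is not a regular file.
        [ -f "$f" ] || continue
        # Skip logs that have content; only zero-sized logs are of interest.
        [ -s "$f" ] && continue
        id="${f##*/ceph-osd.}"   # strip the directory and the "ceph-osd." prefix
        id="${id%.log}"          # strip the ".log" suffix
        echo "$id"
    done
}

# Dry run: print the restart commands instead of executing them.
for id in $(stalled_osd_ids "${LOG_DIR:-/var/log/ceph}"); do
    echo "systemctl restart ceph-osd@${id}"
done
```

Review the printed list against systemctl status and ceph osd tree before running any of the restarts by hand.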