Re: Multiple OSDs down, and won't come up (possibly related to other Nautilus issues)

aoanla@xxxxxxxxx · Thu, 02 Apr 2020 13:32:38 -0000

So, the recovery stalled a few more OSDs in, but looking at the disks with OSDs marked down, I noticed that, despite systemctl reporting that the OSD processes were all *up*, several of them had not written to their logs since they rotated.

Suspecting that these OSDs were stalled, I've started logging into each OSD host and doing:

ls -lh /var/log/ceph/*.log

checking for logs with a size of 0, 

and then 

systemctl restart ceph-osd@xxx 

for all xxx with zero-sized logs.
(I've checked each of these first with
systemctl status ceph-osd@xxx
and they all report that the process is up...)
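
For anyone wanting to repeat the check above across many hosts, here is a minimal sketch of it as a shell helper. It assumes the default log naming of ceph-osd.<id>.log under /var/log/ceph. The function names and the DRY_RUN variable are made up for illustration and are not part of Ceph. By default it only prints what it would restart.

```shell
#!/bin/sh
# Sketch only: assumes logs are named ceph-osd.<id>.log (the Ceph default).
# stalled_osd_ids, restart_stalled_osds and DRY_RUN are hypothetical names.

# Print the id of every OSD whose current log file is empty.
stalled_osd_ids() {
    log_dir="${1:-/var/log/ceph}"
    for f in "$log_dir"/ceph-osd.*.log; do
        [ -f "$f" ] || continue   # glob matched nothing, or not a regular file
        [ -s "$f" ] && continue   # log has content, so the OSD is still writing
        id="${f##*/ceph-osd.}"    # strip directory and "ceph-osd." prefix
        id="${id%.log}"           # strip ".log" suffix
        printf '%s\n' "$id"
    done
}

# Restart each OSD with an empty log. With DRY_RUN=1 (the default here),
# only report what would be restarted.
restart_stalled_osds() {
    for id in $(stalled_osd_ids "$1"); do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would restart ceph-osd@$id"
        else
            systemctl restart "ceph-osd@$id"
        fi
    done
}
```

Calling restart_stalled_osds with no arguments lists the candidates on the local host, and DRY_RUN=0 restart_stalled_osds actually restarts them. An empty log only suggests a stall: an OSD that has simply been quiet since rotation would match too, so it is worth eyeballing the list first.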

This seems to be helping recovery dramatically...

but if I look in the logs for each of the "frozen" OSDs before I restart them (in the rotated log, obviously), there's no sign of why they actually stall: there's a lot of complaining about how they can't talk to other OSDs, as in previous emails in this thread, and then suddenly nothing.

It would be lovely if anyone had thoughts on what's happening here.