Ceph OSDs are down and cannot be started

Fredy Neeser <nfd@xxxxxxxxxxxxxx> · Tue, 7 Jul 2015 18:15:23 +0200

Hi,

I had a working Ceph Hammer test setup with 3 OSDs and 1 MON (running on
VMs), and RBD was working fine.

The setup was not touched for two weeks (also no I/O activity), and when I
looked again, the cluster was in a bad state:

On the MON node (sto-vm20):
$ ceph health
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down

$ ceph health detail
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down
pg 0.22 is stuck stale for 1457679.263525, current state stale+active+clean, last acting [2,1,0]
pg 0.21 is stuck stale for 1457679.263529, current state stale+active+clean, last acting [1,2,0]
pg 0.20 is stuck stale for 1457679.263531, current state stale+active+clean, last acting [1,0,2]
pg 0.1f is stuck stale for 1457679.263533, current state stale+active+clean, last acting [2,0,1]
...
pg 0.24 is stuck stale for 1457679.263625, current state stale+active+clean, last acting [2,0,1]
pg 0.23 is stuck stale for 1457679.263627, current state stale+active+clean, last acting [1,2,0]
osd.0 is down since epoch 16, last address 9.4.68.111:6800/1658
osd.1 is down since epoch 16, last address 9.4.68.112:6800/1659
osd.2 is down since epoch 16, last address 9.4.68.113:6800/1654
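
As a sanity check on the timeline, the "stuck stale for" figure in the pg lines above can be converted to days. This is only a sketch: it assumes the figure is a duration in seconds, and stale_days is a hypothetical helper name, not a Ceph command.

```shell
# Hypothetical helper: convert a "stuck stale for <seconds>" value to days.
# Assumes the value printed by `ceph health detail` is in seconds.
stale_days() {
    # $1: duration in seconds, e.g. 1457679.263525
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s / 86400 }'
}

stale_days 1457679.263525   # prints 16.9
```

Under that assumption, the PGs went stale roughly 17 days ago, which is in line with the setup having been left alone for about two weeks.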

On the OSD nodes (sto-vm21, sto-vm22, sto-vm23), no Ceph daemon is running:
$ ps -ef | egrep "ceph|osd|rados"
(returns nothing)

I rebooted the OSD nodes as well as the MON node, but the only Ceph daemon
that comes up is ceph-mon on the MON node.

I tried to start the OSDs manually by executing
$ sudo /etc/init.d/ceph start osd
on the OSD nodes, but I saw neither an error message nor a log file update.

On the OSD nodes, the log files in /var/log/ceph have not been updated
since the failure event.

What is strange is that the OSDs no longer have any admin socket files
(which should normally be in /run/ceph), whereas the MON node does have an
admin socket:
$ ls -la /run/ceph
srwxr-xr-x  1 root root   0 Jul  7 15:27 ceph-mon.sto-vm20.asok

This looks very similar to
http://tracker.ceph.com/issues/7188
Bug #7188: Admin socket files are lost on log rotation calling initctl
reload (ubuntu 13.04 only)

Any ideas on how to restart / recover the OSDs would be much appreciated.
How can I start the OSD daemon(s) such that I can see any errors?
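
One possible way to surface startup errors is to bypass the init script and run ceph-osd in the foreground. The sketch below only prints the command for each OSD id that appears in the health output (0, 1, 2); it assumes the default cluster name and config path, and osd_debug_cmd is a hypothetical helper name, not part of Ceph.

```shell
# Hypothetical helper: print the foreground/debug command for one OSD id.
# -i selects the OSD id; -d keeps ceph-osd in the foreground and sends
# its log output to stderr instead of only to /var/log/ceph.
osd_debug_cmd() {
    printf 'sudo ceph-osd -i %s -d\n' "$1"
}

# Print the command for each OSD id from the health output; the one that
# matches the local node would then be run by hand on that node.
for id in 0 1 2; do
    osd_debug_cmd "$id"
done
```

Run on the matching OSD node, the printed command should show on the terminal any error that the init script is swallowing.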

Thanks,
- Fredy

PS: The Ceph setup is on Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-41-generic
x86_64)
