Re: Ceph OSDs are down and cannot be started


 



Run 'ceph-osd -i 0 -f' in a console and see what the output is.
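
If that exits immediately without printing anything useful, you can also run it in the foreground with the log going to stderr and a higher debug level, something along the lines of (assuming osd.0 is local to that host; adjust the id as needed):

$ sudo ceph-osd -i 0 -d --debug_osd 20

The last few lines of output should show why the daemon dies.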

Thanks & Regards
Somnath



-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Fredy Neeser
Sent: Tuesday, July 07, 2015 9:15 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject:  Ceph OSDs are down and cannot be started


Hi,

I had a working Ceph Hammer test setup with 3 OSDs and 1 MON (running on VMs), and RBD was working fine.

The setup was not touched for two weeks (also no I/O activity), and when I looked again, the cluster was in a bad state:

On the MON node (sto-vm20):
$ ceph health
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down

$ ceph health detail
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down
pg 0.22 is stuck stale for 1457679.263525, current state stale+active+clean, last acting [2,1,0]
pg 0.21 is stuck stale for 1457679.263529, current state stale+active+clean, last acting [1,2,0]
pg 0.20 is stuck stale for 1457679.263531, current state stale+active+clean, last acting [1,0,2]
pg 0.1f is stuck stale for 1457679.263533, current state stale+active+clean, last acting [2,0,1]
...
pg 0.24 is stuck stale for 1457679.263625, current state stale+active+clean, last acting [2,0,1]
pg 0.23 is stuck stale for 1457679.263627, current state stale+active+clean, last acting [1,2,0]
osd.0 is down since epoch 16, last address 9.4.68.111:6800/1658
osd.1 is down since epoch 16, last address 9.4.68.112:6800/1659
osd.2 is down since epoch 16, last address 9.4.68.113:6800/1654

On the OSD nodes (sto-vm21, sto-vm22, sto-vm23), no Ceph daemon is running:
$ ps -ef | egrep "ceph|osd|rados"
(returns nothing)

I rebooted the OSD nodes as well as the MON node, but still the only Ceph daemon running is ceph-mon on the MON node.

I tried to start the OSDs manually by running 'sudo /etc/init.d/ceph start osd' on the OSD nodes, but I saw neither an error message nor a logfile update.
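
(I am not sure the sysvinit script is even the right entry point on this platform; since the nodes run Ubuntu 14.04 the OSDs may be managed by upstart instead, i.e. something like

$ sudo status ceph-osd id=0
$ sudo start ceph-osd id=0

but I have not verified that for this setup.)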

On the OSD nodes, the log files in /var/log/ceph have not been updated since the failure event.


What is strange is that the OSDs no longer have any admin socket files (which should normally be in /run/ceph), whereas the MON node does have an admin socket:
$ ls -la /run/ceph
srwxr-xr-x  1 root root   0 Jul  7 15:27 ceph-mon.sto-vm20.asok
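
(With a running OSD I would expect to be able to query it through its socket, e.g. something like

$ sudo ceph --admin-daemon /run/ceph/ceph-osd.0.asok version

but with no daemons and no .asok files there is nothing to query.)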

This looks very similar to
http://tracker.ceph.com/issues/7188
Bug #7188: Admin socket files are lost on log rotation calling initctl reload (ubuntu 13.04 only)
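
(If it is indeed the same issue, the trigger would presumably be the logrotate hook; I guess inspecting the postrotate action with

$ cat /etc/logrotate.d/ceph

would show whether it calls 'initctl reload' as described in the ticket, though I have not dug into that yet.)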

Any ideas on how to restart / recover the OSDs would be much appreciated.
How can I start the OSD daemon(s) such that I can see any errors?

Thanks,
- Fredy

PS: The Ceph setup is on Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-41-generic x86_64)





