Hi,

I had a working Ceph Hammer test setup with 3 OSDs and 1 MON (running on VMs), and RBD was working fine.

The setup was not touched for two weeks (also no I/O activity), and when I looked again, the cluster was in a bad state:

On the MON node (sto-vm20):

$ ceph health
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down

$ ceph health detail
HEALTH_WARN 72 pgs stale; 72 pgs stuck stale; 3/3 in osds are down
pg 0.22 is stuck stale for 1457679.263525, current state stale+active+clean, last acting [2,1,0]
pg 0.21 is stuck stale for 1457679.263529, current state stale+active+clean, last acting [1,2,0]
pg 0.20 is stuck stale for 1457679.263531, current state stale+active+clean, last acting [1,0,2]
pg 0.1f is stuck stale for 1457679.263533, current state stale+active+clean, last acting [2,0,1]
...
pg 0.24 is stuck stale for 1457679.263625, current state stale+active+clean, last acting [2,0,1]
pg 0.23 is stuck stale for 1457679.263627, current state stale+active+clean, last acting [1,2,0]
osd.0 is down since epoch 16, last address 9.4.68.111:6800/1658
osd.1 is down since epoch 16, last address 9.4.68.112:6800/1659
osd.2 is down since epoch 16, last address 9.4.68.113:6800/1654

On the OSD nodes (sto-vm21, sto-vm22, sto-vm23), no Ceph daemon is running:

$ ps -ef | egrep "ceph|osd|rados"
(returns nothing)

I rebooted the OSDs as well as the MON, but still only the ceph-mon daemon is running on the MON node.

I tried to start the OSDs manually by executing

$ sudo /etc/init.d/ceph start osd

on the OSD nodes, but I saw neither an error message nor a log file update. On the OSD nodes, the log files in /var/log/ceph have not been updated since the failure event.

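In case it helps anyone compare against their own cluster, here is a minimal sketch of how the "ceph health detail" output shown above can be condensed into one line. The function name summarize_health is hypothetical (my own), it only parses text with awk and does not call any Ceph API, and it assumes the "pg ... is stuck stale" and "osd.N is down since epoch" line formats exactly as they appear above.

```shell
# summarize_health (hypothetical helper): reads `ceph health detail`
# output on stdin and prints the number of stuck-stale PGs and the
# list of OSDs reported as down.
summarize_health() {
  awk '
    # Lines such as: pg 0.22 is stuck stale for 1457679.263525, ...
    /^pg [^ ]+ is stuck stale/ { stale++ }
    # Lines such as: osd.0 is down since epoch 16, last address ...
    /^osd\.[0-9]+ is down since epoch/ {
      down = (down == "" ? $1 : down " " $1)
    }
    END { printf "stale_pgs=%d down_osds=%s\n", stale, down }
  '
}
```

Usage on the MON node would be along the lines of: ceph health detail | summarize_health
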
What is strange is that the OSD nodes no longer have any admin socket files (which should normally be in /run/ceph), whereas the MON node does have an admin socket:

$ ls -la /run/ceph
srwxr-xr-x 1 root root 0 Jul 7 15:27 ceph-mon.sto-vm20.asok

This looks very similar to Bug #7188, "Admin socket files are lost on log rotation calling initctl reload (ubuntu 13.04 only)":
http://tracker.ceph.com/issues/7188

Any ideas on how to restart / recover the OSDs are much appreciated. How can I start the OSD daemon(s) such that I can see any errors?

Thanks,
- Fredy

PS: The Ceph setup is on Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-41-generic x86_64)