This is going to sound odd; if I hadn't been issuing all commands on the monitor, I would swear I had run 'rm -rf' from the OSD's shell in the /var/lib/ceph/osd/ceph-2/ directory. After creating the pool/rbd and getting an error from 'rbd info', I saw an OSD down/out, so I went to its shell and found that the ceph-osd daemon is gone. I'll assume I erased it, but how do I recover this cluster without doing a purge/purgedata reinstall?

I brought up a new cluster. All PGs are 'active+clean' and all 3 OSDs are up/in:

[root@essperf3 Ceph]# ceph -s
    cluster 32c48975-bb57-47f6-8138-e152452e3bbe
     health HEALTH_OK
     monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum 0 essperf3
     osdmap e8: 3 osds: 3 up, 3 in
      pgmap v13: 192 pgs, 3 pools, 0 bytes data, 0 objects
            10106 MB used, 1148 GB / 1158 GB avail
                 192 active+clean
[root@essperf3 Ceph]# ceph osd tree
# id    weight  type name       up/down reweight
-1      1.13    root default
-2      0.45            host ess51
0       0.45                    osd.0   up      1
-3      0.23            host ess52
1       0.23                    osd.1   up      1
-4      0.45            host ess59
2       0.45                    osd.2   up      1
[root@essperf3 Ceph]#

Next I created a test pool and a 1 GB RBD image and listed it:

[root@essperf3 Ceph]# ceph osd pool create testpool 75 75
pool 'testpool' created
[root@essperf3 Ceph]# ceph osd lspools
0 data,1 metadata,2 rbd,3 testpool,
[root@essperf3 Ceph]# rbd create testimage --size 1024 --pool testpool
[root@essperf3 Ceph]# rbd ls testpool
testimage
[root@essperf3 Ceph]#

When I look at the 'info' output, I start seeing problems:

[root@essperf3 Ceph]# rbd --image testimage info
rbd: error opening image testimage: (2) No such file or directory
2014-08-04 18:39:33.602263 7fc4b9e80760 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
[root@essperf3 Ceph]# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    693G     683G      10073M       1.42
POOLS:
    NAME         ID     USED     %USED     OBJECTS
    data         0      0        0         0
    metadata     1      0        0         0
    rbd          2      0        0         0
    testpool     3      137      0         2
[root@essperf3 Ceph]# ceph -s
    cluster 32c48975-bb57-47f6-8138-e152452e3bbe
     health HEALTH_WARN 267 pgs degraded; 100 pgs stuck unclean; recovery 2/6 objects degraded (33.333%)
     monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum 0 essperf3
     osdmap e21: 3 osds: 2 up, 2 in
      pgmap v48: 267 pgs, 4 pools, 137 bytes data, 2 objects
            10073 MB used, 683 GB / 693 GB avail
            2/6 objects degraded (33.333%)
                 267 active+degraded
  client io 17 B/s rd, 0 op/s
[root@essperf3 Ceph]#

I check to see which OSD is down:

[root@essperf3 Ceph]# ceph osd tree
# id    weight  type name       up/down reweight
-1      1.13    root default
-2      0.45            host ess51
0       0.45                    osd.0   up      1
-3      0.23            host ess52
1       0.23                    osd.1   up      1
-4      0.45            host ess59
2       0.45                    osd.2   down    0
[root@essperf3 Ceph]#

Then I go to the shell on ess59 to restart the OSD, and this is where it gets rather odd. My ceph.conf has

    debug osd = 20
    debug ms = 1

so I expect to see output from '/etc/init.d/ceph restart osd', but I see nothing. With a little digging I see that the /var/lib/ceph/osd/ceph-2/ directory is EMPTY and there is no ceph-osd daemon. It's almost as if I had run 'rm -rf' on that directory from the shell of ess59/osd.2, yet all commands were executed on the monitor.
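One thing worth ruling out for the 'rbd info' error above, separate from the OSD problem: when no --pool argument is given, the rbd tool defaults to the pool named 'rbd', so 'rbd --image testimage info' looks for testimage in the rbd pool rather than in testpool and can fail with "No such file or directory" even on a healthy cluster. A quick check, assuming the image is meant to live in testpool as created above:

    rbd --pool testpool --image testimage info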
[root@ess59 ceph]# ip addr | grep .59
    inet 10.10.40.59/24 brd 10.10.40.255 scope global em1
    inet6 fe80::92b1:1cff:fe18:659f/64 scope link
    inet 209.243.160.59/24 brd 209.243.160.255 scope global em2
    inet 10.10.50.59/24 brd 10.10.50.255 scope global p6p2
[root@ess59 ceph]# ll /var/lib/ceph/osd/
total 4
drwxr-xr-x 2 root root 4096 Aug 4 14:46 ceph-2
[root@ess59 ceph]# ll /var/lib/ceph/
total 24
drwxr-xr-x 2 root root 4096 Jul 29 18:36 bootstrap-mds
drwxr-xr-x 2 root root 4096 Aug 4 14:23 bootstrap-osd
drwxr-xr-x 2 root root 4096 Jul 29 18:36 mds
drwxr-xr-x 2 root root 4096 Jul 29 18:36 mon
drwxr-xr-x 3 root root 4096 Aug 4 14:46 osd
drwxr-xr-x 2 root root 4096 Aug 4 18:14 tmp
[root@ess59 ceph]# ll /var/lib/ceph/osd/ceph-2/
total 0
[root@ess59 ceph]#

Looking at the monitor logs I see osd.2 boot, and I even see where osd.2 leaves the cluster, but how did I lose the daemon? And how do I recover/repair the OSD without having to reinstall the cluster ... again?

2014-08-04 14:47:10.008426 mon.0 [INF] pgmap v13: 192 pgs: 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 14:49:08.854988 mon.0 [INF] pgmap v14: 192 pgs: 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:38:55.529118 mon.0 [INF] osdmap e9: 3 osds: 3 up, 3 in
2014-08-04 16:38:55.588920 mon.0 [INF] pgmap v15: 267 pgs: 75 creating, 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:38:56.674507 mon.0 [INF] osdmap e10: 3 osds: 3 up, 3 in
2014-08-04 16:38:56.707256 mon.0 [INF] pgmap v16: 267 pgs: 75 creating, 192 active+clean; 0 bytes data, 10106 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:01.182508 mon.0 [INF] pgmap v17: 267 pgs: 56 creating, 2 peering, 209 active+clean; 0 bytes data, 10107 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:02.265569 mon.0 [INF] pgmap v18: 267 pgs: 2 inactive, 20 active, 6 peering, 239 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:06.371070 mon.0 [INF] pgmap v19: 267 pgs: 2 inactive, 20 active, 4 peering, 241 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:39:07.484259 mon.0 [INF] pgmap v20: 267 pgs: 267 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:41:06.227435 mon.0 [INF] pgmap v21: 267 pgs: 267 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:01.178851 mon.0 [INF] osd.2 209.243.160.59:6800/21186 failed (3 reports from 2 peers after 24.931114 >= grace 20.000000)
2014-08-04 16:48:01.320953 mon.0 [INF] osdmap e11: 3 osds: 2 up, 3 in
2014-08-04 16:48:01.355520 mon.0 [INF] pgmap v22: 267 pgs: 100 stale+active+clean, 167 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:02.465783 mon.0 [INF] osdmap e12: 3 osds: 2 up, 3 in
2014-08-04 16:48:02.498833 mon.0 [INF] pgmap v23: 267 pgs: 100 stale+active+clean, 167 active+clean; 0 bytes data, 10108 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:07.279702 mon.0 [INF] pgmap v24: 267 pgs: 71 stale+active+clean, 90 active+degraded, 106 active+clean; 0 bytes data, 10109 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:08.352741 mon.0 [INF] pgmap v25: 267 pgs: 267 active+degraded; 0 bytes data, 10110 MB used, 1148 GB / 1158 GB avail
2014-08-04 16:48:22.268630 mon.0 [INF] pgmap v26: 267 pgs: 267 active+degraded; 112 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 68 B/s wr, 0 op/s; 2/3 objects degraded (66.667%)
2014-08-04 16:48:23.389449 mon.0 [INF] pgmap v27: 267 pgs: 267 active+degraded; 137 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 0 B/s rd, 135 B/s wr, 0 op/s; 4/6 objects degraded (66.667%)
2014-08-04 16:50:22.290200 mon.0 [INF] pgmap v28: 267 pgs: 267 active+degraded; 137 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 4/6 objects degraded (66.667%)
2014-08-04 16:50:23.352788 mon.0 [INF] pgmap v29: 267 pgs: 267 active+degraded; 137 bytes data, 10110 MB used, 1148 GB / 1158 GB avail; 4/6 objects degraded (66.667%)
2014-08-04 16:53:02.014805 mon.0 [INF] osd.2 out (down for 300.695457)
2014-08-04 16:53:02.119534 mon.0 [INF] osdmap e13: 3 osds: 2 up, 2 in
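Before tearing anything down, it may be worth checking from the ess59 shell whether the OSD's data partition is simply no longer mounted. With a ceph-disk/ceph-deploy style install the filestore lives on its own partition mounted at /var/lib/ceph/osd/ceph-2, so if that mount is gone the directory looks empty and the init script silently has nothing to start, even though no data has actually been deleted. A few checks, a sketch only, using standard tools and the default OSD log path:

    mount | grep ceph-2
    df -h /var/lib/ceph/osd/ceph-2
    tail -n 50 /var/log/ceph/ceph-osd.2.log

If the partition is intact and mounts cleanly again, running '/etc/init.d/ceph start osd.2' on ess59 may be all that is needed.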
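If the OSD's data really is gone, the usual way to rebuild a single OSD without a purge/purgedata of the whole cluster is to remove osd.2 from the CRUSH map, the auth database and the OSD map, and then prepare and activate it again. A rough sketch, assuming ceph-deploy was used for the original deployment and with /dev/sdX standing in for whatever disk backs osd.2 on ess59 (both are assumptions):

    # on the monitor/admin node: drop the dead OSD from crush, auth, and the osd map
    ceph osd crush remove osd.2
    ceph auth del osd.2
    ceph osd rm 2

    # re-create the OSD on ess59 (device name is a placeholder)
    ceph-deploy osd create ess59:/dev/sdX

Once the new OSD is up and in, the degraded PGs should backfill on their own.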