Help Recovering Ceph cluster

Hello All,

I have deployed Ceph with ceph-deploy on three nodes.  Each node is currently playing double duty as both a monitor node and an osd node.

This is what my cluster looked like before I rebooted one of the nodes:

>> root@red6:~# ceph -s
>>    health HEALTH_OK
>>    monmap e2: 3 mons at {kitt=192.168.0.35:6789/0,red6=192.168.0.40:6789/0,shepard=192.168.0.2:6789/0}, election epoch 10, quorum 0,1,2 kitt,red6,shepard
>>    osdmap e29: 7 osds: 7 up, 7 in
>>     pgmap v1692: 192 pgs: 192 active+clean; 19935 MB data, 40441 MB used, 4539 GB / 4580 GB avail; 73B/s rd, 0op/s
>>    mdsmap e1: 0/0/1 up

Last week, I rebooted one of my nodes (kitt), and after it came back up, I was unable to get it to rejoin the cluster.
Typically, I've been able to use ceph-deploy to delete the monitor and then redeploy it, and everything goes back to normal.
That didn't work this time, so that monitor and those two osds stayed out of the cluster.  I think when I rebooted the node,
I got errors that the mon was unreachable and it was taken out of the monmap.  When I tried to delete the monitor, ceph-deploy
told me there was no monitor to delete.  When I ran ceph-deploy to deploy the monitor, I received no errors, but the monitor never started.
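For reference, this is roughly the sequence I've been using to cycle a monitor with ceph-deploy (run from the admin directory that holds my ceph.conf and keyrings; the hostname is one of my nodes, and I'm not certain these subcommands are still the right approach here):

```sh
# Remove the dead monitor from the cluster, then recreate it on the same host.
# Run from the directory where ceph-deploy was originally invoked.
ceph-deploy mon destroy kitt
ceph-deploy mon create kitt
```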

I've tried a number of things in the docs, but something seems amiss, because when I try to restart monitors or osds, the init script tells me they're not found.
I've copied my ceph.conf at the end of this e-mail.

>> root@shepard:~# ls /var/lib/ceph/mon/                                                                                   
>> ceph-shepard                                                                                                            
>> root@shepard:~# /etc/init.d/ceph restart mon.shepard
>> /etc/init.d/ceph: mon.shepard not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )
(I've also tried mon.0 through mon.3)

Same thing with the osds:

>> root@shepard:~# ls /var/lib/ceph/osd/                                                                                   
>> ceph-3  ceph-4  ceph-6  sdb.journal  sdc.journal  sdd.journal                                                           
>> root@shepard:~# /etc/init.d/ceph restart osd.4                                                                         
>> /etc/init.d/ceph: osd.4 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines ) 

(I've tried every iteration I can think of: `service ceph restart`, stop/start instead of restart, and all of these both with and without the -a flag, e.g. `service ceph -a restart`.)
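Since the init script can't seem to map the daemon names, I assume I could also try starting the daemons directly, something like the following (the mon id and osd number are from my nodes; I'm not sure this is the recommended way, which is partly why I'm asking):

```sh
# Start the monitor and one osd daemon directly, bypassing the init script.
# -i takes the monitor id (the hostname, in my deployment) or the osd number;
# both daemons read /etc/ceph/ceph.conf and daemonize on their own.
ceph-mon -i shepard
ceph-osd -i 4
```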

That's about where I left it last week, before my work week started.  Other than occasionally reporting something to the effect of:

>> 2013-06-09 11:46:14.978350 7f8098e9a700  0 -- :/18915 >> 192.168.0.35:6789/0 pipe(0x7f8090002ca0 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault

when I ran ceph -s, ceph health reported OK, so I figured it was safe to leave until I was off shift and had the cycles to devote to it...
I really do know better.

Last night I experienced a power outage, and all of my nodes hard rebooted.  Now when I run ceph health (or use any of the ceph or ceph-deploy tools),
I get the following error messages:

>> 2013-06-09 12:12:14.692484 7f4506e0e700  0 -- :/24787 >> 192.168.0.40:6789/0 pipe(0x7f44fc000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
>> 2013-06-09 12:12:17.692509 7f450d5a0700  0 -- :/24787 >> 192.168.0.35:6789/0 pipe(0x7f44fc003010 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
>> 2013-06-09 12:12:20.692469 7f4506e0e700  0 -- :/24787 >> 192.168.0.2:6789/0 pipe(0x7f44fc0038c0 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault

This was run on the shepard node (192.168.0.2), but I get the same errors from all of the nodes.

I do see a number of ceph processes:

>> root@shepard:~# ps aux | grep ceph
>> root      1041  0.0  0.0  33680  7100 ?        Ss   06:11   0:18 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i shepard
>> root      1538  0.0  0.0 378904  4284 ?        Sl   06:11   0:04 ceph --cluster=ceph --name=osd.3 --keyring=/var/lib/ceph/osd/ceph-3/keyring osd crush create-or-move -- 3 0.45 root=default host=shepard
>> root      1581  0.0  0.0 378904  4300 ?        Sl   06:11   0:04 ceph --cluster=ceph --name=osd.6 --keyring=/var/lib/ceph/osd/ceph-6/keyring osd crush create-or-move -- 6 0.91 root=default host=shepard
>> root      1628  0.0  0.0 378904  4296 ?        Sl   06:11   0:04 ceph --cluster=ceph --name=osd.4 --keyring=/var/lib/ceph/osd/ceph-4/keyring osd crush create-or-move -- 4 0.45 root=default host=shepard

However, I don't see the monitor process actually running.

My first question is: how can I restart the ceph processes?
My second question is: am I on the right track here?  My hypothesis is that starting the monitors will get things going again, because nothing is listening on those ports:

>> root@shepard:~# telnet 192.168.0.2 6789
>> Trying 192.168.0.2...
>> telnet: Unable to connect to remote host: Connection refused
>> root@shepard:~# netstat -anp | grep ceph
>> root@shepard:~# netstat -anp | grep 6789

Actually, I just rebooted this node again, and now the osds are not running; only ceph-create-keys is.

My last question is: where does ceph-deploy create the configs?  I have the original files in the directory where I ran ceph-deploy,
and I know about the /etc/ceph/ceph.conf file, but there seems to be some other config that the cluster is pulling from.
Maybe I'm mistaken, but there are no osd sections in my ceph.conf.
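For what it's worth, from the init script's complaint I had expected ceph.conf to contain per-daemon sections, something like this (a guess on my part; the hostnames and osd number match my nodes, but none of these sections are in my actual file):

```ini
; hypothetical per-daemon sections, not present in my ceph.conf
[mon.shepard]
host = shepard
mon addr = 192.168.0.2:6789

[osd.4]
host = shepard
```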

Thanks for all your help.
Jon A

ceph.conf file:

[global]
fsid = 86f293d0-bec7-4694-ace7-af3fbaa98736
mon_initial_members = red6, kitt, shepard
mon_host = 192.168.0.40,192.168.0.35,192.168.0.2
auth_supported = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
