Recover from node failure / monitor and OSDs do not come back

My configuration is: two OSD servers, one admin node, three monitors; 
all running 0.72.2.

I had to switch off one of the OSD servers. The good news is: as 
expected, all clients survived and continued to work with the 
cluster, and the cluster entered a HEALTH_WARN state (one monitor 
down, 5 of 10 OSDs down).
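
For reference, this state can be observed with:

# ceph -s
# ceph health detail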

The bad news is: I cannot resume this server's operation. When I 
booted the server, the monitor was started automatically, but it did 
not join the cluster. "/etc/init.d/ceph start mon" says "already 
running", but ceph -s still reports that this monitor is missing.
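
Is the following the right way to check what the monitor itself 
thinks it is doing? (This assumes the default admin socket path as 
set up by ceph-deploy; mon.hvrrzrx301 is the one on the rebooted 
node.)

# ceph mon stat
# ceph --admin-daemon /var/run/ceph/ceph-mon.hvrrzrx301.asok mon_status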

And the OSDs do not come back; nor can I restart them. The error 
message is:

# /etc/init.d/ceph start osd.0
/etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines 
mon.hvrrzrx301 , /var/lib/ceph defines mon.hvrrzrx301)
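
Since the OSDs were created by ceph-deploy (i.e. via ceph-disk on 
GPT partitions), I suspect the init script cannot see them because 
/var/lib/ceph/osd/ceph-* is not mounted after the reboot. Would 
something like the following be the intended way to reactivate them? 
(The device name is from my setup; I am not sure whether 
"activate-all" is available in this version.)

# ceph-disk list
# ceph-disk activate /dev/sdb2

or, for all prepared partitions at once:

# ceph-disk activate-all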

As expected, ceph osd tree displays these OSDs as down:

# id    weight  type name       up/down reweight
-1      2.7     root default
-2      1.35            host hvrrzrx301
0       0.27                    osd.0   down    0
2       0.27                    osd.2   down    0
4       0.27                    osd.4   down    0
6       0.27                    osd.6   down    0
8       0.27                    osd.8   down    0
-3      1.35            host hvrrzrx303
1       0.27                    osd.1   up      1
3       0.27                    osd.3   up      1
5       0.27                    osd.5   up      1
7       0.27                    osd.7   up      1
9       0.27                    osd.9   up      1


My ceph.conf only contains those settings which "ceph-deploy new" 
put there; i.e. the OSDs are not mentioned in ceph.conf. I assume 
that this is the problem with my OSDs? Apparently the cluster (the 
surviving monitors) still knows that osd.0, osd.2, etc. should 
appear on the failed node.
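
Indeed, even with no OSD sections in ceph.conf, the OSDs and their 
keys are still visible in the cluster maps and the auth database, so 
I take it the osdmap, not ceph.conf, is authoritative here:

# ceph osd dump
# ceph auth list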

Alas, I couldn't find any description of how to configure OSDs 
within ceph.conf ... I tried to define
[osd.0]
host = my_server
devs = /dev/sdb2
data = /var/lib/ceph/osd/ceph-0

but now it complains that no filesystem type is defined ...
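
For completeness, this is the full section I am experimenting with 
now; the last two lines are only my guess at where the filesystem 
type is supposed to be declared (they are the old mkcephfs-style 
keys):

[osd.0]
host = my_server
devs = /dev/sdb2
osd data = /var/lib/ceph/osd/ceph-0
osd mkfs type = xfs
osd mount options xfs = rw,noatime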

To summarize: where can I find rules and procedures for setting up a 
ceph.conf beyond what ceph-deploy generates? And what must I do in 
addition to ceph-deploy so that I can survive a node outage and 
reattach the node to the cluster, with respect to both the monitor 
on that node and the OSDs?


best regards
-- 
Diedrich Ehlerding, Fujitsu Technology Solutions GmbH,
FTS CE SC PS&IS W, Hildesheimer Str 25, D-30880 Laatzen
Phone +49 511 8489-1806, Fax -251806, Mobile +49 173 2464758
Company details: http://de.ts.fujitsu.com/imprint.html




