OSDs will not come up

I configured a three-monitor Ceph cluster following the manual
instructions at
http://ceph.com/docs/v0.80.5/install/manual-deployment/ and
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/. The
monitor cluster came up without a problem and seems to be fine. "ceph
-s" currently shows this (I didn't capture the output before I added
the OSDs, but it was probably roughly the same):

    cluster f6c14635-1e04-497e-b782-dbba65c70257
     health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
     monmap e1: 3 mons at {curly=10.38.56.3:6789/0,larry=10.38.56.2:6789/0,moe=10.38.56.4:6789/0}, election epoch 10, quorum 0,1,2 larry,curly,moe
     osdmap e35: 15 osds: 0 up, 0 in
      pgmap v36: 192 pgs, 3 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 192 creating

So, aside from the OSDs, this looks fine. I then added fifteen OSD
daemons, spread across two of the machines in the cluster. I again
followed the instructions on the manual deployment page (roughly the
commands sketched below), which have always worked for me in the
past. This time, none of the daemons ever get marked "up" or "in",
and Google isn't helping me much either. Here is what I can determine:

"ps awx" on the two storage machines shows that the ceph-osd
processes are running, with stable PIDs.
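
For example, something like this on each storage host (trivial, but
it is how I checked):

    ps awx | grep ceph-osd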

A sample "netstat -anp | grep 6789" looks like this:

tcp        0      0 10.38.56.2:6789         0.0.0.0:*               LISTEN      55585/ceph-mon
tcp        0      0 10.38.56.2:40219        10.38.56.3:6789         ESTABLISHED 19081/ceph-osd
tcp        0      0 10.38.56.2:6789         10.38.56.3:60891        ESTABLISHED 55585/ceph-mon
tcp        0      0 10.38.56.2:60586        10.38.56.4:6789         ESTABLISHED 9830/ceph-osd
tcp        0      0 10.38.56.2:6789         10.38.56.3:60856        ESTABLISHED 55585/ceph-mon
tcp        0      0 10.38.56.2:60606        10.38.56.4:6789         ESTABLISHED 20424/ceph-osd
tcp        0      0 10.38.56.2:40207        10.38.56.3:6789         ESTABLISHED 13247/ceph-osd
tcp        0      0 10.38.56.2:54488        10.38.56.2:6789         ESTABLISHED 16445/ceph-osd
tcp        0      0 10.38.56.2:60610        10.38.56.4:6789         ESTABLISHED 24939/ceph-osd
tcp        0      0 10.38.56.2:6789         10.38.56.2:54488        ESTABLISHED 55585/ceph-mon
tcp        0      0 10.38.56.2:60560        10.38.56.4:6789         ESTABLISHED 55585/ceph-mon
tcp        0      0 10.38.56.2:40211        10.38.56.3:6789         ESTABLISHED 14662/ceph-osd

The other storage machine looks roughly the same. It looks to me like
the OSDs are running and are connected to the monitors.
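
For reference, each OSD was created with roughly this sequence from
the manual-deployment page (the IDs, UUIDs, weights, device names,
and hostnames here are placeholders, not my literal commands):

    uuidgen
    ceph osd create {osd-uuid}
    mkdir /var/lib/ceph/osd/ceph-{osd-id}
    mkfs -t xfs /dev/{disk}
    mount /dev/{disk} /var/lib/ceph/osd/ceph-{osd-id}
    ceph-osd -i {osd-id} --mkfs --mkkey
    ceph auth add osd.{osd-id} osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-{osd-id}/keyring
    ceph osd crush add-bucket {hostname} host
    ceph osd crush move {hostname} root=default
    ceph osd crush add osd.{osd-id} {weight} host={hostname}
    /etc/init.d/ceph start osd.{osd-id}

The auth step matches the caps shown in "ceph auth list" below, and
the crush steps are why larry and curly appear under root default in
"ceph osd tree" below.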

"ceph auth list" looks like this (keys blanked out):

installed auth entries:

osd.0
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.1
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.10
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.11
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.12
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.13
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.14
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.2
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.3
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.4
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.5
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.6
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.7
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.8
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
osd.9
    key: XXX
    caps: [mon] allow rwx
    caps: [osd] allow *
client.admin
    key: XXX
    caps: [mds] allow
    caps: [mon] allow *
    caps: [osd] allow *

"ceph osd tree" gives:

# id   weight   type name           up/down   reweight
-1     16.12    root default
-2      8.663       host larry
 0      0.932           osd.0       down      0
 1      1.4             osd.1       down      0
 2      1.4             osd.2       down      0
 3      1.4             osd.3       down      0
 4      0.932           osd.4       down      0
 5      1.9             osd.5       down      0
 6      0.699           osd.6       down      0
-3      7.456       host curly
 7      0.932           osd.7       down      0
 8      0.932           osd.8       down      0
 9      0.932           osd.9       down      0
10      0.932           osd.10      down      0
11      0.932           osd.11      down      0
12      0.932           osd.12      down      0
13      0.932           osd.13      down      0
14      0.932           osd.14      down      0

My /var/log/ceph/osd-*.log files don't contain anything that looks
like an error. They mostly end with some lines about "crush map has
features..." that come after "done with init, starting boot process".
On an OSD that I restarted, the log just ends with the "starting boot
process" line. Finally, my ceph.conf looks like this:

[global]
fsid = f6c14635-1e04-497e-b782-dbba65c70257
mon initial members = larry,curly,moe
mon host = 10.38.56.2,10.38.56.3,10.38.56.4
public network = 10.38.56.0/24
cluster network = 10.29.38.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 10000
filestore max sync interval = 5
filestore xattr use omap = false
osd pool default size = 2  # Write an object n times.
osd pool default min size = 2 # Allow writing n copies in a degraded state.
osd pool default pg num = 500
osd pool default pgp num = 500
osd crush chooseleaf type = 1
# osd crush chooseleaf type = 0

[osd.0]
public address = 10.38.56.2
cluster address = 10.29.38.2

[osd.1]
public address = 10.38.56.2
cluster address = 10.29.38.2

[osd.2]
public address = 10.38.56.2
cluster address = 10.29.38.2

[osd.3]
public address = 10.38.56.2
cluster address = 10.29.38.2

[osd.4]
public address = 10.38.56.2
cluster address = 10.29.38.2

[osd.5]
public address = 10.38.56.2
cluster address = 10.29.38.2

[osd.6]
public address = 10.38.56.2
cluster address = 10.29.38.2

[osd.7]
public address = 10.38.56.3
cluster address = 10.29.38.3

[osd.8]
public address = 10.38.56.3
cluster address = 10.29.38.3

[osd.9]
public address = 10.38.56.3
cluster address = 10.29.38.3

[osd.10]
public address = 10.38.56.3
cluster address = 10.29.38.3

[osd.11]
public address = 10.38.56.3
cluster address = 10.29.38.3

[osd.12]
public address = 10.38.56.3
cluster address = 10.29.38.3

[osd.13]
public address = 10.38.56.3
cluster address = 10.29.38.3

[osd.14]
public address = 10.38.56.3
cluster address = 10.29.38.3

[mds.0]
host = larry

[mon.curly]
mon addr = 10.38.56.2

[mon.larry]
mon addr = 10.38.56.3

[mon.moe]
mon addr = 10.38.56.4

I added the [mon.X] sections later to see if they would do anything,
and they didn't. I really have no idea what's going on here. Any
advice would be appreciated.
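
If more detail would help, I can restart one OSD in the foreground
with verbose logging and post the output, e.g. something like this
(the debug levels are just my guess at what would be useful):

    ceph-osd -i 0 -d --debug-ms 1 --debug-osd 20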