My OSDs are down and not coming UP

Hi,

I'm a newbie in the Ceph world. I'm trying to set up my first test Ceph cluster; my MON servers are running and talking to each other, but my OSDs are still down and won't come up. Actually, only the OSD running on the same node as the elected mon leader is able to connect and come UP.

To be more technical: I have 4 physical nodes living in a pure IPv6 environment, running Gentoo Linux and Ceph 9.2 (Infernalis). All node names are resolvable in DNS and are also listed in the hosts files.

I'm starting the OSDs with a command like this:

node1# /usr/bin/ceph-osd -f -i 1 --pid-file /run/ceph/osd.1.pid -c /etc/ceph/ceph.conf

A single mon.0 is also running on node1, and that OSD comes up:

2015-12-28 23:37:27.931686 mon.0 [INF] osd.1 [2001:718:2:1612::50]:6800/23709 boot
2015-12-28 23:37:27.932605 mon.0 [INF] osdmap e19: 2 osds: 1 up, 1 in
2015-12-28 23:37:27.933963 mon.0 [INF] pgmap v24: 64 pgs: 64 stale+active+undersized+degraded; 0 bytes data, 1057 MB used, 598 GB / 599 GB avail

but starting osd.0 on node2:

# /usr/bin/ceph-osd -f -i 0 --pid-file /run/ceph/osd.0.pid -c /etc/ceph/ceph.conf

does nothing. The process is running, and netstat shows an open connection from ceph-osd between node2 and node1. Here I'm lost: IPv6 connectivity is OK, DNS is OK, time is in sync, 1 mon is running, and there are 2 OSDs but only one is UP. What is missing?
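
For reference, these are roughly the sanity checks I ran from node2 (addresses are the ones from my config below; the nc flags are from the OpenBSD netcat, so they may differ on other variants):

node2# ping6 -c 3 node1
node2# nc -6 -zv xxxx:xxxx:2:1612::50 6789    # mon port on node1 answers
node2# ntpq -p                                # clocks are in sync
node2# ceph -s                                # cluster reachable with the cephx keyring
node2# ceph osd dump | grep osd.0             # osd.0 exists in the osdmap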

ceph-osd in debug mode shows a difference between node1 and node2:

node1, UP:
2015-12-28 01:42:59.084371 7f72f9873800 20 osd.1 15  clearing temps in 0.3f_head pgid 0.3f
2015-12-28 01:42:59.084453 7f72f9873800  0 osd.1 15 load_pgs
2015-12-28 01:42:59.085248 7f72f9873800 10 osd.1 15 load_pgs ignoring unrecognized meta
2015-12-28 01:42:59.094690 7f72f9873800 10 osd.1 15 pgid 0.0 coll 0.0_head
2015-12-28 01:42:59.094835 7f72f9873800 30 osd.1 0 get_map 15 -cached
2015-12-28 01:42:59.094848 7f72f9873800 10 osd.1 15 _open_lock_pg 0.0
2015-12-28 01:42:59.094857 7f72f9873800 10 osd.1 15 _get_pool 0
2015-12-28 01:42:59.094928 7f72f9873800  5 osd.1 pg_epoch: 15 pg[0.0(unlocked)] enter Initial
2015-12-28 01:42:59.094980 7f72f9873800 20 osd.1 pg_epoch: 15 pg[0.0(unlocked)] enter NotTrimming
2015-12-28 01:42:59.094998 7f72f9873800 30 osd.1 pg_epoch: 15 pg[0.0( DNE empty local-les=0 n=0 ec=0 les/c/f 0/0/0 0/0/0) [] r=0 lpr=0 crt=0'0 inactive NIBBLEW
2015-12-28 01:42:59.095186 7f72f9873800 20 read_log coll 0.0_head log_oid 0/00000000//head

node2, DOWN:
2015-12-28 01:36:54.437246 7f4507957800  0 osd.0 11 load_pgs
2015-12-28 01:36:54.437267 7f4507957800 10 osd.0 11 load_pgs ignoring unrecognized meta
2015-12-28 01:36:54.437274 7f4507957800  0 osd.0 11 load_pgs opened 0 pgs
2015-12-28 01:36:54.437278 7f4507957800 10 osd.0 11 build_past_intervals_parallel nothing to build
2015-12-28 01:36:54.437282 7f4507957800  2 osd.0 11 superblock: i am osd.0
2015-12-28 01:36:54.437287 7f4507957800 10 osd.0 11 create_logger
2015-12-28 01:36:54.438157 7f4507957800 -1 osd.0 11 log_to_monitors {default=true}
2015-12-28 01:36:54.449278 7f4507957800 10 osd.0 11 set_disk_tp_priority class  priority -1
2015-12-28 01:36:54.450813 7f44ddbff700 30 osd.0 11 heartbeat
2015-12-28 01:36:54.452558 7f44ddbff700 30 osd.0 11 heartbeat checking stats
2015-12-28 01:36:54.452592 7f44ddbff700 20 osd.0 11 update_osd_stat osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist [])
2015-12-28 01:36:54.452611 7f44ddbff700  5 osd.0 11 heartbeat: osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist [])
2015-12-28 01:36:54.452618 7f44ddbff700 30 osd.0 11 heartbeat check
2015-12-28 01:36:54.452622 7f44ddbff700 30 osd.0 11 heartbeat lonely?
2015-12-28 01:36:54.452624 7f44ddbff700 30 osd.0 11 heartbeat done
2015-12-28 01:36:54.452627 7f44ddbff700 30 osd.0 11 heartbeat_entry sleeping for 2.3
2015-12-28 01:36:54.452588 7f44da7fc700 10 osd.0 11 agent_entry start
2015-12-28 01:36:54.453338 7f44da7fc700 20 osd.0 11 agent_entry empty queue

My ceph.conf looks like this:

[global]
fsid = b186d870-9c6d-4a8b-ac8a-e263f4c205da
ms_bind_ipv6 = true
public_network = xxxx:xxxx:2:1612::/64
mon initial members = 0
mon host = [xxxx:xxxx:2:1612::50]:6789
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2
osd pool default min size = 1
osd journal size = 1024
osd mkfs type = xfs
osd mount options xfs = rw,inode64
osd crush chooseleaf type = 1

[mon.0]
host = node1
mon addr = [xxxx:xxxx:2:1612::50]:6789

[mon.1]
host = node3
mon addr = [xxxx:xxxx:2:1612::30]:6789

[osd.0]
host = node2
devs = /dev/vg0/osd0

[osd.1]
host = node1
devs = /dev/vg0/osd

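For completeness, osd.0 on node2 was prepared roughly like this, following the manual-deployment guide listed at the end (reproduced from memory, so exact flags and paths may be slightly off):

node2# mkfs.xfs /dev/vg0/osd0
node2# mkdir -p /var/lib/ceph/osd/ceph-0
node2# mount -o rw,inode64 /dev/vg0/osd0 /var/lib/ceph/osd/ceph-0
node2# ceph osd create                  # allocated id 0
node2# ceph-osd -i 0 --mkfs --mkkey
node2# ceph auth add osd.0 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-0/keyring
node1# ceph osd crush add-bucket node2 host
node1# ceph osd crush move node2 root=default
node1# ceph osd crush add osd.0 1.0 host=node2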

My ceph osd tree:

node1 # ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.00000 root default
-2 1.00000     host node2
 0 1.00000         osd.0     down        0          1.00000
-3 1.00000     host node1
 1 1.00000         osd.1       up  1.00000          1.00000
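
If it helps, I can also attach the output of these (assuming the default admin-socket path on node2):

node1# ceph health detail
node1# ceph osd dump
node2# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status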

Any help with this is appreciated. I followed the steps in these guides:

https://wiki.gentoo.org/wiki/Ceph/Installation#Installation
http://docs.ceph.com/docs/master/install/manual-deployment/#adding-osds
http://www.mad-hacking.net/documentation/linux/ha-cluster/storage-area-network/ceph-additional-nodes.xml
http://blog.widodh.nl/2014/05/deploying-ceph-over-ipv6/

Thanks in advance.

Martin

--
====================================
Ing. Martin Samek
   ICT systems engineer
   FELK Admin

Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Control Engineering
Karlovo namesti 13/E, 121 35 Prague
Czech Republic

e-mail:  samekma1@xxxxxxxxxxx
phone: +420 22435 7599
mobile: +420 605 285 125
====================================



