Re: My OSDs are down and not coming UP

It could be a network issue, possibly related to MTU. Try running with debug_ms = 1 and see if you find anything. Also, try running a command like 'traceroute' and see whether it reports any errors.
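For example (a rough sketch; the interface name eth0 and the packet size are placeholders for your setup):

node2# ip link show dev eth0 | grep mtu      # compare with the same interface on node1
node2# ping6 -M do -s 1452 node1             # unfragmentable ~1500-byte packet; fails if the path MTU is smaller
node2# traceroute6 node1

For the messenger debug output, either add "debug ms = 1" under [osd] in ceph.conf before restarting the OSD, or pass it on the command line:

node2# /usr/bin/ceph-osd -f -i 0 -c /etc/ceph/ceph.conf --debug_ms 1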

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Ing. Martin Samek
Sent: Monday, December 28, 2015 2:59 PM
To: Ceph Users
Subject:  My OSDs are down and not coming UP

Hi,

I'm a newbie in the Ceph world. I'm trying to set up my first test Ceph cluster; my MON servers are running and talking to each other, but my OSDs are still down and won't come up. Actually, only the OSD running on the same node as the elected master is able to connect and come UP.

To be technical: I have 4 physical nodes living in a pure IPv6 environment, running Gentoo Linux and Ceph 9.2. All node names are resolvable in DNS and are also listed in the hosts files.

I'm starting the OSD with a command like this:

node1# /usr/bin/ceph-osd -f -i 1 --pid-file /run/ceph/osd.1.pid -c /etc/ceph/ceph.conf

A single mon.0 is also running at node1, and that OSD comes up:

2015-12-28 23:37:27.931686 mon.0 [INF] osd.1 [2001:718:2:1612::50]:6800/23709 boot
2015-12-28 23:37:27.932605 mon.0 [INF] osdmap e19: 2 osds: 1 up, 1 in
2015-12-28 23:37:27.933963 mon.0 [INF] pgmap v24: 64 pgs: 64 stale+active+undersized+degraded; 0 bytes data, 1057 MB used, 598 GB / 599 GB avail

but running osd.0 at node2:

# /usr/bin/ceph-osd -f -i 0 --pid-file /run/ceph/osd.0.pid -c /etc/ceph/ceph.conf

did nothing: the process is running, and netstat shows an open connection from ceph-osd between node2 and node1. Here I'm lost. IPv6 connectivity is OK, DNS is OK, time is in sync, one mon is running, and there are two OSDs but only one is UP.
What is missing?
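For reference, a rough sketch of the checks that can be run (assuming the admin keyring is available and the daemon's admin socket is at its default location):

node2# ceph -s                        # overall health and mon quorum
node2# ceph osd dump | grep osd.0     # address the cluster has recorded for osd.0
node2# ceph daemon osd.0 status       # query the running daemon via its admin socket
node2# ss -tlnp | grep ceph-osd       # address/ports the OSD actually bound (6800-7300 by default)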

ceph-osd in debug mode shows differences between node1 and node2:

node1, UP:
> 2015-12-28 01:42:59.084371 7f72f9873800 20 osd.1 15  clearing temps in 0.3f_head pgid 0.3f
> 2015-12-28 01:42:59.084453 7f72f9873800  0 osd.1 15 load_pgs
> 2015-12-28 01:42:59.085248 7f72f9873800 10 osd.1 15 load_pgs ignoring unrecognized meta
> 2015-12-28 01:42:59.094690 7f72f9873800 10 osd.1 15 pgid 0.0 coll 0.0_head
> 2015-12-28 01:42:59.094835 7f72f9873800 30 osd.1 0 get_map 15 -cached
> 2015-12-28 01:42:59.094848 7f72f9873800 10 osd.1 15 _open_lock_pg 0.0
> 2015-12-28 01:42:59.094857 7f72f9873800 10 osd.1 15 _get_pool 0
> 2015-12-28 01:42:59.094928 7f72f9873800  5 osd.1 pg_epoch: 15 pg[0.0(unlocked)] enter Initial
> 2015-12-28 01:42:59.094980 7f72f9873800 20 osd.1 pg_epoch: 15 pg[0.0(unlocked)] enter NotTrimming
> 2015-12-28 01:42:59.094998 7f72f9873800 30 osd.1 pg_epoch: 15 pg[0.0( DNE empty local-les=0 n=0 ec=0 les/c/f 0/0/0 0/0/0) [] r=0 lpr=0 crt=0'0 inactive NIBBLEW
> 2015-12-28 01:42:59.095186 7f72f9873800 20 read_log coll 0.0_head log_oid 0/00000000//head

node2, DOWN:
> 2015-12-28 01:36:54.437246 7f4507957800  0 osd.0 11 load_pgs
> 2015-12-28 01:36:54.437267 7f4507957800 10 osd.0 11 load_pgs ignoring unrecognized meta
> 2015-12-28 01:36:54.437274 7f4507957800  0 osd.0 11 load_pgs opened 0 pgs
> 2015-12-28 01:36:54.437278 7f4507957800 10 osd.0 11 build_past_intervals_parallel nothing to build
> 2015-12-28 01:36:54.437282 7f4507957800  2 osd.0 11 superblock: i am osd.0
> 2015-12-28 01:36:54.437287 7f4507957800 10 osd.0 11 create_logger
> 2015-12-28 01:36:54.438157 7f4507957800 -1 osd.0 11 log_to_monitors {default=true}
> 2015-12-28 01:36:54.449278 7f4507957800 10 osd.0 11 set_disk_tp_priority class  priority -1
> 2015-12-28 01:36:54.450813 7f44ddbff700 30 osd.0 11 heartbeat
> 2015-12-28 01:36:54.452558 7f44ddbff700 30 osd.0 11 heartbeat checking stats
> 2015-12-28 01:36:54.452592 7f44ddbff700 20 osd.0 11 update_osd_stat osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist [])
> 2015-12-28 01:36:54.452611 7f44ddbff700  5 osd.0 11 heartbeat: osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist [])
> 2015-12-28 01:36:54.452618 7f44ddbff700 30 osd.0 11 heartbeat check
> 2015-12-28 01:36:54.452622 7f44ddbff700 30 osd.0 11 heartbeat lonely?
> 2015-12-28 01:36:54.452624 7f44ddbff700 30 osd.0 11 heartbeat done
> 2015-12-28 01:36:54.452627 7f44ddbff700 30 osd.0 11 heartbeat_entry sleeping for 2.3
> 2015-12-28 01:36:54.452588 7f44da7fc700 10 osd.0 11 agent_entry start
> 2015-12-28 01:36:54.453338 7f44da7fc700 20 osd.0 11 agent_entry empty queue

My ceph.conf looks like this:

[global]
fsid = b186d870-9c6d-4a8b-ac8a-e263f4c205da
ms_bind_ipv6 = true
public_network = xxxx:xxxx:2:1612::/64
mon initial members = 0
mon host = [xxxx:xxxx:2:1612::50]:6789
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2
osd pool default min size = 1
osd journal size = 1024
osd mkfs type = xfs
osd mount options xfs = rw,inode64
osd crush chooseleaf type = 1

[mon.0]
host = node1
mon addr = [xxxx:xxxx:2:1612::50]:6789

[mon.1]
host = node3
mon addr = [xxxx:xxxx:2:1612::30]:6789

[osd.0]
host = node2
devs = /dev/vg0/osd0

[osd.1]
host = node1
devs = /dev/vg0/osd
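(To double-check that a running daemon actually picked these values up, its admin socket can be queried, e.g.:

node2# ceph daemon osd.0 config show | grep -E 'ms_bind_ipv6|public_network'

assuming the socket is at its default path under /var/run/ceph/.)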


My ceph osd tree:

node1 # ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.00000 root default
-2 1.00000     host node2
 0 1.00000         osd.0     down        0          1.00000
-3 1.00000     host node1
 1 1.00000         osd.1       up  1.00000          1.00000

Any help with this is appreciated. I followed the steps in these guides:

https://wiki.gentoo.org/wiki/Ceph/Installation#Installation
http://docs.ceph.com/docs/master/install/manual-deployment/#adding-osds
http://www.mad-hacking.net/documentation/linux/ha-cluster/storage-area-network/ceph-additional-nodes.xml
http://blog.widodh.nl/2014/05/deploying-ceph-over-ipv6/

Thanks in advance.

Martin

--
====================================
Ing. Martin Samek
    ICT systems engineer
    FELK Admin

Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Control Engineering
Karlovo namesti 13/E, 121 35 Prague
Czech Republic

e-mail:  samekma1@xxxxxxxxxxx
phone: +420 22435 7599
mobile: +420 605 285 125
====================================


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


