Hi, all nodes are in one VLAN connected to a single switch. Connectivity is OK, MTU is 1500, and I can transfer data over netcat and mbuffer at 660 Mbps.
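For reference, the throughput test looked roughly like this (a sketch, assuming a version of mbuffer with built-in network mode and a free TCP port; exact flags vary by mbuffer version):

  # on node2: receive on TCP port 5001 and discard the data
  node2# mbuffer -I 5001 > /dev/null

  # on node1: push 1 GB of zeroes across and let mbuffer report the rate
  node1# dd if=/dev/zero bs=1M count=1024 | mbuffer -O node2:5001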
With debug_ms there is nothing interesting:

  /usr/bin/ceph-osd --debug_ms 100 -f -i 0 --pid-file /run/ceph/osd.0.pid -c /etc/ceph/ceph.conf
  starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
  2015-12-29 00:18:05.878954 7fd9892e7800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
  2015-12-29 00:18:05.899633 7fd9892e7800 -1 osd.0 24 log_to_monitors {default=true}
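(For completeness: the messenger debug level can usually also be raised on a live daemon instead of restarting it; a sketch, assuming the admin keyring is available on the node:)

  # bump messenger debugging on the running osd.0, in memory only
  ceph tell osd.0 injectargs '--debug-ms 1'

  # or make it persistent across restarts in /etc/ceph/ceph.conf:
  [osd]
  debug ms = 1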
Thanks,
Martin

On 29. 12. 2015 at 00:08, Somnath Roy wrote:

It could be a network issue, maybe related to MTU (?). Try running with debug_ms = 1 and see if you find anything. Also, try running a command like 'traceroute' and see if it reports any errors.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Ing. Martin Samek
Sent: Monday, December 28, 2015 2:59 PM
To: Ceph Users
Subject: My OSDs are down and not coming UP

Hi,

I'm a newbie to the Ceph world. I'm trying to set up my first test Ceph cluster. My MON servers are running and talking to each other, but my OSDs are still down and won't come up. In fact, only the OSD running on the same node as the elected master is able to connect and come UP.

To be technical: I have 4 physical nodes living in a pure IPv6 environment, running Gentoo Linux and Ceph 9.2. All node names are resolvable in DNS and also listed in the hosts files.

I'm running the OSDs with a command like this:

  node1# /usr/bin/ceph-osd -f -i 1 --pid-file /run/ceph/osd.1.pid -c /etc/ceph/ceph.conf

A single mon.0 is also running on node1, and osd.1 comes up:

  2015-12-28 23:37:27.931686 mon.0 [INF] osd.1 [2001:718:2:1612::50]:6800/23709 boot
  2015-12-28 23:37:27.932605 mon.0 [INF] osdmap e19: 2 osds: 1 up, 1 in
  2015-12-28 23:37:27.933963 mon.0 [INF] pgmap v24: 64 pgs: 64 stale+active+undersized+degraded; 0 bytes data, 1057 MB used, 598 GB / 599 GB avail

But running osd.0 on node2:

  # /usr/bin/ceph-osd -f -i 0 --pid-file /run/ceph/osd.0.pid -c /etc/ceph/ceph.conf

does nothing. The process is running, and netstat shows an open connection from ceph-osd between node2 and node1. Here I'm lost. IPv6 connectivity is OK, DNS is OK, time is in sync, 1 mon is running, there are 2 OSDs but only one is UP. What is missing?

ceph-osd in debug mode shows differences between node1 and node2.

node1, UP:

  2015-12-28 01:42:59.084371 7f72f9873800 20 osd.1 15 clearing temps in 0.3f_head pgid 0.3f
  2015-12-28 01:42:59.084453 7f72f9873800  0 osd.1 15 load_pgs
  2015-12-28 01:42:59.085248 7f72f9873800 10 osd.1 15 load_pgs ignoring unrecognized meta
  2015-12-28 01:42:59.094690 7f72f9873800 10 osd.1 15 pgid 0.0 coll 0.0_head
  2015-12-28 01:42:59.094835 7f72f9873800 30 osd.1 0 get_map 15 -cached
  2015-12-28 01:42:59.094848 7f72f9873800 10 osd.1 15 _open_lock_pg 0.0
  2015-12-28 01:42:59.094857 7f72f9873800 10 osd.1 15 _get_pool 0
  2015-12-28 01:42:59.094928 7f72f9873800  5 osd.1 pg_epoch: 15 pg[0.0(unlocked)] enter Initial
  2015-12-28 01:42:59.094980 7f72f9873800 20 osd.1 pg_epoch: 15 pg[0.0(unlocked)] enter NotTrimming
  2015-12-28 01:42:59.094998 7f72f9873800 30 osd.1 pg_epoch: 15 pg[0.0( DNE empty local-les=0 n=0 ec=0 les/c/f 0/0/0 0/0/0) [] r=0 lpr=0 crt=0'0 inactive NIBBLEW
  2015-12-28 01:42:59.095186 7f72f9873800 20 read_log coll 0.0_head log_oid 0/00000000//head

node2, DOWN:

  2015-12-28 01:36:54.437246 7f4507957800  0 osd.0 11 load_pgs
  2015-12-28 01:36:54.437267 7f4507957800 10 osd.0 11 load_pgs ignoring unrecognized meta
  2015-12-28 01:36:54.437274 7f4507957800  0 osd.0 11 load_pgs opened 0 pgs
  2015-12-28 01:36:54.437278 7f4507957800 10 osd.0 11 build_past_intervals_parallel nothing to build
  2015-12-28 01:36:54.437282 7f4507957800  2 osd.0 11 superblock: i am osd.0
  2015-12-28 01:36:54.437287 7f4507957800 10 osd.0 11 create_logger
  2015-12-28 01:36:54.438157 7f4507957800 -1 osd.0 11 log_to_monitors {default=true}
  2015-12-28 01:36:54.449278 7f4507957800 10 osd.0 11 set_disk_tp_priority class priority -1
  2015-12-28 01:36:54.450813 7f44ddbff700 30 osd.0 11 heartbeat
  2015-12-28 01:36:54.452558 7f44ddbff700 30 osd.0 11 heartbeat checking stats
  2015-12-28 01:36:54.452592 7f44ddbff700 20 osd.0 11 update_osd_stat osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist [])
  2015-12-28 01:36:54.452611 7f44ddbff700  5 osd.0 11 heartbeat: osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist [])
  2015-12-28 01:36:54.452618 7f44ddbff700 30 osd.0 11 heartbeat check
  2015-12-28 01:36:54.452622 7f44ddbff700 30 osd.0 11 heartbeat lonely?
  2015-12-28 01:36:54.452624 7f44ddbff700 30 osd.0 11 heartbeat done
  2015-12-28 01:36:54.452627 7f44ddbff700 30 osd.0 11 heartbeat_entry sleeping for 2.3
  2015-12-28 01:36:54.452588 7f44da7fc700 10 osd.0 11 agent_entry start
  2015-12-28 01:36:54.453338 7f44da7fc700 20 osd.0 11 agent_entry empty queue
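Note the empty "peers []/[]" in the heartbeat lines, which suggests osd.0 never acquires any heartbeat peers. Two read-only commands that can help confirm what addresses and states the cluster has recorded (a sketch; the second must run on node2, where osd.0's admin socket lives):

  # what the monitors believe: OSD up/down states and registered IPv6 addresses
  ceph osd dump

  # what the stuck daemon itself reports, via its local admin socket
  ceph daemon osd.0 status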
My ceph.conf looks like this:

  [global]
  fsid = b186d870-9c6d-4a8b-ac8a-e263f4c205da
  ms_bind_ipv6 = true
  public_network = xxxx:xxxx:2:1612::/64
  mon initial members = 0
  mon host = [xxxx:xxxx:2:1612::50]:6789
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  osd pool default size = 2
  osd pool default min size = 1
  osd journal size = 1024
  osd mkfs type = xfs
  osd mount options xfs = rw,inode64
  osd crush chooseleaf type = 1

  [mon.0]
  host = node1
  mon addr = [xxxx:xxxx:2:1612::50]:6789

  [mon.1]
  host = node3
  mon addr = [xxxx:xxxx:2:1612::30]:6789

  [osd.0]
  host = node2
  devs = /dev/vg0/osd0

  [osd.1]
  host = node1
  devs = /dev/vg0/osd

My ceph osd tree:

  node1 # ceph osd tree
  ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
  -1 2.00000 root default
  -2 1.00000     host node2
   0 1.00000         osd.0      down        0          1.00000
  -3 1.00000     host node1
   1 1.00000         osd.1        up  1.00000          1.00000

Any help on how to cope with this is appreciated. I followed the steps in these guides:

  https://wiki.gentoo.org/wiki/Ceph/Installation#Installation
  http://docs.ceph.com/docs/master/install/manual-deployment/#adding-osds
  http://www.mad-hacking.net/documentation/linux/ha-cluster/storage-area-network/ceph-additional-nodes.xml
  http://blog.widodh.nl/2014/05/deploying-ceph-over-ipv6/

Thanks in advance.
Martin

--
====================================
Ing. Martin Samek
ICT systems engineer
FELK Admin
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Control Engineering
Karlovo namesti 13/E, 121 35 Prague
Czech Republic
e-mail: samekma1@xxxxxxxxxxx
phone:  +420 22435 7599
mobile: +420 605 285 125
====================================