On 2018/02/01 11:58 am, Alfredo Deza wrote:
This is the actual command:
/usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new a2ee64a4-b5ba-4ca9-8528-4205f3ad8c99
What that command is trying to do is to tell the monitor about the
newly created OSD. It is easy to replicate this "hanging" problem if
you modify your ceph.conf to point to an invalid IP for the monitors.
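A hang like this is easier to spot if the command runs under a timeout. What follows is only a sketch in Python: run_with_timeout is a made-up helper name, the 30 second default is arbitrary, and none of it is part of ceph or ceph-volume.

```python
import subprocess


def run_with_timeout(argv, stdin_data=b"", timeout=30.0):
    """Run argv with stdin_data on stdin.

    Returns the CompletedProcess, or None if the command did not
    finish within `timeout` seconds (the child is killed in that case).
    """
    try:
        return subprocess.run(
            argv,
            input=stdin_data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
```

Given the argument list of the "osd new" command above, plus whatever is normally piped to "-i -", this would return None instead of blocking forever when the monitors never answer.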
Thank you for confirming that and pointing me in the right direction!
My network configuration appears to be correct (as I understand it,
the "public" network is 172.16.238.0/24 and the cluster network is
172.16.239.0/24, a configuration that works for the other OSDs built
with ceph-ansible/ceph-disk), and I can reach port 6789 on my MON node:
~# ping -c4 172.16.238.11 && ping -c4 172.16.239.11
PING 172.16.238.11 (172.16.238.11) 56(84) bytes of data.
64 bytes from 172.16.238.11: icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from 172.16.238.11: icmp_seq=2 ttl=64 time=0.102 ms
64 bytes from 172.16.238.11: icmp_seq=3 ttl=64 time=0.107 ms
64 bytes from 172.16.238.11: icmp_seq=4 ttl=64 time=0.096 ms
--- 172.16.238.11 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.096/0.111/0.141/0.020 ms
PING 172.16.239.11 (172.16.239.11) 56(84) bytes of data.
64 bytes from 172.16.239.11: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 172.16.239.11: icmp_seq=2 ttl=64 time=0.133 ms
64 bytes from 172.16.239.11: icmp_seq=3 ttl=64 time=0.098 ms
64 bytes from 172.16.239.11: icmp_seq=4 ttl=64 time=0.103 ms
--- 172.16.239.11 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2998ms
rtt min/avg/max/mdev = 0.098/0.146/0.252/0.063 ms
~# telnet 172.16.238.11 6789
Trying 172.16.238.11...
Connected to 172.16.238.11.
Escape character is '^]'.
ceph v027???^?^]quit
telnet> quit
Connection closed.
Is there a command you'd recommend for verifying connectivity from
this new OSD node to the MON node, to help troubleshoot this issue?
You need to make sure you are testing the network with the same
values Ceph is configured with. As in my earlier example, the hang is
easy to replicate if you have an incorrect IP in your ceph.conf.
For instance, ceph.conf might say 10.0.0.1 while you are pinging
10.0.1.0: the ping works, but Ceph is using the incorrect address :)
I don't have a specific command that might get you closer.
I would go through the mon and osd troubleshooting guides:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
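One way to make sure the test uses the same values Ceph does is to read the monitor addresses out of ceph.conf and try a TCP connect to each of them. A rough Python sketch follows. It assumes the monitors are listed on a "mon host" (or "mon_host") line as plain IPv4 addresses, optionally with ":port", and that 6789 is the default port; the function names are made up for illustration, and the path you pass in is up to you.

```python
import re
import socket


def mon_addresses(conf_text, default_port=6789):
    """Return (host, port) pairs found on 'mon host' / 'mon_host' lines."""
    addrs = []
    for line in conf_text.splitlines():
        # Drop trailing comments, then skip anything that is not key = value.
        line = re.split(r"[#;]", line, maxsplit=1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Ceph treats spaces and underscores in option names alike.
        if re.sub(r"[ _]+", "_", key.strip().lower()) != "mon_host":
            continue
        for item in re.split(r"[,\s]+", value.strip()):
            if not item:
                continue
            host, _, port = item.partition(":")
            addrs.append((host, int(port) if port else default_port))
    return addrs


def can_connect(host, port, timeout=3.0):
    """True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_conf(path, timeout=3.0):
    """Return (host, port, reachable) for every monitor listed in `path`."""
    with open(path) as fh:
        text = fh.read()
    return [(h, p, can_connect(h, p, timeout)) for h, p in mon_addresses(text)]
```

Running check_conf on the ceph.conf that the new OSD node actually uses would show whether any configured monitor address differs from the one tested by hand. A successful connect only proves the port is open, not that the monitor accepts the client.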
Thanks again. I'm able to confirm via tcpdump that the osd node is
indeed trying to contact (and reaching) the mon nodes, which respond,
but apparently they aren't sending anything of substance back to the
osd node (based on strace, et al.):
13:02:48.823279 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[.], ack 282, win 219, options [nop,nop,TS val 19431869 ecr
365886589], length 0
13:02:48.823296 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 146:179, ack 282, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 33
13:02:48.823322 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 10:146, ack 282, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 136
13:02:48.823350 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 146:179, ack 282, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 33
13:02:48.823356 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[.], ack 179, win 219, options [nop,nop,TS val 364906985 ecr
19431869], length 0
13:02:48.823380 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 282:316, ack 179, win 219, options [nop,nop,TS val 364906985
ecr 19431869], length 34
13:02:48.823400 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[.], ack 179, win 219, options [nop,nop,TS val 365886589 ecr
19431869], length 0
13:02:48.823423 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 179:187, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 8
13:02:48.823428 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 282:316, ack 179, win 219, options [nop,nop,TS val 365886589
ecr 19431869], length 34
13:02:48.823449 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 179:187, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 8
13:02:48.823478 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 187:343, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 156
13:02:48.823483 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 187:343, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 156
13:02:48.823519 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[.], ack 343, win 227, options [nop,nop,TS val 364906985 ecr
19431869], length 0
13:02:48.823535 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[.], ack 343, win 227, options [nop,nop,TS val 365886589 ecr
19431869], length 0
13:02:48.823569 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 316:325, ack 343, win 227, options [nop,nop,TS val 364906985
ecr 19431869], length 9
13:02:48.823612 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 316:325, ack 343, win 227, options [nop,nop,TS val 365886589
ecr 19431869], length 9
13:02:48.823711 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 325:433, ack 343, win 227, options [nop,nop,TS val 364906985
ecr 19431869], length 108
13:02:48.823736 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[.], ack 433, win 219, options [nop,nop,TS val 19431869 ecr
364906985], length 0
13:02:48.823764 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 325:433, ack 343, win 227, options [nop,nop,TS val 365886589
ecr 19431869], length 108
13:02:48.823839 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[.], ack 433, win 219, options [nop,nop,TS val 19431869 ecr
365886589], length 0
13:02:48.823891 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 343:480, ack 433, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 137
13:02:48.823970 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 343:480, ack 433, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 137
13:02:48.824249 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 433:730, ack 480, win 235, options [nop,nop,TS val 364906985
ecr 19431869], length 297
13:02:48.824347 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 433:730, ack 480, win 235, options [nop,nop,TS val 365886589
ecr 19431869], length 297
13:02:48.824423 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 480:766, ack 730, win 227, options [nop,nop,TS val 19431869
ecr 364906985], length 286
13:02:48.824536 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 480:766, ack 730, win 227, options [nop,nop,TS val 19431869
ecr 365886589], length 286
13:02:48.824970 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 730:1401, ack 766, win 244, options [nop,nop,TS val
364906985 ecr 19431869], length 671
13:02:48.825004 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 730:1401, ack 766, win 244, options [nop,nop,TS val
365886589 ecr 19431869], length 671
13:02:48.825166 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[F.], seq 766, ack 1401, win 238, options [nop,nop,TS val 19431869 ecr
365886589], length 0
172.16.238.21 is the OSD, 172.16.238.11 and 172.16.238.12 are MONs.
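To see how much the monitors actually send back, the payload lengths in a capture like the one above can be totalled per direction. This is a rough Python sketch, assuming IPv4 addresses and the tcpdump text layout shown above; the function name is made up for illustration. It re-joins the lines that the mail client wrapped before matching.

```python
import re
from collections import defaultdict

# One packet: timestamp, "IP src.port > dst.port:", flags, ..., "length N".
PACKET_RE = re.compile(
    r"\d\d:\d\d:\d\d\.\d+ IP "
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): "
    r"Flags \[[^\]]*\].*?length (\d+)"
)


def payload_by_direction(capture_text):
    """Sum TCP payload bytes per (source, destination) pair."""
    # Collapse all whitespace so packets wrapped over several lines match.
    flat = " ".join(capture_text.split())
    totals = defaultdict(int)
    for src, sport, dst, dport, length in PACKET_RE.findall(flat):
        totals[(f"{src}:{sport}", f"{dst}:{dport}")] += int(length)
    return dict(totals)
```

In the capture above the monitors send segments of 297 and 671 bytes, so the replies carry more than bare ACKs. Whether their content is what the client expects is a separate question that byte counts cannot answer.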
Perhaps then, this is a bug within ceph? I wonder if there are
verbose logs on the mon that might show something or perhaps I can
trace something there?
In any case, I'm going to go through the troubleshooting guides and
see if they're of any help here. Otherwise, I may try
ceph-ansible/stable-3.0 with ceph-disk (since this "worked" with the
other OSDs, meaning ceph-ansible was able to complete without this
"hang") before I just tear it all down and try to build out manually.