Re: Bluestores+LVM via ceph-volume in Luminous?

On 2018/02/01 1:42 pm, Andre Goree wrote:
On 2018/02/01 1:17 pm, Andre Goree wrote:
On 2018/02/01 11:58 am, Alfredo Deza wrote:
This is the actual command:

/usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
a2ee64a4-b5ba-4ca9-8528-4205f3ad8c99

That command tells the monitor about the newly created OSD. It is
easy to replicate this "hanging" problem if you modify your ceph.conf
to point at an invalid IP for the monitors.
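
(A quick way to tell "mon unreachable" apart from anything else -- a
sketch using the ceph CLI's global --connect-timeout option and the
same bootstrap-osd keyring as above; even a permission error returned
promptly proves the mon answered, whereas a hang points at networking:)

# fail fast instead of hanging if the mon cannot be reached
ceph --cluster ceph --name client.bootstrap-osd \
     --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring \
     --connect-timeout 10 mon stat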



Thank you for confirming that and pointing me in the right direction!

It would appear my network configuration is correct: the "public" network is 172.16.238.0/24 and the cluster network is 172.16.239.0/24 -- the same configuration that works for the other OSDs built with
ceph-ansible/ceph-disk -- and I can reach port 6789 on my MON node:

~# ping -c4 172.16.238.11 && ping -c4 172.16.239.11
PING 172.16.238.11 (172.16.238.11) 56(84) bytes of data.
64 bytes from 172.16.238.11: icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from 172.16.238.11: icmp_seq=2 ttl=64 time=0.102 ms
64 bytes from 172.16.238.11: icmp_seq=3 ttl=64 time=0.107 ms
64 bytes from 172.16.238.11: icmp_seq=4 ttl=64 time=0.096 ms

--- 172.16.238.11 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.096/0.111/0.141/0.020 ms
PING 172.16.239.11 (172.16.239.11) 56(84) bytes of data.
64 bytes from 172.16.239.11: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 172.16.239.11: icmp_seq=2 ttl=64 time=0.133 ms
64 bytes from 172.16.239.11: icmp_seq=3 ttl=64 time=0.098 ms
64 bytes from 172.16.239.11: icmp_seq=4 ttl=64 time=0.103 ms

--- 172.16.239.11 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2998ms
rtt min/avg/max/mdev = 0.098/0.146/0.252/0.063 ms
~# telnet 172.16.238.11 6789
Trying 172.16.238.11...
Connected to 172.16.238.11.
Escape character is '^]'.
ceph v027
^]

telnet> quit
Connection closed.


Is there a command you'd recommend for verifying connectivity from this new OSD node to the MON node, to help troubleshoot the issue
I'm having?

You need to make sure you are checking connectivity against the same
values Ceph is actually configured with. As in my example before, the
hang is easy to replicate if you have an incorrect IP in your ceph.conf.

For example, ceph.conf might say 10.0.0.1 while the monitor actually
answers at 10.0.1.0: your ping to 10.0.1.0 works, but Ceph is still
using the incorrect address :)

I don't have a specific command that might get you closer.

I would go through the mon and osd troubleshooting guides

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
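
That said, it can't hurt to print what the client side actually
resolves from the config (a sketch; ceph-conf ships with Ceph, and the
option names here assume a standard ceph.conf):

# show the monitor address(es) and networks the client will use
ceph-conf -c /etc/ceph/ceph.conf --lookup mon_host
ceph-conf -c /etc/ceph/ceph.conf --lookup public_network
ceph-conf -c /etc/ceph/ceph.conf --lookup cluster_network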


Thanks again.  Via tcpdump I can confirm that the osd node is indeed
reaching the mon nodes and that they respond, but apparently nothing
of substance comes back to the osd node (based on strace, etc.):
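
(The capture was along these lines -- a sketch; the interface name is
an assumption:)

# watch traffic to/from the mon port; -nn keeps addresses and ports numeric
tcpdump -i eth0 -nn 'tcp port 6789'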

13:02:48.823279 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[.], ack 282, win 219, options [nop,nop,TS val 19431869 ecr
365886589], length 0
13:02:48.823296 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 146:179, ack 282, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 33
13:02:48.823322 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 10:146, ack 282, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 136
13:02:48.823350 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 146:179, ack 282, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 33
13:02:48.823356 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[.], ack 179, win 219, options [nop,nop,TS val 364906985 ecr
19431869], length 0
13:02:48.823380 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 282:316, ack 179, win 219, options [nop,nop,TS val 364906985
ecr 19431869], length 34
13:02:48.823400 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[.], ack 179, win 219, options [nop,nop,TS val 365886589 ecr
19431869], length 0
13:02:48.823423 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 179:187, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 8
13:02:48.823428 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 282:316, ack 179, win 219, options [nop,nop,TS val 365886589
ecr 19431869], length 34
13:02:48.823449 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 179:187, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 8
13:02:48.823478 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 187:343, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 156
13:02:48.823483 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 187:343, ack 316, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 156
13:02:48.823519 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[.], ack 343, win 227, options [nop,nop,TS val 364906985 ecr
19431869], length 0
13:02:48.823535 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[.], ack 343, win 227, options [nop,nop,TS val 365886589 ecr
19431869], length 0
13:02:48.823569 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 316:325, ack 343, win 227, options [nop,nop,TS val 364906985
ecr 19431869], length 9
13:02:48.823612 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 316:325, ack 343, win 227, options [nop,nop,TS val 365886589
ecr 19431869], length 9
13:02:48.823711 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 325:433, ack 343, win 227, options [nop,nop,TS val 364906985
ecr 19431869], length 108
13:02:48.823736 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[.], ack 433, win 219, options [nop,nop,TS val 19431869 ecr
364906985], length 0
13:02:48.823764 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 325:433, ack 343, win 227, options [nop,nop,TS val 365886589
ecr 19431869], length 108
13:02:48.823839 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[.], ack 433, win 219, options [nop,nop,TS val 19431869 ecr
365886589], length 0
13:02:48.823891 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 343:480, ack 433, win 219, options [nop,nop,TS val 19431869
ecr 364906985], length 137
13:02:48.823970 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 343:480, ack 433, win 219, options [nop,nop,TS val 19431869
ecr 365886589], length 137
13:02:48.824249 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 433:730, ack 480, win 235, options [nop,nop,TS val 364906985
ecr 19431869], length 297
13:02:48.824347 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 433:730, ack 480, win 235, options [nop,nop,TS val 365886589
ecr 19431869], length 297
13:02:48.824423 IP 172.16.238.21.46962 > 172.16.238.13.6789: Flags
[P.], seq 480:766, ack 730, win 227, options [nop,nop,TS val 19431869
ecr 364906985], length 286
13:02:48.824536 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[P.], seq 480:766, ack 730, win 227, options [nop,nop,TS val 19431869
ecr 365886589], length 286
13:02:48.824970 IP 172.16.238.13.6789 > 172.16.238.21.46962: Flags
[P.], seq 730:1401, ack 766, win 244, options [nop,nop,TS val
364906985 ecr 19431869], length 671
13:02:48.825004 IP 172.16.238.11.6789 > 172.16.238.21.35578: Flags
[P.], seq 730:1401, ack 766, win 244, options [nop,nop,TS val
365886589 ecr 19431869], length 671
13:02:48.825166 IP 172.16.238.21.35578 > 172.16.238.11.6789: Flags
[F.], seq 766, ack 1401, win 238, options [nop,nop,TS val 19431869 ecr
365886589], length 0

172.16.238.21 is the OSD; 172.16.238.11 and 172.16.238.13 are MONs.
Perhaps, then, this is a bug within Ceph?  I wonder whether verbose
logs on the mon might show something, or whether I can trace something
there.
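
(For reference, raising mon verbosity can be done on the mon host via
the admin socket -- a sketch, assuming the mon is named mon-01 as in
the log below:)

# bump monitor and messenger debug levels on the running mon
ceph daemon mon.mon-01 config set debug_mon 10/10
ceph daemon mon.mon-01 config set debug_ms 1/1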

In any case, I'm going to go through the troubleshooting guides and
see if they're any help here.  Otherwise I may try
ceph-ansible/stable-3.0 with ceph-disk (since that "worked" for the
other OSDs, meaning ceph-ansible was able to complete without this
"hang") before I tear it all down and try to build out manually.





I turned up the debug logging on the MON node, and oddly enough I do
see interaction with the OSD node that corresponds to my tcpdump data:
2018-02-01 13:37:49.503808 7f52e2ec9700 10 mon.mon-01@0(leader) e1
ms_handle_reset 0x55d839ece000 172.16.238.21:0/2002450442
2018-02-01 13:37:49.503832 7f52e2ec9700 10 mon.mon-01@0(leader) e1
reset/close on session client.48392 172.16.238.21:0/2002450442
2018-02-01 13:37:49.503838 7f52e2ec9700 10 mon.mon-01@0(leader) e1
remove_session 0x55d8472ace00 client.48392 172.16.238.21:0/2002450442
features 0x1ffddff8eea4fffb
2018-02-01 13:37:49.504116 7f52deec1700 10 mon.mon-01@0(leader) e1
ms_verify_authorizer 172.16.238.21:0/2002450442 client protocol 0
2018-02-01 13:37:49.504379 7f52e2ec9700 10 mon.mon-01@0(leader) e1
_ms_dispatch new session 0x55d8472ace00 MonSession(client.48392
172.16.238.21:0/2002450442 is open ) features 0x1ffddff8eea4fffb
2018-02-01 13:37:49.504408 7f52e2ec9700 10 mon.mon-01@0(leader).auth
v526 preprocess_query auth(proto 0 42 bytes epoch 1) v1 from
client.48392 172.16.238.21:0/2002450442
2018-02-01 13:37:49.504418 7f52e2ec9700 10 mon.mon-01@0(leader).auth
v526 prep_auth() blob_size=42
2018-02-01 13:37:49.504458 7f52e2ec9700  2 mon.mon-01@0(leader) e1
send_reply 0x55d82f901b80 0x55d82fb66d00 auth_reply(proto 2 0 (0)
Success) v1
2018-02-01 13:37:49.504872 7f52e2ec9700 10 mon.mon-01@0(leader).auth
v526 preprocess_query auth(proto 2 32 bytes epoch 0) v1 from
client.48392 172.16.238.21:0/2002450442
2018-02-01 13:37:49.504882 7f52e2ec9700 10 mon.mon-01@0(leader).auth
v526 prep_auth() blob_size=32
2018-02-01 13:37:49.505040 7f52e2ec9700  2 mon.mon-01@0(leader) e1
send_reply 0x55d82f901b80 0x55d847be4580 auth_reply(proto 2 0 (0)
Success) v1
2018-02-01 13:37:49.505561 7f52e2ec9700 10 mon.mon-01@0(leader).auth
v526 preprocess_query auth(proto 2 181 bytes epoch 0) v1 from
client.48392 172.16.238.21:0/2002450442
2018-02-01 13:37:49.505571 7f52e2ec9700 10 mon.mon-01@0(leader).auth
v526 prep_auth() blob_size=181
2018-02-01 13:37:49.505835 7f52e2ec9700  2 mon.mon-01@0(leader) e1
send_reply 0x55d82f901b80 0x55d856598080 auth_reply(proto 2 0 (0)
Success) v1
2018-02-01 13:37:49.506632 7f52e2ec9700 10 mon.mon-01@0(leader) e1
handle_subscribe mon_subscribe({mgrmap=0+,monmap=2+,osdmap=0}) v2
2018-02-01 13:37:49.506673 7f52e2ec9700 10 mon.mon-01@0(leader).monmap
v1 check_sub monmap next 2 have 1
2018-02-01 13:37:49.506684 7f52e2ec9700 10 mon.mon-01@0(leader).osd
e4112 check_osdmap_sub 0x55d84a963ba0 next 0 (onetime)
2018-02-01 13:37:49.510809 7f52e2ec9700  2 mon.mon-01@0(leader) e1
send_reply 0x55d832e04000 0x55d84178d600 mon_command_ack([{"prefix":
"get_command_descriptions"}]=0  v0) v1





FWIW, this indeed turned out to be a network configuration issue: the MTU differed between the machine and the switch. Just thought I'd come back and mention it for anyone else who runs into something like this :)
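
For anyone checking for the same thing: a don't-fragment ping sized
just under the expected MTU catches this kind of mismatch quickly (a
sketch, assuming a 9000-byte jumbo MTU and an interface named eth0;
8972 = 9000 minus 28 bytes of IP + ICMP headers):

# confirm the interface MTU
ip link show dev eth0
# ping with the don't-fragment bit set; this fails if anything in the
# path (e.g. the switch) has a smaller MTU
ping -M do -s 8972 -c 3 172.16.238.11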

--
Andre Goree
-=-=-=-=-=-
Email     - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


