Closing the loop here. I figured out that I had missed a step during the Nautilus upgrade, which was causing this issue:

    ceph osd require-osd-release nautilus

If you don't do this, your cluster will start having problems once you enable msgr2:

    ceph mon enable-msgr2

Based on how hard this was to track down, maybe a check should be added before enabling msgr2 to make sure require-osd-release is set to nautilus?

Bryan

> On Nov 18, 2019, at 5:41 PM, Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
>
> I cranked up debug_ms to 20 on two of these clusters today and I'm still not understanding why some of the clusters use v2 and some just use v1.
>
> Here's the boot/peering process for the cluster which uses v2:
>
> 2019-11-18 16:46:03.027 7fabb6281dc0 0 osd.0 39101 done with init, starting boot process
> 2019-11-18 16:46:03.028 7fabb6281dc0 1 osd.0 39101 start_boot
> 2019-11-18 16:46:03.030 7fabaebac700 5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.3:6800/1473285,v1:10.0.32.3:6801/1473285] conn(0x5596b30c3000 0x5596b4bf4000 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=16 peer_addr_for_me=v2:10.0.32.67:51508/0
> 2019-11-18 16:46:03.034 7faba8116700 1 -- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] --> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] -- osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7 -- 0x5596b4bd6000 con 0x5596b3b06400
> 2019-11-18 16:46:03.034 7faba8116700 5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).send_message enqueueing message m=0x5596b4bd6000 type=71 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).prepare_send_message encoding features 4611087854031667199 0x5596b4bd6000 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700 5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x5596b4bd6000 seq=8 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.352 7fab9d100700 1 osd.0 39104 state: booting -> active
> 2019-11-18 16:46:03.354 7fabaebac700 5 --2- [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> [v2:10.0.32.9:6802/3892454,v1:10.0.32.9:6803/3892454] conn(0x5596b4d68800 0x5596b4bf5080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:45488/0
> 2019-11-18 16:46:03.354 7fabafbae700 5 --2- [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> [v2:10.0.32.142:6810/2881684,v1:10.0.32.142:6811/2881684] conn(0x5596b4d68000 0x5596b4bf4580 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:39044/0
> 2019-11-18 16:46:03.355 7fabaf3ad700 5 --2- >> [v2:10.0.32.67:6814/100535,v1:10.0.32.67:6815/100535] conn(0x5596b4d68400 0x5596b4bf4b00 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:51558/0
> 2019-11-18 16:46:03.355 7fabaf3ad700 1 -- 10.0.32.67:0/258117 learned_addr learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
> 2019-11-18 16:46:03.355 7fabafbae700 5 --2- >> [v2:10.0.32.67:6812/100535,v1:10.0.32.67:6813/100535] conn(0x5596b4d68c00 0x5596b4bf5600 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:40378/0
> 2019-11-18 16:46:03.355 7fabafbae700 1 -- 10.0.32.67:0/258117 learned_addr learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
>
>
> You can see at the end it learns the address to be v2:10.0.32.67:0/0, but compare that to the cluster which uses v1:
>
> 2019-11-18 16:46:05.066 7f9182d8ce00 0 osd.0 46410 done with init, starting boot process
> 2019-11-18 16:46:05.066 7f9182d8ce00 1 osd.0 46410 start_boot
> 2019-11-18 16:46:05.069 7f917becf700 5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.12.131:6800/3507,v1:10.0.12.131:6801/3507] conn(0x56011fc40000 0x56010f0b3b80 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=16 peer_addr_for_me=v2:10.0.13.2:44708/0
> 2019-11-18 16:46:05.072 7f9174c3e700 1 -- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] --> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] -- osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7 -- 0x56011fc26000 con 0x56010f291400
> 2019-11-18 16:46:05.072 7f9174c3e700 5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).send_message enqueueing message m=0x56011fc26000 type=71 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.072 7f917c6d0700 20 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.072 7f917c6d0700 20 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).prepare_send_message encoding features 4611087854031667199 0x56011fc26000 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.072 7f917c6d0700 5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x56011fc26000 seq=8 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.951 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >> conn(0x56011fc40c00 legacy=0x56011fcc1800 unknown :-1 s=STATE_NONE l=0).accept sd=73 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.12.146:50030/0
> 2019-11-18 16:46:05.951 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >> conn(0x56011fc41000 legacy=0x56011fcc3000 unknown :-1 s=STATE_NONE l=0).accept sd=74 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.12.146:45484/0
> 2019-11-18 16:46:05.956 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >> conn(0x56011fc41400 legacy=0x56011fcc3800 unknown :-1 s=STATE_NONE l=0).accept sd=75 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.12.137:50776/0
> 2019-11-18 16:46:05.956 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >> conn(0x56011fc41800 legacy=0x56011fda4000 unknown :-1 s=STATE_NONE l=0).accept sd=76 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.12.137:60880/0
> 2019-11-18 16:46:05.960 7f917c6d0700 10 -- [v2:10.0.13.2:6802/3084510,v1:10.0.13.2:6803/3084510] >> conn(0x56011fc41c00 legacy=0x56011fda4800 unknown :-1 s=STATE_NONE l=0).accept sd=77 listen_addr v1:10.0.13.2:6803/3084510 peer_addr v1:10.0.13.146:57630/0
> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6802/3084510,v1:10.0.13.2:6803/3084510] >> conn(0x56011fdb0000 legacy=0x56011fda5000 unknown :-1 s=STATE_NONE l=0).accept sd=78 listen_addr v1:10.0.13.2:6803/3084510 peer_addr v1:10.0.13.134:58584/0
> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >> conn(0x56011fdb0400 legacy=0x56011fda5800 unknown :-1 s=STATE_NONE l=0).accept sd=79 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.13.134:56662/0
> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >> conn(0x56011fdb0800 legacy=0x56011fda6000 unknown :-1 s=STATE_NONE l=0).accept sd=80 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.13.134:39326/0
> 2019-11-18 16:46:05.977 7f9169c28700 1 osd.0 46413 state: booting -> active
> 2019-11-18 16:46:05.978 7f917b6ce700 1 -- 10.0.13.2:0/3084510 learned_addr learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
> 2019-11-18 16:46:05.979 7f917becf700 1 -- 10.0.13.2:0/3084510 learned_addr learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
>
>
> The learned address is v1:10.0.13.2:0/0. What else can I do to figure out why it's deciding to use the legacy protocol only?
>
> Thanks,
> Bryan
>
>> On Nov 15, 2019, at 4:49 PM, Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
>>
>> I've upgraded 7 of our clusters to Nautilus (14.2.4) and noticed that on some of the clusters (3 out of 7) the OSDs aren't using msgr2 at all.
>> Here's the output for osd.0 on 2 clusters of each type:
>>
>> ### Cluster 1 (v1 only):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v1",
>>       "addr": "10.26.0.33:6809",
>>       "nonce": 4185021
>>     }
>>   ]
>> }
>>
>> ### Cluster 2 (v1 only):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v1",
>>       "addr": "10.197.0.243:6801",
>>       "nonce": 3802140
>>     }
>>   ]
>> }
>>
>> ### Cluster 3 (v1 & v2):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v2",
>>       "addr": "10.32.0.36:6802",
>>       "nonce": 3167
>>     },
>>     {
>>       "type": "v1",
>>       "addr": "10.32.0.36:6804",
>>       "nonce": 3167
>>     }
>>   ]
>> }
>>
>> ### Cluster 4 (v1 & v2):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v2",
>>       "addr": "10.36.0.12:6820",
>>       "nonce": 3150
>>     },
>>     {
>>       "type": "v1",
>>       "addr": "10.36.0.12:6827",
>>       "nonce": 3150
>>     }
>>   ]
>> }
>>
>> All of the mon nodes have the same msgr settings of:
>>
>> # ceph daemon mon.$(hostname -s) config show | grep msgr
>>     "mon_warn_on_msgr2_not_enabled": "true",
>>     "ms_bind_msgr1": "true",
>>     "ms_bind_msgr2": "true",
>>     "ms_msgr2_encrypt_messages": "false",
>>     "ms_msgr2_sign_messages": "false",
>>
>> What could be causing this? All of the clusters are listening on port 3300 for v2 and 6789 for v1. I can even connect to port 3300 on the mon nodes from the OSD nodes.
>>
>> Thanks,
>> Bryan
> _______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
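P.S. Until a guard like the one suggested above exists in Ceph itself, the pre-check can be scripted by hand. Here's a rough sketch that pulls the require_osd_release value out of `ceph osd dump -f json` output before enabling msgr2; the canned JSON string below stands in for live cluster output, so treat the exact field layout as an assumption to verify on your release:

```shell
#!/bin/sh
# Sketch of a pre-msgr2 sanity check. On a real cluster you would use
# live output instead of this canned sample, e.g.:
#   sample=$(ceph osd dump -f json)
sample='{"epoch": 39101, "require_osd_release": "nautilus"}'

# Extract the require_osd_release value with sed (avoids a jq dependency).
release=$(printf '%s' "$sample" | sed -n 's/.*"require_osd_release": *"\([^"]*\)".*/\1/p')

if [ "$release" = "nautilus" ]; then
    echo "require-osd-release is nautilus; OK to run 'ceph mon enable-msgr2'"
else
    echo "run 'ceph osd require-osd-release nautilus' before enabling msgr2"
fi
```

With the sample above, this prints the "OK to run" message; against a cluster that skipped the step, it would tell you to set require-osd-release first.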