Re: msgr2 not used on OSDs in some Nautilus clusters

Closing the loop here.  It turns out I missed a step during the Nautilus upgrade, which was causing this issue:

ceph osd require-osd-release nautilus
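
You can check what a cluster is currently set to with:

ceph osd dump | grep require_osd_release

On a fully upgraded cluster this prints "require_osd_release nautilus"; a cluster that missed the step will still show the previous release.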

If you don't do this, your cluster will start having problems once you enable msgr2:

ceph mon enable-msgr2
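
Once that's enabled, every mon should advertise both protocols, and ceph mon dump should list each mon with both a v2 (port 3300) and a v1 (port 6789) address, along these lines (the mon name here is just an example):

0: [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] mon.a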

Based on how hard this was to track down, maybe a check should be added before enabling msgr2 to make sure require-osd-release is already set to nautilus?
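
Even something as simple as this shell sketch of the idea would have saved me a lot of digging:

ceph osd dump | grep -q 'require_osd_release nautilus' \
    && ceph mon enable-msgr2 \
    || echo "run 'ceph osd require-osd-release nautilus' first"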

Bryan

> On Nov 18, 2019, at 5:41 PM, Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
> 
> I cranked debug_ms up to 20 on two of these clusters today, and I still don't understand why some of the clusters use v2 while others only use v1.
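> 
> (For reference, I bumped the debug level at runtime with something like
> "ceph tell osd.0 injectargs '--debug_ms 20'".)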
> 
> Here's the boot/peering process for the cluster which uses v2:
> 
> 2019-11-18 16:46:03.027 7fabb6281dc0  0 osd.0 39101 done with init, starting boot process
> 2019-11-18 16:46:03.028 7fabb6281dc0  1 osd.0 39101 start_boot
> 2019-11-18 16:46:03.030 7fabaebac700  5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.3:6800/1473285,v1:10.0.32.3:6801/1473285] conn(0x5596b30c3000 0x5596b4bf4000 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=16 peer_addr_for_me=v2:10.0.32.67:51508/0
> 2019-11-18 16:46:03.034 7faba8116700  1 -- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] --> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] -- osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7 -- 0x5596b4bd6000 con 0x5596b3b06400
> 2019-11-18 16:46:03.034 7faba8116700  5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).send_message enqueueing message m=0x5596b4bd6000 type=71 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).prepare_send_message encoding features 4611087854031667199 0x5596b4bd6000 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.034 7fabaf3ad700  5 --2- [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x5596b4bd6000 seq=8 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> 2019-11-18 16:46:03.352 7fab9d100700  1 osd.0 39104 state: booting -> active
> 2019-11-18 16:46:03.354 7fabaebac700  5 --2- [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> [v2:10.0.32.9:6802/3892454,v1:10.0.32.9:6803/3892454] conn(0x5596b4d68800 0x5596b4bf5080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:45488/0
> 2019-11-18 16:46:03.354 7fabafbae700  5 --2- [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> [v2:10.0.32.142:6810/2881684,v1:10.0.32.142:6811/2881684] conn(0x5596b4d68000 0x5596b4bf4580 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:39044/0
> 2019-11-18 16:46:03.355 7fabaf3ad700  5 --2-  >> [v2:10.0.32.67:6814/100535,v1:10.0.32.67:6815/100535] conn(0x5596b4d68400 0x5596b4bf4b00 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:51558/0
> 2019-11-18 16:46:03.355 7fabaf3ad700  1 -- 10.0.32.67:0/258117 learned_addr learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
> 2019-11-18 16:46:03.355 7fabafbae700  5 --2-  >> [v2:10.0.32.67:6812/100535,v1:10.0.32.67:6813/100535] conn(0x5596b4d68c00 0x5596b4bf5600 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=4 peer_addr_for_me=v2:10.0.32.67:40378/0
> 2019-11-18 16:46:03.355 7fabafbae700  1 -- 10.0.32.67:0/258117 learned_addr learned my addr 10.0.32.67:0/258117 (peer_addr_for_me v2:10.0.32.67:0/0)
> 
> 
> You can see at the end that it learns its address as v2:10.0.32.67:0/0.  Compare that to the cluster which only uses v1:
> 
> 2019-11-18 16:46:05.066 7f9182d8ce00  0 osd.0 46410 done with init, starting boot process
> 2019-11-18 16:46:05.066 7f9182d8ce00  1 osd.0 46410 start_boot
> 2019-11-18 16:46:05.069 7f917becf700  5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.12.131:6800/3507,v1:10.0.12.131:6801/3507] conn(0x56011fc40000 0x56010f0b3b80 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).handle_hello received hello: peer_type=16 peer_addr_for_me=v2:10.0.13.2:44708/0
> 2019-11-18 16:46:05.072 7f9174c3e700  1 -- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] --> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] -- osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7 -- 0x56011fc26000 con 0x56010f291400
> 2019-11-18 16:46:05.072 7f9174c3e700  5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).send_message enqueueing message m=0x56011fc26000 type=71 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.072 7f917c6d0700 20 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.072 7f917c6d0700 20 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).prepare_send_message encoding features 4611087854031667199 0x56011fc26000 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.072 7f917c6d0700  5 --2- [v2:10.0.13.2:6800/3084510,v1:10.0.13.2:6801/3084510] >> [v2:10.0.13.137:3300/0,v1:10.0.13.137:6789/0] conn(0x56010f291400 0x56010f0b3080 crc :-1 s=READY pgs=9200660 cs=0 l=1 rx=0 tx=0).write_message sending message m=0x56011fc26000 seq=8 osd_boot(osd.0 booted 0 features 4611087854031667199 v46410) v7
> 2019-11-18 16:46:05.951 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >>  conn(0x56011fc40c00 legacy=0x56011fcc1800 unknown :-1 s=STATE_NONE l=0).accept sd=73 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.12.146:50030/0
> 2019-11-18 16:46:05.951 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >>  conn(0x56011fc41000 legacy=0x56011fcc3000 unknown :-1 s=STATE_NONE l=0).accept sd=74 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.12.146:45484/0
> 2019-11-18 16:46:05.956 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >>  conn(0x56011fc41400 legacy=0x56011fcc3800 unknown :-1 s=STATE_NONE l=0).accept sd=75 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.12.137:50776/0
> 2019-11-18 16:46:05.956 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >>  conn(0x56011fc41800 legacy=0x56011fda4000 unknown :-1 s=STATE_NONE l=0).accept sd=76 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.12.137:60880/0
> 2019-11-18 16:46:05.960 7f917c6d0700 10 -- [v2:10.0.13.2:6802/3084510,v1:10.0.13.2:6803/3084510] >>  conn(0x56011fc41c00 legacy=0x56011fda4800 unknown :-1 s=STATE_NONE l=0).accept sd=77 listen_addr v1:10.0.13.2:6803/3084510 peer_addr v1:10.0.13.146:57630/0
> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6802/3084510,v1:10.0.13.2:6803/3084510] >>  conn(0x56011fdb0000 legacy=0x56011fda5000 unknown :-1 s=STATE_NONE l=0).accept sd=78 listen_addr v1:10.0.13.2:6803/3084510 peer_addr v1:10.0.13.134:58584/0
> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6806/3084510,v1:10.0.13.2:6807/3084510] >>  conn(0x56011fdb0400 legacy=0x56011fda5800 unknown :-1 s=STATE_NONE l=0).accept sd=79 listen_addr v1:10.0.13.2:6807/3084510 peer_addr v1:10.0.13.134:56662/0
> 2019-11-18 16:46:05.974 7f917c6d0700 10 -- [v2:10.0.13.2:6804/3084510,v1:10.0.13.2:6805/3084510] >>  conn(0x56011fdb0800 legacy=0x56011fda6000 unknown :-1 s=STATE_NONE l=0).accept sd=80 listen_addr v1:10.0.13.2:6805/3084510 peer_addr v1:10.0.13.134:39326/0
> 2019-11-18 16:46:05.977 7f9169c28700  1 osd.0 46413 state: booting -> active
> 2019-11-18 16:46:05.978 7f917b6ce700  1 -- 10.0.13.2:0/3084510 learned_addr learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
> 2019-11-18 16:46:05.979 7f917becf700  1 -- 10.0.13.2:0/3084510 learned_addr learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
> 
> 
> Here the learned address is v1:10.0.13.2:0/0.  What else can I do to figure out why it's deciding to use only the legacy protocol?
> 
> Thanks,
> Bryan
> 
>> On Nov 15, 2019, at 4:49 PM, Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
>> 
>> I've upgraded 7 of our clusters to Nautilus (14.2.4) and noticed that on some of the clusters (3 out of 7) the OSDs aren't using msgr2 at all.  Here's the output for osd.0 on 2 clusters of each type:
>> 
>> ### Cluster 1 (v1 only):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v1",
>>       "addr": "10.26.0.33:6809",
>>       "nonce": 4185021
>>     }
>>   ]
>> }
>> 
>> ### Cluster 2 (v1 only):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v1",
>>       "addr": "10.197.0.243:6801",
>>       "nonce": 3802140
>>     }
>>   ]
>> }
>> 
>> ### Cluster 3 (v1 & v2):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v2",
>>       "addr": "10.32.0.36:6802",
>>       "nonce": 3167
>>     },
>>     {
>>       "type": "v1",
>>       "addr": "10.32.0.36:6804",
>>       "nonce": 3167
>>     }
>>   ]
>> }
>> 
>> ### Cluster 4 (v1 & v2):
>> # ceph osd find 0 | jq -r '.addrs'
>> {
>>   "addrvec": [
>>     {
>>       "type": "v2",
>>       "addr": "10.36.0.12:6820",
>>       "nonce": 3150
>>     },
>>     {
>>       "type": "v1",
>>       "addr": "10.36.0.12:6827",
>>       "nonce": 3150
>>     }
>>   ]
>> }
>> 
>> 
>> All of the mon nodes have the same msgr settings:
>> 
>> # ceph daemon mon.$(hostname -s) config show | grep msgr
>>   "mon_warn_on_msgr2_not_enabled": "true",
>>   "ms_bind_msgr1": "true",
>>   "ms_bind_msgr2": "true",
>>   "ms_msgr2_encrypt_messages": "false",
>>   "ms_msgr2_sign_messages": "false",
>> 
>> 
>> What could be causing this?  All of the clusters are listening on port 3300 for v2 and 6789 for v1.  I can even connect to port 3300 on the mon nodes from the OSD nodes.
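>> 
>> (The connectivity check was just netcat from an OSD node, along the lines of
>> "nc -zv <mon-ip> 3300", where <mon-ip> is one of the mon addresses.)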
>> 
>> Thanks,
>> Bryan
> 