Re: Osds going down/flapping after Luminous to Nautilus upgrade part 1

Eugen Block <eblock@xxxxxx> · Tue, 06 Aug 2024 11:25:20 +0000

Hi,

the upgrade notes for Nautilus [0] contain this section:

Running nautilus OSDs will not bind to their v2 address  
automatically. They must be restarted for that to happen.
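
A minimal sketch of that restart step, assuming the OSDs are managed by systemd under unit names of the form ceph-osd@<id> and that you want noout set while they restart. Both are assumptions about your setup, and the helper name print_restart_plan is made up for illustration. It only prints the commands, so the plan can be reviewed before anything is run:

```shell
#!/bin/sh
# Dry run: print (do not execute) the commands for restarting the given OSDs.
# Assumption: OSDs run as systemd units named ceph-osd@<id>.
print_restart_plan() {
    # Setting noout keeps the cluster from rebalancing during the restarts.
    echo "ceph osd set noout"
    for id in "$@"; do
        echo "systemctl restart ceph-osd@${id}"
    done
    echo "ceph osd unset noout"
}

# Example: show the plan for OSDs 12 and 13.
print_restart_plan 12 13
```

In practice you would wait for each OSD to come back up before restarting the next one.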

Regards,
Eugen

[0] https://docs.ceph.com/en/latest/releases/nautilus/#instructions

Quoting Mark Kirkwood <markkirkwood@xxxxxxxxxxxxxxxx>:

We have upgraded one of our Ceph clusters from Luminous to Nautilus
and have run into two issues that cause OSDs to flap. I'll cover the
first one here. We solved it, but it raises an interesting question
that might bear on the second one (which I will post next).

After upgrading the deb packages to Nautilus and restarting the mons
and mgrs, we worked through restarting the OSDs. Some of them started
to flap, and the OSD log showed this line many times:

2024-07-31 11:03:33.264 7f22ab6e0700  0 --1- [v2:[2404:130:8020:5::73]:6820/220732,v1:[2404:130:8020:5::73]:6821/220732] >> v1:[2404:130:8020:5::103]:6909/2987374 conn(0x555a5bbbb180 0x555a4172c800 :-1 s=OPENED pgs=144 cs=3 l=0).fault initiating reconnect

And later (usually a single line):

2404:130:8020:5::135]:6903/2933993 conn(0x555ab7b03200 0x555a099fa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=154 cs=4 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
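
To get a rough idea of how many of these faults involve v1 connections, the log lines can be tallied by the messenger marker. This is a sketch under one assumption: that " --1- " (as in the line above) marks a connection using the v1 protocol and " --2- " one using v2. The function name count_faults is made up for illustration.

```shell
#!/bin/sh
# Count "fault initiating reconnect" lines per messenger marker.
# Assumption: " --1- " marks a v1 connection, " --2- " a v2 connection.
# Reads the files named as arguments, or stdin when none are given.
count_faults() {
    awk '
        /fault initiating reconnect/ {
            if (index($0, " --1- "))      v1++
            else if (index($0, " --2- ")) v2++
            else                          other++
        }
        END { printf "v1=%d v2=%d other=%d\n", v1, v2, other }
    ' "$@"
}

# Typical use (the log path depends on your installation):
#   count_faults /var/log/ceph/ceph-osd.12.log
```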

Examining the code showed:

markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l "initiating reconnect" {} \;
./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc
./src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ vi src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l "got BADAUTHORIZER" {} \;
./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc

Since "got BADAUTHORIZER" appears only in the v1 code paths (Pipe.cc
and ProtocolV1.cc), we suspected that the OSDs were still using the
v1 msgr protocol, and ceph osd dump seems to confirm this. We hoped
the error would vanish once we enabled the v2 msgr, and that appears
to be what happened.
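
A quick way to list the OSDs that have not yet bound to a v2 address is to filter the ceph osd dump output. This is only a sketch: it assumes each OSD line starts with "osd.<id>" followed by its up/down state, and that bound addresses carry "v2:" / "v1:" prefixes as in the log excerpt above. The function name osds_without_v2 is made up for illustration.

```shell
#!/bin/sh
# Print the names of up OSDs whose `ceph osd dump` line shows no v2 address.
# Assumption: OSD lines look like "osd.<id> up in ... [v2:...,v1:...] ...".
# Reads the files named as arguments, or stdin when none are given.
osds_without_v2() {
    awk '$1 ~ /^osd\.[0-9]+$/ && $2 == "up" && !/v2:/ { print $1 }' "$@"
}

# Typical use against a live cluster:
#   ceph osd dump | osds_without_v2
```

An empty result would suggest that every up OSD advertises a v2 address.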

So my question is this: it looks like something is wrong with
communication over the v1 protocol after the upgrade. Is that expected?

Regards

Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
