We have upgraded one of our Ceph clusters to Nautilus and have run into
two issues that are causing OSDs to flap. I'll cover the first one here.
We solved it, but it raises an interesting question that might bear on
the second one (which I will post next).
After upgrading deb packages to Nautilus and restarting the mons and
mgrs we worked through restarting the osds. We started to see some of
them flap and saw this in the osd log (many times):
2024-07-31 11:03:33.264 7f22ab6e0700 0 --1- [v2:[2404:130:8020:5::73]:6820/220732,v1:[2404:130:8020:5::73]:6821/220732] >> v1:[2404:130:8020:5::103]:6909/2987374 conn(0x555a5bbbb180 0x555a4172c800 :-1 s=OPENED pgs=144 cs=3 l=0).fault initiating reconnect
And later (usually a single line):
2404:130:8020:5::135]:6903/2933993 conn(0x555ab7b03200 0x555a099fa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=154 cs=4 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
Examining the code showed:
markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l "initiating reconnect" {} \;
./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc
./src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ vi src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l "got BADAUTHORIZER" {} \;
./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc
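Since "got BADAUTHORIZER" appears only in the simple messenger and
ProtocolV1 sources, and not in ProtocolV2, this led us to suspect that
the OSDs were still using the v1 msgr protocol (ceph osd dump seems to
confirm this). We hoped that the error would vanish once we enabled the
v2 msgr, and that appears to be what happened.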
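For anyone wanting to repeat the ceph osd dump check, here is a rough sketch of one way to list OSDs that advertise no v2 address. The function name osds_without_v2 is made up for illustration, and the filter assumes that each OSD line in the dump starts with "osd.N" and that v2 addresses carry a "v2:" prefix, as in the log lines above. The commented command for enabling v2 is the one we understand the Nautilus upgrade notes to use, so treat it as an assumption for your own cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph osd dump` text on stdin and print the
# name of every OSD whose line contains no "v2:" address.
# Assumption: OSD lines begin with "osd.<id> " and v2 addresses are
# written with a "v2:" prefix.
osds_without_v2() {
    awk '/^osd\.[0-9]+ / && !/v2:/ { print $1 }'
}

# Typical use on a live cluster (not run here):
#   ceph osd dump | osds_without_v2
#
# Enabling the v2 messenger on the mons (assumed command; the OSDs then
# need to rebind before they advertise v2 addresses):
#   ceph mon enable-msgr2
```

An empty result after enabling v2 and restarting the OSDs would suggest that every OSD now advertises a v2 address. It does not by itself show which protocol a given connection is using.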
So my question is this: it looks like something is wrong with
communication over the v1 protocol after the upgrade. Is that expected?
Regards
Mark