Re: Osds going down/flapping after Luminous to Nautilus upgrade part 1

Hi,

The upgrade notes for Nautilus [0] contain this section:

> Running nautilus OSDs will not bind to their v2 address automatically. They must be restarted for that to happen.

Regards,
Eugen

[0] https://docs.ceph.com/en/latest/releases/nautilus/#instructions

Quoting Mark Kirkwood <markkirkwood@xxxxxxxxxxxxxxxx>:

We have upgraded one of our Ceph clusters to Nautilus and have run into two issues that cause OSDs to flap. I'll cover the first one here. We solved it, but it raises an interesting question that might bear on the second one (which I will post next).

After upgrading the deb packages to Nautilus and restarting the mons and mgrs, we worked through restarting the OSDs. Some of them started to flap, and we saw this in the OSD log (many times):

2024-07-31 11:03:33.264 7f22ab6e0700  0 --1- [v2:[2404:130:8020:5::73]:6820/220732,v1:[2404:130:8020:5::73]:6821/220732] >> v1:[2404:130:8020:5::103]:6909/2987374 conn(0x555a5bbbb180 0x555a4172c800 :-1 s=OPENED pgs=144 cs=3 l=0).fault initiating reconnect

And later (usually a single line):

2404:130:8020:5::135]:6903/2933993 conn(0x555ab7b03200 0x555a099fa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=154 cs=4 l=0).handle_connect_reply_2 connect got BADAUTHORIZER

Searching the Ceph source for these two messages showed:

markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l "initiating reconnect" {} \;
./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc
./src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ vi src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l "got BADAUTHORIZER" {} \;
./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc

The "got BADAUTHORIZER" message only appears in the v1 code paths (Pipe.cc and ProtocolV1.cc), which led us to suspect that the OSDs were using the v1 msgr protocol (the output of ceph osd dump seems to confirm this). We hoped that the error would vanish once we enabled the v2 msgr, and that appears to be what happened.

So my question is this: it looks like something is wrong with communication over the v1 protocol after the upgrade. Is that expected?
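
For anyone wanting to repeat the ceph osd dump check, here is a minimal sketch of a filter that lists OSDs with no v2 address. It assumes that each OSD appears on a line starting with "osd.N " and that a v2-bound OSD shows a "v2:" prefix somewhere on that line, as in the log excerpt above. The function name osds_without_v2 is hypothetical, and the exact dump layout may differ between releases.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph osd dump` text on stdin and prints the
# name of every OSD whose line carries no "v2:" address.
# Assumption: OSD entries start with "osd.<id> " and v2-bound OSDs list
# their addresses as "[v2:...,v1:...]".
osds_without_v2() {
    awk '/^osd\.[0-9]+ / && !/v2:/ { print $1 }'
}
```

On a cluster node this would be used as `ceph osd dump | osds_without_v2`. An empty result would suggest that every OSD has bound to a v2 address.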

Regards

Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

