That log shows 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: tick 2019-04-03 15:39:53.299 7f3733f18700 10 cephx: validate_tickets want 53 have 53 need 0 2019-04-03 15:39:53.299 7f3733f18700 20 cephx client: need_tickets: want=53 have=53 need=0 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-04-03 15:39:23.301595) 2019-04-03 15:39:53.299 7f3733f18700 10 auth: dump_rotating: 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41691 A4Q== expires 2019-04-03 14:43:07.042860 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41692 AD9Q== expires 2019-04-03 15:43:09.895511 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41693 ADQ== expires 2019-04-03 16:43:09.895511 which is all fine. It is getting BADAUTHORIZER talking to another OSD, but I'm guessing it's because that other OSD doesn't have the right tickets. It's hard to tell what's wrong without having al the OSD logs and being able to see the matching ticket renewals (or lack thereof) on the other end of a specific connection. I missed this before: WHO MASK LEVEL OPTION VALUE RO global advanced auth_client_required cephx * global advanced auth_cluster_required cephx * global advanced auth_service_required cephx * ^ Note that these three options aren't actually needed. The only non-default value is auth_client_required, and it's 'cephx, none' meaning the client-side code will allow itself to connect to a cluster with auth disabled. global advanced auth_service_ticket_ttl 3600.000000 global advanced cephx_sign_messages false The fact that the tickets aren't renewing properly suggests that there's something broken when the ttl option is modified (although it's not clear what yet). Similarly, disabling message signing isn't common either and isn't covered by the automated test suite. If you're just trying to get the cluster up I'd remove one or both of those two options and restart all daemons. Looking closer, I see lots of these: 2019-04-03 15:39:56.435 7f374a1fb700 10 _calc_signature seq 46 front_crc_ = 1548381741 middle_crc = 0 data_crc = 0 sig = 7970349432988260882 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 Sender did not set CEPH_MSG_FOOTER_SIGNED. 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 Message signature does not match contents. 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46Signature on message: 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 sig: 0 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46Locally calculated signature: 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 sig_check:7970349432988260882 2019-04-03 15:39:56.435 7f374a1fb700 0 Signature failed. 2019-04-03 15:39:56.435 7f374a1fb700 0 --1- v1:10.36.9.48:6817/47343 >> v1:10.36.9.37:6849/12739 conn(0x58ef0400 0x3c2fd800 :6817 s=READ_FOOTER_AND_DISPATCH pgs=3071352 cs=412808 l=0).handle_message_footer Signature check failed I'm not sure which peer it's talking to here, but it looks like the cephx_sign_messages setting isn't being uniformly applied. Try removing that option! (Out of curiosity, why was it disabled?) sage On Wed, 3 Apr 2019, Shawn Edwards wrote: > ceph-post-file: 789769c4-e7e4-47d3-8fb7-475ea4cfe14a > > This should have the information you need. > > On Wed, Apr 3, 2019 at 5:49 PM Sage Weil <sage@xxxxxxxxxxxx> wrote: > > > This OSD also appears on teh accepting end of things, and probably > > has newer keys that the OSD connecting (tho it' shard to tell > > because teh debug level isn't turned up). > > > > The goal is to find an osd still throwing BADAUTHORIZER messages and then > > turn of debug_auth=20 and debug_monc=20. Specifically I'm looking for the > > output from MonClient::_check_auth_rotating(), which should tell us > > whether the OSD thinks its rotating keys are up to date, or whether it is > > having trouble renewing them, or what. It's weird that some OSDs are > > using hours-old keys to try to authenticate. :/ > > > > sage > > > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote: > > > > > ceph-post-file: 60e07a0c-ee5b-4174-9f51-fa091d5662dc > > > > > > On Wed, Apr 3, 2019 at 5:30 PM Shawn Edwards <lesser.evil@xxxxxxxxx> > > wrote: > > > > > > > According to ceph versions, all bits are running 14.2.0 > > > > > > > > I have restarted all of the OSD at least twice and am still getting the > > > > same error. > > > > > > > > I'll send a log file with confirmed interesting bad behavior shortly > > > > > > > > On Wed, Apr 3, 2019, 17:17 Sage Weil <sage@xxxxxxxxxxxx> wrote: > > > > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 --1- v1: > > 10.36.9.46:6813/5003637 > > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 > > l=0).handle_connect_message_2 > > > >> authorizor_protocol 2 len 174 > > > >> 2019-04-03 15:04:01.986 7ffae5778700 20 AuthRegistry(0xcd64a40) > > > >> get_handler peer_type 4 method 2 cluster_methods [2] service_methods > > [2] > > > >> client_methods [2] > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 cephx: verify_authorizer > > > >> decrypted service osd secret_id=41686 > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 auth: could not find > > > >> secret_id=41686 > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: dump_rotating: > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41691 ... expires > > > >> 2019-04-03 14:43:07.042860 > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41692 ... expires > > > >> 2019-04-03 15:43:09.895511 > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41693 ... expires > > > >> 2019-04-03 16:43:09.895511 > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 cephx: verify_authorizer could > > > >> not get service secret for service osd secret_id=41686 > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 --1- v1: > > 10.36.9.46:6813/5003637 > > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 > > l=0).handle_connect_message_2: > > > >> got bad authorizer, auth_reply_len=0 > > > >> > > > >> For some reason this OSD has much newer rotating keys than the > > > >> connecting OSD. But earlier in the day, this osd was the one > > > >> getting BADAUTHORIZER, so maybe that shifted. Can you find an OSD > > where > > > >> you still see BADAUTHORIZER appearing in the log? > > > >> > > > >> My guess is that if you restart the OSDs, they'll get fresh rotating > > keys > > > >> and things will be fine. But that doesn't explain why they're not > > > >> renewing on their own right now.. that I'm not so sure about. > > > >> > > > >> Are your mons all running nautilus? Does 'ceph versions' show > > everything > > > >> has upgraded? > > > >> > > > >> sage > > > >> > > > >> > > > >> On Wed, 3 Apr 2019, Shawn Edwards wrote: > > > >> > > > >> > File uploaded: f1a2bfb3-92b4-495c-8706-f99cb228efc7 > > > >> > > > > >> > On Wed, Apr 3, 2019 at 4:57 PM Sage Weil <sage@xxxxxxxxxxxx> wrote: > > > >> > > > > >> > > Hmm, that doesn't help. > > > >> > > > > > >> > > Can you set > > > >> > > > > > >> > > ceph config set osd debug_ms 20 > > > >> > > ceph config set osd debug_auth 20 > > > >> > > ceph config set osd debug_monc 20 > > > >> > > > > > >> > > for a few minutes and ceph-post-file the osd logs? (Or send a > > private > > > >> > > email with a link or something.) > > > >> > > > > > >> > > Thanks! > > > >> > > sage > > > >> > > > > > >> > > > > > >> > > On Wed, 3 Apr 2019, Shawn Edwards wrote: > > > >> > > > > > >> > > > No strange auth config: > > > >> > > > > > > >> > > > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)' > > > >> > > > global advanced auth_client_required cephx > > > >> > > > * > > > >> > > > global advanced auth_cluster_required cephx > > > >> > > > * > > > >> > > > global advanced auth_service_required cephx > > > >> > > > * > > > >> > > > > > > >> > > > All boxes are using 'minimal' ceph.conf files and centralized > > > >> config. > > > >> > > > > > > >> > > > If you need the full config, it's here: > > > >> > > > > > https://gist.github.com/lesserevil/3b82d37e517f4561ce53c81629717aae > > > >> > > > > > > >> > > > On Wed, Apr 3, 2019 at 4:07 PM Sage Weil <sage@xxxxxxxxxxxx> > > wrote: > > > >> > > > > > > >> > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote: > > > >> > > > > > Recent nautilus upgrade from mimic. No issues on mimic. > > > >> > > > > > > > > >> > > > > > Now getting this or similar in all osd logs, there is very > > > >> little osd > > > >> > > > > > communicatoin, and most of the PG are either 'down' or > > > >> 'unknown', > > > >> > > even > > > >> > > > > > though I can see the data on the filestores. > > > >> > > > > > > > > >> > > > > > 2019-04-03 13:47:55.280 7f13346e3700 0 --1- [v2: > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1: > > > >> > > 10.36.9.37:6821/8825 > > > >> > > > > > conn(0xa7132000 0xa6b28000 :-1 s=CONNECTING_SEND_CONNECT_MSG > > > >> pgs=0 > > > >> > > cs=0 > > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER > > > >> > > > > > 2019-04-03 13:47:55.296 7f1333ee2700 0 --1- [v2: > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1: > > > >> > > > > 10.36.9.37:6841/11204 > > > >> > > > > > conn(0xa9826d00 0xa9b78000 :-1 s=CONNECTING_SEND_CONNECT_MSG > > > >> pgs=0 > > > >> > > cs=0 > > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER > > > >> > > > > > 2019-04-03 13:47:55.340 7f13346e3700 0 --1- [v2: > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1: > > > >> > > 10.36.9.37:6829/8425 > > > >> > > > > > conn(0xa7997180 0xaeb22800 :-1 s=CONNECTING_SEND_CONNECT_MSG > > > >> pgs=0 > > > >> > > cs=0 > > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 auth: could not find > > > >> > > > > secret_id=41687 > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 cephx: > > verify_authorizer > > > >> > > could > > > >> > > > > not > > > >> > > > > > get service secret for service osd secret_id=41687 > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 --1- [v2: > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1: > > > >> > > > > 10.36.9.48:6805/49547 > > > >> > > > > > conn(0xe02f24480 0xe088cb800 :6803 > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH > > > >> > > > > pgs=0 > > > >> > > > > > cs=0 l=0).handle_connect_message_2: got bad authorizer, > > > >> > > auth_reply_len=0 > > > >> > > > > > > > > >> > > > > > Thoughts? I have confirmed that all ceph boxes have good > > time > > > >> sync. > > > >> > > > > > > > >> > > > > Do you have any non-default auth-related settings in > > ceph.conf? > > > >> > > > > > > > >> > > > > sage > > > >> > > > > > > > >> > > > > > > >> > > > > > > >> > > > -- > > > >> > > > Shawn Edwards > > > >> > > > Beware programmers with screwdrivers. They tend to spill them > > on > > > >> their > > > >> > > > keyboards. > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > >> > -- > > > >> > Shawn Edwards > > > >> > Beware programmers with screwdrivers. They tend to spill them on > > their > > > >> > keyboards. > > > >> > > > > >> > > > > > > > > > > -- > > > Shawn Edwards > > > Beware programmers with screwdrivers. They tend to spill them on their > > > keyboards. > > > > > > > > -- > Shawn Edwards > Beware programmers with screwdrivers. They tend to spill them on their > keyboards. > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com