According to 'ceph versions', all bits are running 14.2.0.
I have restarted all of the OSDs at least twice and am still getting the same error.
I'll send a log file with confirmed interesting bad behavior shortly.
On Wed, Apr 3, 2019, 17:17 Sage Weil <sage@xxxxxxxxxxxx> wrote:
2019-04-03 15:04:01.986 7ffae5778700 10 --1- v1:10.36.9.46:6813/5003637 >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 authorizor_protocol 2 len 174
2019-04-03 15:04:01.986 7ffae5778700 20 AuthRegistry(0xcd64a40) get_handler peer_type 4 method 2 cluster_methods [2] service_methods [2] client_methods [2]
2019-04-03 15:04:01.986 7ffae5778700 10 cephx: verify_authorizer decrypted service osd secret_id=41686
2019-04-03 15:04:01.986 7ffae5778700 0 auth: could not find secret_id=41686
2019-04-03 15:04:01.986 7ffae5778700 10 auth: dump_rotating:
2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41691 ... expires 2019-04-03 14:43:07.042860
2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41692 ... expires 2019-04-03 15:43:09.895511
2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41693 ... expires 2019-04-03 16:43:09.895511
2019-04-03 15:04:01.986 7ffae5778700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=41686
2019-04-03 15:04:01.986 7ffae5778700 0 --1- v1:10.36.9.46:6813/5003637 >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got bad authorizer, auth_reply_len=0
For some reason this OSD has much newer rotating keys than the
connecting OSD. But earlier in the day, this osd was the one
getting BADAUTHORIZER, so maybe that shifted. Can you find an OSD where
you still see BADAUTHORIZER appearing in the log?
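
One way to answer that question across many logs is to tally the two failure messages by peer address. This is a minimal sketch, assuming the messenger lines keep the format seen in this thread; the function name and the two labels are made up for illustration, and the sample lines are copied from the logs quoted in this thread.

```python
import re
from collections import Counter

# Sample lines copied from the OSD logs quoted in this thread.
SAMPLE = """\
2019-04-03 13:47:55.280 7f13346e3700 0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6821/8825 conn(0xa7132000 0xa6b28000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2019-04-03 13:47:55.296 7f1333ee2700 0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6841/11204 conn(0xa9826d00 0xa9b78000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2019-04-03 13:47:55.428 7f1334ee4700 0 auth: could not find secret_id=41687
2019-04-03 13:47:55.428 7f1334ee4700 0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.48:6805/49547 conn(0xe02f24480 0xe088cb800 :6803 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got bad authorizer, auth_reply_len=0
"""

# Peer address as printed after ">>" in the v1 messenger lines above.
PEER = re.compile(r">> v1:(\d+\.\d+\.\d+\.\d+):\d+/\d+ ")


def tally_auth_failures(log_text):
    """Count auth failures per (peer IP, direction). Hypothetical helper."""
    counts = Counter()
    for line in log_text.splitlines():
        if "connect got BADAUTHORIZER" in line:
            # Connecting side: the peer refused the authorizer we sent.
            kind = "peer rejected us"
        elif "got bad authorizer" in line:
            # Accepting side: we refused the authorizer the peer sent.
            kind = "we rejected peer"
        else:
            continue
        match = PEER.search(line)
        if match:
            counts[(match.group(1), kind)] += 1
    return counts


if __name__ == "__main__":
    for (peer, kind), n in sorted(tally_auth_failures(SAMPLE).items()):
        print(peer, kind, n)
```

Feeding each OSD's log text through the same function and comparing the counts over time would show which OSDs are still hitting the error and in which direction.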
My guess is that if you restart the OSDs, they'll get fresh rotating keys
and things will be fine. But that doesn't explain why they're not
renewing on their own right now.. that I'm not so sure about.
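
The key mismatch in the excerpt above can be read off mechanically. The sketch below is only an illustration: it assumes the debug_auth 20 line format shown in that excerpt, the helper name is hypothetical, and the log text is a subset of the lines quoted above.

```python
import re
from datetime import datetime

# Lines copied from the verify_authorizer / dump_rotating excerpt above.
LOG = """\
2019-04-03 15:04:01.986 7ffae5778700 10 cephx: verify_authorizer decrypted service osd secret_id=41686
2019-04-03 15:04:01.986 7ffae5778700 0 auth: could not find secret_id=41686
2019-04-03 15:04:01.986 7ffae5778700 10 auth: dump_rotating:
2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41691 ... expires 2019-04-03 14:43:07.042860
2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41692 ... expires 2019-04-03 15:43:09.895511
2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41693 ... expires 2019-04-03 16:43:09.895511
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
REQUESTED = re.compile(r"verify_authorizer decrypted service \S+ secret_id=(\d+)")
HELD = re.compile(r"auth: id (\d+) .* expires (\S+ \S+)")


def summarize_rotating_keys(log_text):
    """Compare the secret_id a peer used with the keys this OSD holds."""
    requested = None
    logged_at = None
    held = {}
    for line in log_text.splitlines():
        match = REQUESTED.search(line)
        if match:
            requested = int(match.group(1))
            # The first 23 characters of each line are the log timestamp.
            logged_at = datetime.strptime(line[:23], TS_FORMAT)
            continue
        match = HELD.search(line)
        if match:
            held[int(match.group(1))] = datetime.strptime(match.group(2), TS_FORMAT)
    if requested is None or not held:
        raise ValueError("no verify_authorizer / dump_rotating lines found")
    return {
        "requested": requested,
        "held": sorted(held),
        "present": requested in held,
        # Keys whose expiry is already in the past at the time of the log line.
        "expired": sorted(k for k, exp in held.items() if exp < logged_at),
        # How many ids the requested key lags behind the oldest key held.
        "ids_behind_oldest": min(held) - requested,
    }


if __name__ == "__main__":
    print(summarize_rotating_keys(LOG))
```

For the excerpt above this reports that secret_id 41686 is not among the held keys, that it is five ids older than the oldest key held (41691), and that 41691 itself had already expired when the line was logged.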
Are your mons all running nautilus? Does 'ceph versions' show everything
has upgraded?
sage
On Wed, 3 Apr 2019, Shawn Edwards wrote:
> File uploaded: f1a2bfb3-92b4-495c-8706-f99cb228efc7
>
> On Wed, Apr 3, 2019 at 4:57 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> > Hmm, that doesn't help.
> >
> > Can you set
> >
> > ceph config set osd debug_ms 20
> > ceph config set osd debug_auth 20
> > ceph config set osd debug_monc 20
> >
> > for a few minutes and ceph-post-file the osd logs? (Or send a private
> > email with a link or something.)
> >
> > Thanks!
> > sage
> >
> >
> > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> >
> > > No strange auth config:
> > >
> > > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> > > global advanced auth_client_required cephx *
> > > global advanced auth_cluster_required cephx *
> > > global advanced auth_service_required cephx *
> > >
> > > All boxes are using 'minimal' ceph.conf files and centralized config.
> > >
> > > If you need the full config, it's here:
> > > https://gist.github.com/lesserevil/3b82d37e517f4561ce53c81629717aae
> > >
> > > On Wed, Apr 3, 2019 at 4:07 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >
> > > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > > > Recent nautilus upgrade from mimic. No issues on mimic.
> > > > >
> > > > > Now getting this or similar in all osd logs, there is very little osd
> > > > > communication, and most of the PGs are either 'down' or 'unknown', even
> > > > > though I can see the data on the filestores.
> > > > >
> > > > > 2019-04-03 13:47:55.280 7f13346e3700 0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6821/8825 conn(0xa7132000 0xa6b28000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > > > 2019-04-03 13:47:55.296 7f1333ee2700 0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6841/11204 conn(0xa9826d00 0xa9b78000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > > > 2019-04-03 13:47:55.340 7f13346e3700 0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.37:6829/8425 conn(0xa7997180 0xaeb22800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 auth: could not find secret_id=41687
> > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=41687
> > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 --1- [v2:10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:10.36.9.48:6805/49547 conn(0xe02f24480 0xe088cb800 :6803 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2: got bad authorizer, auth_reply_len=0
> > > > >
> > > > > Thoughts? I have confirmed that all ceph boxes have good time sync.
> > > >
> > > > Do you have any non-default auth-related settings in ceph.conf?
> > > >
> > > > sage
> > > >
> > >
> > >
> > > --
> > > Shawn Edwards
> > > Beware programmers with screwdrivers. They tend to spill them on their
> > > keyboards.
> > >
> >
>
>
> --
> Shawn Edwards
> Beware programmers with screwdrivers. They tend to spill them on their
> keyboards.
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com