Re: BADAUTHORIZER in Nautilus

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



That log shows

2019-04-03 15:39:53.299 7f3733f18700 10 monclient: tick
2019-04-03 15:39:53.299 7f3733f18700 10 cephx: validate_tickets want 53 have 53 need 0
2019-04-03 15:39:53.299 7f3733f18700 20 cephx client: need_tickets: want=53 have=53 need=0
2019-04-03 15:39:53.299 7f3733f18700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-04-03 15:39:23.301595)
2019-04-03 15:39:53.299 7f3733f18700 10 auth: dump_rotating:
2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41691 A4Q== expires 2019-04-03 14:43:07.042860
2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41692 AD9Q== expires 2019-04-03 15:43:09.895511
2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41693 ADQ== expires 2019-04-03 16:43:09.895511

which is all fine.  It is getting BADAUTHORIZER talking to another OSD, 
but I'm guessing it's because that other OSD doesn't have the right 
tickets.  It's hard to tell what's wrong without having al the OSD logs 
and being able to see the matching ticket renewals (or lack thereof) on 
the other end of a specific connection.

I missed this before:

WHO      MASK LEVEL    OPTION                             VALUE                                                        
RO
global        advanced auth_client_required               cephx                                                        
*
global        advanced auth_cluster_required              cephx                                                        
*
global        advanced auth_service_required              cephx                                                        
*

^ Note that these three options aren't actually needed.  The only 
non-default value is auth_client_required, and it's 'cephx, none' meaning 
the client-side code will allow itself to connect to a cluster with auth 
disabled.

global        advanced auth_service_ticket_ttl            3600.000000
global        advanced cephx_sign_messages                false

The fact that the tickets aren't renewing properly suggests that there's 
something broken when the ttl option is modified (although it's not clear 
what yet).  Similarly, disabling message signing isn't common either and 
isn't covered by the automated test suite.  If you're just trying to get 
the cluster up I'd remove one or both of those two options and restart all 
daemons.

Looking closer, I see lots of these:

2019-04-03 15:39:56.435 7f374a1fb700 10 _calc_signature seq 46 front_crc_ = 1548381741 middle_crc = 0 data_crc = 0 sig = 7970349432988260882
2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46 Sender did not set CEPH_MSG_FOOTER_SIGNED.
2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46 Message signature does not match contents.
2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46Signature on message:
2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46    sig: 0
2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46Locally calculated signature:
2019-04-03 15:39:56.435 7f374a1fb700  0 SIGN: MSG 46    sig_check:7970349432988260882
2019-04-03 15:39:56.435 7f374a1fb700  0 Signature failed.
2019-04-03 15:39:56.435 7f374a1fb700  0 --1- v1:10.36.9.48:6817/47343 >> v1:10.36.9.37:6849/12739 conn(0x58ef0400 0x3c2fd800 :6817 s=READ_FOOTER_AND_DISPATCH pgs=3071352 cs=412808 l=0).handle_message_footer Signature check failed

I'm not sure which peer it's talking to here, but it looks like the 
cephx_sign_messages setting isn't being uniformly applied.  Try removing 
that option!

(Out of curiosity, why was it disabled?)

sage




On Wed, 3 Apr 2019, Shawn Edwards wrote:
> ceph-post-file: 789769c4-e7e4-47d3-8fb7-475ea4cfe14a
> 
> This should have the information you need.
> 
> On Wed, Apr 3, 2019 at 5:49 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> 
> > This OSD also appears on teh accepting end of things, and probably
> > has newer keys that the OSD connecting (tho it' shard to tell
> > because teh debug level isn't turned up).
> >
> > The goal is to find an osd still throwing BADAUTHORIZER messages and then
> > turn of debug_auth=20 and debug_monc=20.  Specifically I'm looking for the
> > output from MonClient::_check_auth_rotating(), which should tell us
> > whether the OSD thinks its rotating keys are up to date, or whether it is
> > having trouble renewing them, or what.  It's weird that some OSDs are
> > using hours-old keys to try to authenticate.  :/
> >
> > sage
> >
> >
> > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> >
> > > ceph-post-file: 60e07a0c-ee5b-4174-9f51-fa091d5662dc
> > >
> > > On Wed, Apr 3, 2019 at 5:30 PM Shawn Edwards <lesser.evil@xxxxxxxxx>
> > wrote:
> > >
> > > > According to ceph versions, all bits are running 14.2.0
> > > >
> > > > I have restarted all of the OSD at least twice and am still getting the
> > > > same error.
> > > >
> > > > I'll send a log file with confirmed interesting bad behavior shortly
> > > >
> > > > On Wed, Apr 3, 2019, 17:17 Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > >
> > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 --1- v1:
> > 10.36.9.46:6813/5003637
> > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813
> > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=0).handle_connect_message_2
> > > >> authorizor_protocol 2 len 174
> > > >> 2019-04-03 15:04:01.986 7ffae5778700 20 AuthRegistry(0xcd64a40)
> > > >> get_handler peer_type 4 method 2 cluster_methods [2] service_methods
> > [2]
> > > >> client_methods [2]
> > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 cephx: verify_authorizer
> > > >> decrypted service osd secret_id=41686
> > > >> 2019-04-03 15:04:01.986 7ffae5778700  0 auth: could not find
> > > >> secret_id=41686
> > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: dump_rotating:
> > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth:  id 41691 ... expires
> > > >> 2019-04-03 14:43:07.042860
> > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth:  id 41692 ... expires
> > > >> 2019-04-03 15:43:09.895511
> > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth:  id 41693 ... expires
> > > >> 2019-04-03 16:43:09.895511
> > > >> 2019-04-03 15:04:01.986 7ffae5778700  0 cephx: verify_authorizer could
> > > >> not get service secret for service osd secret_id=41686
> > > >> 2019-04-03 15:04:01.986 7ffae5778700  0 --1- v1:
> > 10.36.9.46:6813/5003637
> > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813
> > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=0).handle_connect_message_2:
> > > >> got bad authorizer, auth_reply_len=0
> > > >>
> > > >> For some reason this OSD has much newer rotating keys than the
> > > >> connecting OSD.  But earlier in the day, this osd was the one
> > > >> getting BADAUTHORIZER, so maybe that shifted.  Can you find an OSD
> > where
> > > >> you still see BADAUTHORIZER appearing in the log?
> > > >>
> > > >> My guess is that if you restart the OSDs, they'll get fresh rotating
> > keys
> > > >> and things will be fine.  But that doesn't explain why they're not
> > > >> renewing on their own right now.. that I'm not so sure about.
> > > >>
> > > >> Are your mons all running nautilus?  Does 'ceph versions' show
> > everything
> > > >> has upgraded?
> > > >>
> > > >> sage
> > > >>
> > > >>
> > > >> On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > >>
> > > >> > File uploaded: f1a2bfb3-92b4-495c-8706-f99cb228efc7
> > > >> >
> > > >> > On Wed, Apr 3, 2019 at 4:57 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > >> >
> > > >> > > Hmm, that doesn't help.
> > > >> > >
> > > >> > > Can you set
> > > >> > >
> > > >> > >  ceph config set osd debug_ms 20
> > > >> > >  ceph config set osd debug_auth 20
> > > >> > >  ceph config set osd debug_monc 20
> > > >> > >
> > > >> > > for a few minutes and ceph-post-file the osd logs?  (Or send a
> > private
> > > >> > > email with a link or something.)
> > > >> > >
> > > >> > > Thanks!
> > > >> > > sage
> > > >> > >
> > > >> > >
> > > >> > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > >> > >
> > > >> > > > No strange auth config:
> > > >> > > >
> > > >> > > > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> > > >> > > > global        advanced auth_client_required               cephx
> > > >> > > >                                             *
> > > >> > > > global        advanced auth_cluster_required              cephx
> > > >> > > >                                             *
> > > >> > > > global        advanced auth_service_required              cephx
> > > >> > > >                                             *
> > > >> > > >
> > > >> > > > All boxes are using 'minimal' ceph.conf files and centralized
> > > >> config.
> > > >> > > >
> > > >> > > > If you need the full config, it's here:
> > > >> > > >
> > https://gist.github.com/lesserevil/3b82d37e517f4561ce53c81629717aae
> > > >> > > >
> > > >> > > > On Wed, Apr 3, 2019 at 4:07 PM Sage Weil <sage@xxxxxxxxxxxx>
> > wrote:
> > > >> > > >
> > > >> > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > >> > > > > > Recent nautilus upgrade from mimic.  No issues on mimic.
> > > >> > > > > >
> > > >> > > > > > Now getting this or similar in all osd logs, there is very
> > > >> little osd
> > > >> > > > > > communicatoin, and most of the PG are either 'down' or
> > > >> 'unknown',
> > > >> > > even
> > > >> > > > > > though I can see the data on the filestores.
> > > >> > > > > >
> > > >> > > > > > 2019-04-03 13:47:55.280 7f13346e3700  0 --1- [v2:
> > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > >> > > 10.36.9.37:6821/8825
> > > >> > > > > > conn(0xa7132000 0xa6b28000 :-1 s=CONNECTING_SEND_CONNECT_MSG
> > > >> pgs=0
> > > >> > > cs=0
> > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > >> > > > > > 2019-04-03 13:47:55.296 7f1333ee2700  0 --1- [v2:
> > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > >> > > > > 10.36.9.37:6841/11204
> > > >> > > > > > conn(0xa9826d00 0xa9b78000 :-1 s=CONNECTING_SEND_CONNECT_MSG
> > > >> pgs=0
> > > >> > > cs=0
> > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > >> > > > > > 2019-04-03 13:47:55.340 7f13346e3700  0 --1- [v2:
> > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > >> > > 10.36.9.37:6829/8425
> > > >> > > > > > conn(0xa7997180 0xaeb22800 :-1 s=CONNECTING_SEND_CONNECT_MSG
> > > >> pgs=0
> > > >> > > cs=0
> > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700  0 auth: could not find
> > > >> > > > > secret_id=41687
> > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700  0 cephx:
> > verify_authorizer
> > > >> > > could
> > > >> > > > > not
> > > >> > > > > > get service secret for service osd secret_id=41687
> > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700  0 --1- [v2:
> > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > >> > > > > 10.36.9.48:6805/49547
> > > >> > > > > > conn(0xe02f24480 0xe088cb800 :6803
> > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH
> > > >> > > > > pgs=0
> > > >> > > > > > cs=0 l=0).handle_connect_message_2: got bad authorizer,
> > > >> > > auth_reply_len=0
> > > >> > > > > >
> > > >> > > > > > Thoughts?  I have confirmed that all ceph boxes have good
> > time
> > > >> sync.
> > > >> > > > >
> > > >> > > > > Do you have any non-default auth-related settings in
> > ceph.conf?
> > > >> > > > >
> > > >> > > > > sage
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > >  Shawn Edwards
> > > >> > > >  Beware programmers with screwdrivers.  They tend to spill them
> > on
> > > >> their
> > > >> > > > keyboards.
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> >  Shawn Edwards
> > > >> >  Beware programmers with screwdrivers.  They tend to spill them on
> > their
> > > >> > keyboards.
> > > >> >
> > > >>
> > > >
> > >
> > > --
> > >  Shawn Edwards
> > >  Beware programmers with screwdrivers.  They tend to spill them on their
> > > keyboards.
> > >
> >
> 
> 
> -- 
>  Shawn Edwards
>  Beware programmers with screwdrivers.  They tend to spill them on their
> keyboards.
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux