Update:
I think we have a work-around, but no root cause yet.
What is working is removing the 'v2' bits from the ceph.conf file across the cluster, and turning off all cephx authentication. Now everything seems to be talking correctly other than some odd metrics around the edges.
Here's my current ceph.conf, running on all ceph hosts and clients:
[global]
fsid = 3f390b5e-2b1d-4a2f-ba00-xxxxxxxxxxxx
auth_client_required = none
auth_cluster_required = none
auth_service_required = none
If we get better information as to what's going on, I'll post here for future reference
On Thu, Apr 4, 2019 at 9:16 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Thu, 4 Apr 2019, Shawn Edwards wrote:
> It was disabled in a fit of genetic debugging. I've now tried to revert
> all config settings related to auth and signing to defaults.
>
> I can't seem to change the auth_*_required settings. If I try to remove
> them, they stay set. If I try to change them, I get both the old and new
> settings:
>
> root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> global advanced auth_client_required cephx
> *
> global advanced auth_cluster_required cephx
> *
> global advanced auth_service_required cephx
> *
> root@tyr-ceph-mon0:~# ceph config rm global auth_service_required
> root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> global advanced auth_client_required cephx
> *
> global advanced auth_cluster_required cephx
> *
> global advanced auth_service_required cephx
> *
> root@tyr-ceph-mon0:~# ceph config set global auth_service_required none
> root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> global advanced auth_client_required cephx
> *
> global advanced auth_cluster_required cephx
> *
> global advanced auth_service_required none
> *
> global advanced auth_service_required cephx
> *
>
> I know these are set to RO, but according to your blog posts, this means
> they don't get updated until a daemon restart. Does this look correct to
> you? I'm assuming I need to restart all daemons on all hosts. Is this
> correct?
Yeah, that is definitely not behaving properly. Can you try "ceph
config-key dump | grep config/" to look at how those keys are stored? You
should see something like
"config/auth_cluster_required": "cephx",
"config/auth_service_required": "cephx",
"config/auth_service_ticket_ttl": "3600.000000",
but maybe those names are formed differently, maybe with ".../global/..."
in there? My guess is a subtle naming behavior change between mimic or
something. You can remove the keys via the config-key interface and then
restart the mons (or adjust any random config option) to make the
mons refresh. After that config dump should show the right thing.
Maybe a disagreement/confusion about the actual value of
auth_service_ticket_ttl is the cause of this. You might try doing 'ceph
config show osd.0' and/or a mon to see what value for the auth options the
daemons are actually using and reporting...
sage
>
> On Thu, Apr 4, 2019 at 5:54 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> > That log shows
> >
> > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: tick
> > 2019-04-03 15:39:53.299 7f3733f18700 10 cephx: validate_tickets want 53
> > have 53 need 0
> > 2019-04-03 15:39:53.299 7f3733f18700 20 cephx client: need_tickets:
> > want=53 have=53 need=0
> > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: _check_auth_rotating
> > have uptodate secrets (they expire after 2019-04-03 15:39:23.301595)
> > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: dump_rotating:
> > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41691 A4Q== expires
> > 2019-04-03 14:43:07.042860
> > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41692 AD9Q== expires
> > 2019-04-03 15:43:09.895511
> > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: id 41693 ADQ== expires
> > 2019-04-03 16:43:09.895511
> >
> > which is all fine. It is getting BADAUTHORIZER talking to another OSD,
> > but I'm guessing it's because that other OSD doesn't have the right
> > tickets. It's hard to tell what's wrong without having al the OSD logs
> > and being able to see the matching ticket renewals (or lack thereof) on
> > the other end of a specific connection.
> >
> > I missed this before:
> >
> > WHO MASK LEVEL OPTION VALUE
> >
> > RO
> > global advanced auth_client_required cephx
> >
> > *
> > global advanced auth_cluster_required cephx
> >
> > *
> > global advanced auth_service_required cephx
> >
> > *
> >
> > ^ Note that these three options aren't actually needed. The only
> > non-default value is auth_client_required, and it's 'cephx, none' meaning
> > the client-side code will allow itself to connect to a cluster with auth
> > disabled.
> >
> > global advanced auth_service_ticket_ttl 3600.000000
> > global advanced cephx_sign_messages false
> >
> > The fact that the tickets aren't renewing properly suggests that there's
> > something broken when the ttl option is modified (although it's not clear
> > what yet). Similarly, disabling message signing isn't common either and
> > isn't covered by the automated test suite. If you're just trying to get
> > the cluster up I'd remove one or both of those two options and restart all
> > daemons.
> >
> > Looking closer, I see lots of these:
> >
> > 2019-04-03 15:39:56.435 7f374a1fb700 10 _calc_signature seq 46 front_crc_
> > = 1548381741 middle_crc = 0 data_crc = 0 sig = 7970349432988260882
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 Sender did not set
> > CEPH_MSG_FOOTER_SIGNED.
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 Message signature
> > does not match contents.
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46Signature on message:
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46 sig: 0
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46Locally calculated
> > signature:
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 SIGN: MSG 46
> > sig_check:7970349432988260882
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 Signature failed.
> > 2019-04-03 15:39:56.435 7f374a1fb700 0 --1- v1:10.36.9.48:6817/47343 >>
> > v1:10.36.9.37:6849/12739 conn(0x58ef0400 0x3c2fd800 :6817
> > s=READ_FOOTER_AND_DISPATCH pgs=3071352 cs=412808 l=0).handle_message_footer
> > Signature check failed
> >
> > I'm not sure which peer it's talking to here, but it looks like the
> > cephx_sign_messages setting isn't being uniformly applied. Try removing
> > that option!
> >
> > (Out of curiosity, why was it disabled?)
> >
> > sage
> >
> >
> >
> >
> > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > ceph-post-file: 789769c4-e7e4-47d3-8fb7-475ea4cfe14a
> > >
> > > This should have the information you need.
> > >
> > > On Wed, Apr 3, 2019 at 5:49 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >
> > > > This OSD also appears on teh accepting end of things, and probably
> > > > has newer keys that the OSD connecting (tho it' shard to tell
> > > > because teh debug level isn't turned up).
> > > >
> > > > The goal is to find an osd still throwing BADAUTHORIZER messages and
> > then
> > > > turn of debug_auth=20 and debug_monc=20. Specifically I'm looking for
> > the
> > > > output from MonClient::_check_auth_rotating(), which should tell us
> > > > whether the OSD thinks its rotating keys are up to date, or whether it
> > is
> > > > having trouble renewing them, or what. It's weird that some OSDs are
> > > > using hours-old keys to try to authenticate. :/
> > > >
> > > > sage
> > > >
> > > >
> > > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > >
> > > > > ceph-post-file: 60e07a0c-ee5b-4174-9f51-fa091d5662dc
> > > > >
> > > > > On Wed, Apr 3, 2019 at 5:30 PM Shawn Edwards <lesser.evil@xxxxxxxxx>
> > > > wrote:
> > > > >
> > > > > > According to ceph versions, all bits are running 14.2.0
> > > > > >
> > > > > > I have restarted all of the OSD at least twice and am still
> > getting the
> > > > > > same error.
> > > > > >
> > > > > > I'll send a log file with confirmed interesting bad behavior
> > shortly
> > > > > >
> > > > > > On Wed, Apr 3, 2019, 17:17 Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > > > >
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 --1- v1:
> > > > 10.36.9.46:6813/5003637
> > > > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813
> > > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > > > l=0).handle_connect_message_2
> > > > > >> authorizor_protocol 2 len 174
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 20 AuthRegistry(0xcd64a40)
> > > > > >> get_handler peer_type 4 method 2 cluster_methods [2]
> > service_methods
> > > > [2]
> > > > > >> client_methods [2]
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 cephx: verify_authorizer
> > > > > >> decrypted service osd secret_id=41686
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 auth: could not find
> > > > > >> secret_id=41686
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: dump_rotating:
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41691 ...
> > expires
> > > > > >> 2019-04-03 14:43:07.042860
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41692 ...
> > expires
> > > > > >> 2019-04-03 15:43:09.895511
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 10 auth: id 41693 ...
> > expires
> > > > > >> 2019-04-03 16:43:09.895511
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 cephx: verify_authorizer
> > could
> > > > > >> not get service secret for service osd secret_id=41686
> > > > > >> 2019-04-03 15:04:01.986 7ffae5778700 0 --1- v1:
> > > > 10.36.9.46:6813/5003637
> > > > > >> >> v1:10.36.9.28:6809/8224 conn(0xf6a6000 0x30a02000 :6813
> > > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > > > l=0).handle_connect_message_2:
> > > > > >> got bad authorizer, auth_reply_len=0
> > > > > >>
> > > > > >> For some reason this OSD has much newer rotating keys than the
> > > > > >> connecting OSD. But earlier in the day, this osd was the one
> > > > > >> getting BADAUTHORIZER, so maybe that shifted. Can you find an OSD
> > > > where
> > > > > >> you still see BADAUTHORIZER appearing in the log?
> > > > > >>
> > > > > >> My guess is that if you restart the OSDs, they'll get fresh
> > rotating
> > > > keys
> > > > > >> and things will be fine. But that doesn't explain why they're not
> > > > > >> renewing on their own right now.. that I'm not so sure about.
> > > > > >>
> > > > > >> Are your mons all running nautilus? Does 'ceph versions' show
> > > > everything
> > > > > >> has upgraded?
> > > > > >>
> > > > > >> sage
> > > > > >>
> > > > > >>
> > > > > >> On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > > > >>
> > > > > >> > File uploaded: f1a2bfb3-92b4-495c-8706-f99cb228efc7
> > > > > >> >
> > > > > >> > On Wed, Apr 3, 2019 at 4:57 PM Sage Weil <sage@xxxxxxxxxxxx>
> > wrote:
> > > > > >> >
> > > > > >> > > Hmm, that doesn't help.
> > > > > >> > >
> > > > > >> > > Can you set
> > > > > >> > >
> > > > > >> > > ceph config set osd debug_ms 20
> > > > > >> > > ceph config set osd debug_auth 20
> > > > > >> > > ceph config set osd debug_monc 20
> > > > > >> > >
> > > > > >> > > for a few minutes and ceph-post-file the osd logs? (Or send a
> > > > private
> > > > > >> > > email with a link or something.)
> > > > > >> > >
> > > > > >> > > Thanks!
> > > > > >> > > sage
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > > > >> > >
> > > > > >> > > > No strange auth config:
> > > > > >> > > >
> > > > > >> > > > root@tyr-ceph-mon0:~# ceph config dump | grep -E
> > '(auth|cephx)'
> > > > > >> > > > global advanced auth_client_required
> > cephx
> > > > > >> > > > *
> > > > > >> > > > global advanced auth_cluster_required
> > cephx
> > > > > >> > > > *
> > > > > >> > > > global advanced auth_service_required
> > cephx
> > > > > >> > > > *
> > > > > >> > > >
> > > > > >> > > > All boxes are using 'minimal' ceph.conf files and
> > centralized
> > > > > >> config.
> > > > > >> > > >
> > > > > >> > > > If you need the full config, it's here:
> > > > > >> > > >
> > > > https://gist.github.com/lesserevil/3b82d37e517f4561ce53c81629717aae
> > > > > >> > > >
> > > > > >> > > > On Wed, Apr 3, 2019 at 4:07 PM Sage Weil <sage@xxxxxxxxxxxx
> > >
> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > On Wed, 3 Apr 2019, Shawn Edwards wrote:
> > > > > >> > > > > > Recent nautilus upgrade from mimic. No issues on mimic.
> > > > > >> > > > > >
> > > > > >> > > > > > Now getting this or similar in all osd logs, there is
> > very
> > > > > >> little osd
> > > > > >> > > > > > communicatoin, and most of the PG are either 'down' or
> > > > > >> 'unknown',
> > > > > >> > > even
> > > > > >> > > > > > though I can see the data on the filestores.
> > > > > >> > > > > >
> > > > > >> > > > > > 2019-04-03 13:47:55.280 7f13346e3700 0 --1- [v2:
> > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > > > >> > > 10.36.9.37:6821/8825
> > > > > >> > > > > > conn(0xa7132000 0xa6b28000 :-1
> > s=CONNECTING_SEND_CONNECT_MSG
> > > > > >> pgs=0
> > > > > >> > > cs=0
> > > > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > > > >> > > > > > 2019-04-03 13:47:55.296 7f1333ee2700 0 --1- [v2:
> > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > > > >> > > > > 10.36.9.37:6841/11204
> > > > > >> > > > > > conn(0xa9826d00 0xa9b78000 :-1
> > s=CONNECTING_SEND_CONNECT_MSG
> > > > > >> pgs=0
> > > > > >> > > cs=0
> > > > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > > > >> > > > > > 2019-04-03 13:47:55.340 7f13346e3700 0 --1- [v2:
> > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > > > >> > > 10.36.9.37:6829/8425
> > > > > >> > > > > > conn(0xa7997180 0xaeb22800 :-1
> > s=CONNECTING_SEND_CONNECT_MSG
> > > > > >> pgs=0
> > > > > >> > > cs=0
> > > > > >> > > > > > l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 auth: could not
> > find
> > > > > >> > > > > secret_id=41687
> > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 cephx:
> > > > verify_authorizer
> > > > > >> > > could
> > > > > >> > > > > not
> > > > > >> > > > > > get service secret for service osd secret_id=41687
> > > > > >> > > > > > 2019-04-03 13:47:55.428 7f1334ee4700 0 --1- [v2:
> > > > > >> > > > > > 10.36.9.26:6802/3107,v1:10.36.9.26:6803/3107] >> v1:
> > > > > >> > > > > 10.36.9.48:6805/49547
> > > > > >> > > > > > conn(0xe02f24480 0xe088cb800 :6803
> > > > > >> s=ACCEPTING_WAIT_CONNECT_MSG_AUTH
> > > > > >> > > > > pgs=0
> > > > > >> > > > > > cs=0 l=0).handle_connect_message_2: got bad authorizer,
> > > > > >> > > auth_reply_len=0
> > > > > >> > > > > >
> > > > > >> > > > > > Thoughts? I have confirmed that all ceph boxes have
> > good
> > > > time
> > > > > >> sync.
> > > > > >> > > > >
> > > > > >> > > > > Do you have any non-default auth-related settings in
> > > > ceph.conf?
> > > > > >> > > > >
> > > > > >> > > > > sage
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > --
> > > > > >> > > > Shawn Edwards
> > > > > >> > > > Beware programmers with screwdrivers. They tend to spill
> > them
> > > > on
> > > > > >> their
> > > > > >> > > > keyboards.
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > Shawn Edwards
> > > > > >> > Beware programmers with screwdrivers. They tend to spill them
> > on
> > > > their
> > > > > >> > keyboards.
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > > > --
> > > > > Shawn Edwards
> > > > > Beware programmers with screwdrivers. They tend to spill them on
> > their
> > > > > keyboards.
> > > > >
> > > >
> > >
> > >
> > > --
> > > Shawn Edwards
> > > Beware programmers with screwdrivers. They tend to spill them on their
> > > keyboards.
> > >
> >
>
>
> --
> Shawn Edwards
> Beware programmers with screwdrivers. They tend to spill them on their
> keyboards.
>
Shawn Edwards
Beware programmers with screwdrivers. They tend to spill them on their keyboards.
Beware programmers with screwdrivers. They tend to spill them on their keyboards.
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com