Re: client feature requirements appear very broken

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Oct 18, 2018 at 8:30 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Wed, 17 Oct 2018, Gregory Farnum wrote:
> > I was given a cluster recently where they'd managed to enable upmap
> > despite clients not supporting it, and then on removing the upmap
> > features the clients were triggering monitor crashes.
> >
> > While trying to establish what had happened, I discovered that most of
> > the controls in this area have become pretty stale or broken!
> >
> > So far as I can tell:
> >
> > * Admins can invoke "osd set -require-min-compat-client", and unless
> > you override it tries check against already-connected clients
>
> Right. The check against connected clients is imperfect, but better than
> nothing.  The FeatureMap tracks entity_type -> features -> count on each
> mon, and those are shared by the lease ack messages to teh leader
> mon (who is the only one with a full view of all connected clients).

Ah I'd missed that portion of the functionality, I thought it was just
grabbing the other mon's messenger features. Nice!

>
> >   * but the check makes use of ceph_release_features() (which I've
> > recently noted to be a bit weak) and also only checks against the
> > entities which are connected to the local monitor making the change.
>
> IIRC the issue there is that there aren't actually any new client/cluster
> (i.e., features reflected in the kernel client) features that we've
> introduced in mimic, so it can't tell the two apart.  We will have the
> msgr2 feature bit for nautilus, though, so I think mimic will be the only
> oddity here.
>
> > * As best I can tell, setting a min_compat_client does *not* actually
> > require newly-connected entities to have the required feature set!
>
> Correct.  It's a guard against making cluster changes that will break the
> stated min_compat_client target, but doesn't actually lock out old clients
> in and of itself.  We could change that, I suppose.. I can't rembmer if we
> had a broader discussion about it leading up to luminous?
>
> > * The OSDMonitor DOES require any connecting entity satisfy the
> > features demanded by OSDMap::get_features() (via
> > update_msgr_features(), called from OSDMonitor::update_from_paxos())
> >   * but this function has not been reliably updated? It adds
> > CEPH_FEATUREMASK_SERVER_KRAKEN for OSDs (but not others!) and is
> > configured to add CEPH_FEATUREMASK_CEPHX_V2 when appropriate, but has
> > nothing from luminous (like, say...UPMAP).
>
> I think there just weren't obvious OSDMap-specific features in luminous or
> mimic to add.  There *are* crushmap features (the weight-sets), covered by
>
>   if (crush->has_incompat_choose_args()) {
>     features |= CEPH_FEATUREMASK_CRUSH_CHOOSE_ARGS;
>   }
>
> at the top of the function.  And in master there is the CEPHX_v2
> requirement for osds if require_osd_release >= nautilus (doesn't affect
> clients).

Oh, hrm. I now finally realize that a bunch of the luminous features
got added at the top of that function instead of the bottom. Not sure
how I missed that in searching. *sigh*


> > Am I misunderstanding something here? Am I correct in thinking that we
> > just need to wire up require_min_compat_client to
> > OSDMap::get_features(), or is there another enforcement mechanism
> > we're looking for?
>
> It depends on the goal... if we want to gate actual client connections
> based on require_min_compat_client, then yeah, I think that would do it.
> Maybe that's a good idea based on the principle of least surprise (because
> a change other than changing min_compat_client won't affect clients'
> ability to connect).
>
> Which sounds pretty reasonable, when I think about it now!

Yeah I'll need to look back at the data I have and make sure I
completely understand what happened, but I think we need to do
something like this (while figuring out some way of being sensitive to
the kernel client per-feature abilities, as Ilya points out).

Or else we need some other way to deal with the situation that appears
to have developed:

1) admin sets require_min_compat_client high enough to let them enable
upmap (...apparently with the --force option?)
2) admin turns on upmap
3) admin discovers some important kernel clients don't support upmap
4) admin turns off upmap
5) kernel clients have never been turned off and kept trying to
reconnect to monitors for the "next" osdmap (which enabled upmap)
6) monitor now lets them in and tries to encode upmap-noting osdmap for them
7) monitor crashes because it asserts against encoding upmap for a
connection that doesn't understand upmap.

I'll poke around a bit and see if I can find an appropriate way to
deal with the kernel clients supporting feature subsets.
-Greg



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux