Re: per client mds throttle

Unfortunately this is VERY complicated.

On Wed, Jul 3, 2019 at 4:45 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> On Wed, Jul 3, 2019 at 12:30 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Tue, 2019-07-02 at 17:24 +0200, Dan van der Ster wrote:
> > > Hi,
> > >
> > > Are there any plans to implement a per-client throttle on mds client requests?
> > >
> > > We just had an interesting case where a new cephfs user was hammering
> > > an mds from several hosts. In the end we found that their code was
> > > doing:
> > >
> > >   while d := getafewbytesofdata():
> > >       f = open("file.dat", "a")
> > >       f.write(d)
> > >       f.close()
> > >
> > > By changing their code to:
> > >
> > >   f = open("file.dat", "a")
> > >   while d := getafewbytesofdata():
> > >       f.write(d)
> > >   f.close()
> > >
> > > it completely removes their load on the mds (for obvious reasons).
> > >
> > > In a multi-user environment it's hard to scrutinize every user's
> > > application, so we'd prefer to just throttle down the client req rates
> > > (and let them suffer from the poor performance).
> > >
> > > Thoughts?
> > >
> > >
> >
> > (cc'ing Xuehan)
> >
> > It sounds like a reasonable thing to do at first glance. There was a
> > patchset recently by Xuehan Xu to add a new io controller policy for
> > cephfs, but that was focused on OSD ops issued on behalf of cephfs
> > clients, which isn't quite what you're asking about here.
> >
> > The challenge with all of these sorts of throttling schemes is how to
> > parcel things out to individual clients. MDS/OSD ops are not a discrete
> > resource, and it's difficult to gauge how much to allocate to each
> > client.
> >
> > I think if we were going to do something along these lines, it'd be good
> > to work out how you'd throttle both MDS and OSD ops to keep a lid on
> > things. That said, this is not a trivial problem to tackle, IMO.
> >
>
> Thanks for the reply, Jeff.
> At the moment I'm only considering adding a simple throttle on the MDS
> client ops. From the practical standpoint, we have seen clients
> overloading MDS's already, but haven't suffered from any OSD related
> load issues.
> Plus, the distributed QoS stuff should better solve the OSD problem, AFAIU.
>
> > Some questions to get you started should you choose to pursue this:
> >
> > - Will you throttle these ops at the MDS or on the clients? Ditto for
> > the OSDs...
> >
> > - How will it work? Will there be a fixed cap of some sort for a given
> > amount of time, or are you more looking to just delay processing ops for
> > a single client when it's "too busy"?
>
> My basic idea is to add this machinery early on in handle_client_request:
>   - record each Session's last req timestamp
>   - if now()-last_req_timestamp for a given req/session is less than a
> configurable delay, inject a delay. (e.g.
> mds_per_client_request_sleep, defaults to 0, we'd use 0.01 to throttle
> clients to 100Hz each)
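
A minimal sketch of that per-session check, in plain Python rather than
MDS C++ (Session, handle_client_request and mds_per_client_request_sleep
here are just stand-ins for the names above, and the naive time.sleep()
is a placeholder for whatever delay mechanism the MDS would actually use):

  import time

  mds_per_client_request_sleep = 0.01  # 0 disables the throttle; 0.01 ~ 100 req/s per client

  class Session:
      def __init__(self):
          self.last_req_timestamp = 0.0

  def handle_client_request(session, request):
      if mds_per_client_request_sleep > 0:
          elapsed = time.monotonic() - session.last_req_timestamp
          if elapsed < mds_per_client_request_sleep:
              # Naive version: block until the client's next slot opens.
              # A single-threaded dispatcher would need a deferred requeue
              # instead of a sleep (see below).
              time.sleep(mds_per_client_request_sleep - elapsed)
      session.last_req_timestamp = time.monotonic()
      process(session, request)  # stand-in for the real request handling

  def process(session, request):
      print("handling", request)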
>
> That said, I haven't understood exactly how to inject that delay just
> yet. Is h_c_r async per req, or is it looping with one thread over the
> queued requests?

There's a single dispatch thread that runs all pending incoming
messages. Either you process the message completely and send any
necessary replies, or you put the message on some kind of waitlist
that gets picked up later (either automatically by the dispatch
thread, or when a Timer goes off and retries it).

> If it's async per Session or req, then we could just
> sleep right there in h_c_r. If h_c_r is handled by one thread, we need
> to be more clever. Is there a standard way to tell a client to retry
> the req after some delay?

Nope. You'd probably have to create a new ordered queue of delayed
messages, and on every loop in _dispatch() check to see if the front
timestamp is allowed to be processed yet.
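
A rough sketch of that delayed-queue idea in plain Python (all names
hypothetical; the real _dispatch() loop is C++ and structured quite
differently):

  import heapq
  import time

  _delayed = []  # min-heap of (not_before, seq, message); seq keeps FIFO order on ties
  _seq = 0

  def defer(message, delay):
      """Park a message until `delay` seconds from now."""
      global _seq
      heapq.heappush(_delayed, (time.monotonic() + delay, _seq, message))
      _seq += 1

  def dispatch_once(incoming, process):
      """One pass of the dispatch loop: first release any deferred
      messages whose delay has expired, in order, then handle whatever
      arrived from the wire."""
      now = time.monotonic()
      while _delayed and _delayed[0][0] <= now:
          _, _, msg = heapq.heappop(_delayed)
          process(msg)
      while incoming:
          process(incoming.pop(0))

The real dispatcher would also need a timer event so that deferred
messages still get picked up when no new traffic arrives to drive the
loop.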

That said, this isn't actually nearly sufficient, because...


>
> Also, this kind of v1 PoC would obviously have the same
> mds_per_client_request_sleep for all clients. v2 could add a
> configurable sleep for specific clients.
>
> And in addition to the per-client approach, a second idea would be to
> throttle per mount prefix (which would be useful in cases of multiple
> clients accessing the same path, e.g. multi-tenant with Manila).
> A simple way to achieve this would be to use the session's
> client_metadata.root as a key in a hash of last req time (per mount
> root), delaying requests as needed (like above).
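
A sketch of that variant, keyed on the mount root instead of the client
(plain Python; client_metadata.root is the session field referred to
above, everything else is made up for illustration):

  import time

  per_root_sleep = {"/volumes/share-1": 0.01}  # min seconds between requests per root
  _last_req_by_root = {}

  def required_delay(session_root):
      """Return how long this request should be deferred (0.0 means run now)."""
      sleep = per_root_sleep.get(session_root, 0.0)
      if sleep <= 0:
          return 0.0
      now = time.monotonic()
      remaining = sleep - (now - _last_req_by_root.get(session_root, 0.0))
      if remaining > 0:
          return remaining
      _last_req_by_root[session_root] = now
      return 0.0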
>
> The holy grail path throttle solution would be to allow throttling per
> subpath, e.g. for a home directory use-case where you have 10000
> subdirs in /home/, and we want to throttle any /home/{*}/ to 100Hz.
> This could be exposed as an xattr on a directory, but for each request
> we'd have to resolve the path upwards to find a req/s quota (like we
> do for space quotas) and sleep accordingly.
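
The per-request quota lookup for that could look roughly like the
following walk up the path (plain Python; the req/s "xattr" store here
is purely hypothetical, no such Ceph xattr exists today):

  import posixpath

  # pretend per-directory xattr store: path -> requests-per-second limit
  req_quota = {
      "/home": None,
      "/home/alice": 100,
  }

  def resolve_req_quota(path):
      """Walk from `path` up toward '/' and return the nearest req/s
      quota, the same way space quotas are resolved; None if unset."""
      p = path
      while True:
          q = req_quota.get(p)
          if q is not None:
              return q
          if p in ("/", ""):
              return None
          p = posixpath.dirname(p)

  # resolve_req_quota("/home/alice/project/data") -> 100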
>
> > - If you're thinking of something more like a cgroup, how will you
> > determine how large a pool of operations you will have, and can parcel
> > out to each client? If you've parceled out 100% of your MDS ops budget,
> > how will you rebalance things when new clients are added or removed from
> > the cluster?
>
> I'm not a fan of the cgroup approach because it's nicer if the
> throttling can be enforced/configured dynamically on the server-side.
> The simple sleep proposed above is inspired by the various osd sleeps
> that we've added over the years -- they turn out to be super effective
> for busy prod clusters.
> If we realize in prod that the sleep is too aggressive, we can just
> lower it on the mds's as needed :)
>
> > - if a client is holding a file lock, then throttling it could delay it
> > releasing locks and that could slow down other (mostly idle) clients
> > that are contending for it. Do we care? How will we deal with that
> > situation if so?
>
> That would indeed be a concern, but my understanding from reading
> dispatch_client_request is that these are all client metadata ops like
> lookup, create, rm, and setxattr, and not the responses to cap-revoke
> requests from the MDS to the client.
> Is that right?

Unfortunately the load you're seeing from client opens and closes is
not one of the reasonably easy-to-handle request types, but an
MClientCaps message.
Delaying these could generally be VERY BAD, because that's how clients
acknowledge file capability changes from the MDS when revoking caps,
and do other things that directly feed into the system's performance
and correctness.

I don't *think* you could get any reordering issues as long as it's
just a delay on message delivery from clients, but you'd definitely
see things like much worse behavior on any shared systems with Fw
access for multiple clients.

These are all problems that will eventually need to be solved for QoS
reasons, but any quick fixes are going to be very dependent on the use
case to be reliable.


Given the example you have of a "misbehaving" client, it may be
simpler to provide options for de-tuning MDS performance a bit. That
open/close flush behavior you and Sage discuss was added specifically
to address some performance or cache size issue but could definitely
be configurable.
-Greg

>
> Thanks!
>
> Dan
>
> > - Would you need separate tunables for OSD and MDS ops, or is there some
> > way to tune both under a single knob?
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> >
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


