Re: per client mds throttle

On Wed, 2019-07-03 at 13:44 +0200, Dan van der Ster wrote:
> On Wed, Jul 3, 2019 at 12:30 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Tue, 2019-07-02 at 17:24 +0200, Dan van der Ster wrote:
> > > Hi,
> > > 
> > > Are there any plans to implement a per-client throttle on mds client requests?
> > > 
> > > We just had an interesting case where a new cephfs user was hammering
> > > an mds from several hosts. In the end we found that their code was
> > > doing:
> > > 
> > >   while d := getafewbytesofdata():
> > >       f = open("file.dat", "a")
> > >       f.write(d)
> > >       f.close()
> > > 
> > > By changing their code to:
> > > 
> > >   f = open("file.dat", "a")
> > >   while d := getafewbytesofdata():
> > >       f.write(d)
> > >   f.close()
> > > 
> > > their load on the mds disappears completely (each open/close pair
> > > costs mds round trips, while the appends themselves only go to the
> > > osds).
> > > 
> > > In a multi-user environment it's hard to scrutinize every user's
> > > application, so we'd prefer to just throttle down the client req rates
> > > (and let them suffer from the poor performance).
> > > 
> > > Thoughts?
> > > 
> > > 
> > 
> > (cc'ing Xuehan)
> > 
> > It sounds like a reasonable thing to do at first glance. There was a
> > patchset recently from Xuehan Xu adding a new io controller policy for
> > cephfs, but that was focused more on OSD ops issued on behalf of cephfs
> > clients, so it's not quite what you're asking about.
> > 
> > The challenge with all of these sorts of throttling schemes is how to
> > parcel things out to individual clients. MDS/OSD ops are not a discrete
> > resource, and it's difficult to gauge how much to allocate to each
> > client.
> > 
> > I think if we were going to do something along these lines, it'd be good
> > to work out how you'd throttle both MDS and OSD ops to keep a lid on
> > things. That said, this is not a trivial problem to tackle, IMO.
> > 
> 
> Thanks for the reply, Jeff.
> At the moment I'm only considering adding a simple throttle on the MDS
> client ops. From a practical standpoint, we have already seen clients
> overloading MDSes, but we haven't suffered from any OSD-related load
> issues.
> Plus, the distributed QoS work should be the better solution for the
> OSD side, AFAIU.
> 
> > Some questions to get you started should you choose to pursue this:
> > 
> > - Will you throttle these ops at the MDS or on the clients? Ditto for
> > the OSDs...
> > 
> > - How will it work? Will there be a fixed cap of some sort for a given
> > amount of time, or are you looking more to just delay processing of
> > ops for a single client when it's "too busy"?
> 
> My basic idea is to add this machinery early on in handle_client_request:
>   - record each Session's last req timestamp
>   - if now()-last_req_timestamp for a given req/session is less than a
> configurable delay, inject a delay. (e.g.
> mds_per_client_request_sleep, defaults to 0, we'd use 0.01 to throttle
> clients to 100Hz each)
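> 
> In (hypothetical) Python-style pseudocode, since the real change would
> be C++ in the MDS, the logic is roughly:
> 
>   import time
> 
>   mds_per_client_request_sleep = 0.01   # 100 req/s per throttle key
>   last_req_time = {}                    # throttle key -> last req time
> 
>   def maybe_delay(key):
>       now = time.monotonic()
>       last = last_req_time.get(key)
>       if last is not None and now - last < mds_per_client_request_sleep:
>           # inject the remainder of the configured delay
>           time.sleep(mds_per_client_request_sleep - (now - last))
>       last_req_time[key] = time.monotonic()
> 
> where the throttle key would be the Session (client id) in the
> per-client case.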
> 
> That said, I haven't understood exactly how to inject that delay just
> yet. Is h_c_r async per req, or is it looping with one thread over the
> queued requests? If it's async per Session or req, then we could just
> sleep right there in h_c_r. If h_c_r is handled by one thread, we need
> to be more clever. Is there a standard way to tell a client to retry
> the req after some delay?
> 
> Also, this kind of v1 PoC would obviously have the same
> mds_per_client_request_sleep for all clients. v2 could add a
> configurable sleep for specific clients.
> 
> And in addition to the per-client approach, a second idea would be to
> throttle per mount prefix (which would be useful in cases of multiple
> clients accessing the same path, e.g. multi-tenant with Manila).
> A simple way to achieve this would be to use the session's
> client_metadata.root as a key in a hash of last req time (per mount
> root), delaying requests as needed (like above).
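> 
> In terms of the sketch above, only the throttle key changes (treating
> client_metadata as a map of strings; sketch only):
> 
>   def maybe_delay_by_root(session):
>       # all clients sharing a mount root share one throttle
>       maybe_delay(session.client_metadata["root"])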
> 
> The holy grail path throttle solution would be to allow throttling per
> subpath, e.g. for a home directory use-case where you have 10000
> subdirs in /home/, and we want to throttle any /home/{*}/ to 100Hz.
> This could be exposed as an xattr on a directory, but for each request
> we'd have to resolve the path upwards to find a req/s quota (like we
> do for space quotas) and sleep accordingly.
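> 
> Resolving the quota could look something like this (sketch; the
> per-directory req/s quota map is hypothetical):
> 
>   import posixpath
> 
>   def find_reqs_quota(path, reqs_quota):
>       # walk upwards to the nearest ancestor that sets a req/s
>       # quota, analogous to how space quotas are resolved
>       while True:
>           if path in reqs_quota:
>               return reqs_quota[path]
>           if path == "/":
>               return None
>           path = posixpath.dirname(path)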
> 

Ok, so you're looking at doing something on the MDS. I'll leave the
specifics of how that should be implemented to those more familiar with
the MDS code.

That said, it might be nice to make that somewhat adaptive, so that a
client can have short periods of "spiky" MDS activity without delays.
That could be added once the basic, simple method you describe above is
implemented, however.
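
For example, a small per-session token bucket would allow bursts while
still capping the sustained rate. A rough, untested sketch of the idea
(Python pseudocode, not actual MDS code):

  import time

  class TokenBucket:
      # rate: sustained req/s cap; burst: extra requests allowed in
      # a short spike before delays kick in
      def __init__(self, rate, burst):
          self.rate = rate
          self.burst = burst
          self.tokens = burst
          self.last = time.monotonic()

      def delay_needed(self):
          # refill based on elapsed time; spend a token if one is
          # available, otherwise report how long to wait for one
          now = time.monotonic()
          self.tokens = min(self.burst,
                            self.tokens + (now - self.last) * self.rate)
          self.last = now
          if self.tokens >= 1.0:
              self.tokens -= 1.0
              return 0.0
          return (1.0 - self.tokens) / self.rate

The request handler would then sleep for delay_needed() before
processing a request from that session.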

> > - If you're thinking of something more like a cgroup, how will you
> > determine how large a pool of operations you will have, and can parcel
> > out to each client? If you've parceled out 100% of your MDS ops budget,
> > how will you rebalance things when new clients are added or removed from
> > the cluster?
> 
> I'm not a fan of the cgroup approach because it's nicer if the
> throttling can be enforced/configured dynamically on the server-side.
> The simple sleep proposed above is inspired by the various osd sleeps
> that we've added over the years -- they turn out to be super effective
> for busy prod clusters.
> If we realize in prod that the sleep is too aggressive, we can just
> lower it on the mds's as needed :)
> 
> > - if a client is holding a file lock, then throttling it could delay it
> > releasing locks and that could slow down other (mostly idle) clients
> > that are contending for it. Do we care? How will we deal with that
> > situation if so?
> 
> That would indeed be a concern, but my understanding from checking
> dispatch_client_request is that these are all client md ops like
> lookup, create, rm, setxattr; and not the responses to a revoke cap
> request from the mds to the client.
> Is that right?
> 

Sort of. I was mostly thinking of flock()/fcntl() style locks. AFAICT,
dispatch_client_request also handles CEPH_MDS_OP_SETFILELOCK. You'd
probably want to avoid delaying any of those calls that release a file
lock, but it would probably be OK to delay requests for new locks.

But, you bring up a good point; caps and dentry leases are really just a
different form of revocable lock and those can be contended as well.

I think as an overall design goal, you'd want to focus on delaying
requests for "new state" (opens, locks, caps, directory leases), but
things like cap or dentry releases should always be processed ASAP.
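
Very roughly, the dispatch path would classify each request before
applying any delay. The op names here are illustrative, not the exact
CEPH_MDS_OP_* values:

  def should_throttle(req):
      # never delay ops that release state (caps, dentry leases,
      # file locks), since other clients may be waiting on them
      if req.op == "setfilelock" and req.lock_type == "unlock":
          return False
      if req.op in ("cap_release", "dentry_release"):
          return False
      # ops acquiring new state are fair game for throttling
      return req.op in ("open", "create", "mkdir", "lookup",
                        "setfilelock")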

That may be a bit more involved than what you were originally
suggesting...
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
