Restricting scope to RADOS ops doesn't appear to address the broader
motivations for the scheduler, I think. cf Kyle's mail.

Matt

On Thu, Mar 22, 2018 at 7:26 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> On Thu, Mar 22, 2018 at 5:17 PM, Yehuda Sadeh-Weinraub
> <ysadehwe@xxxxxxxxxx> wrote:
>> On Thu, Mar 22, 2018 at 12:09 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>> One of the benefits of the asynchronous beast frontend in radosgw is
>>> that it allows us to do things like request throttling and priority
>>> queuing that would otherwise block frontend threads - which are a
>>> scarce resource in civetweb's thread-per-connection model.
>>>
>>> The primary goal of this project is to prevent large object data
>>> workloads from starving out cheaper requests. After some discussion
>>> in the Ann Arbor office, our resident dmclock expert Eric Ivancich
>>> convinced us that mclock was a good fit. I've spent the week
>>> exploring a design for this, and wanted to get some early feedback:
>>>
>>> Each HTTP request would be assigned a request class (dmclock calls
>>> them clients) and a cost.
>>>
>>> The four initial request classes:
>>> - auth: requests for swift auth tokens, and eventually sts
>>> - admin: admin APIs for use by the dashboard and multisite sync
>>> - data: object io
>>> - metadata: everything else, such as bucket operations, object stat, etc.
>>>
>>> Calculating a cost is difficult, especially for the two major cases
>>> where we'd want it: object GET requests (because we have to check
>>> with RADOS before we know the actual size), and object PUT requests
>>> that use chunked transfer-encoding. I'd love to hear ideas for this,
>>> but for now I think it's good enough to assign everything a cost of 1
>>> so that all of the units are in requests/sec. I believe this is what
>>> the osd is doing now as well?
>>>
>>
>> That does sound like the simpler solution, and it should be a good
>> enough starting point.
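[For readers following along, the class/cost mapping proposed above could
be sketched roughly as follows. This is illustrative only: the names
(`RequestClass`, `classify`, the op strings) are hypothetical, not actual
radosgw identifiers. Everything gets a flat cost of 1 so the mclock units
come out in requests/sec, as proposed.]

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of the four proposed request classes -- these
// names are illustrative, not actual radosgw identifiers.
enum class RequestClass { Auth, Admin, Data, Metadata };

struct ClassifiedRequest {
  RequestClass klass;
  uint32_t cost;  // flat cost of 1 => mclock units are requests/sec
};

// Assign one of the four proposed classes; anything unrecognized
// falls through to metadata (bucket operations, object stat, etc.).
ClassifiedRequest classify(const std::string& op) {
  if (op == "swift_auth" || op == "sts_assume_role")
    return {RequestClass::Auth, 1};
  if (op == "admin_get_usage")
    return {RequestClass::Admin, 1};
  if (op == "get_obj" || op == "put_obj")
    return {RequestClass::Data, 1};
  return {RequestClass::Metadata, 1};
}
```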
>> What if we could integrate it at a much lower layer, e.g., into
>> librados?
>
> So a queue for outgoing osd ops instead of http requests? That could
> be interesting. It would certainly better capture the cost for reads
> and writes. cls stuff might be harder to model. I worry about putting
> a queue so close to Objecter's throttles, though - maybe this would
> work best inside the Objecter as a replacement for the throttles?
>
> I think we'd still need something at a higher level though, to prevent
> us from reading in a ton of data from PUT requests before blocking to
> write it out to rados.
>
>>
>>> New virtual functions in class RGWOp seem like a good way for the
>>> derived Ops to return their request class and cost. Once we know
>>> those, we can add ourselves to the mclock priority queue and do an
>>> async wait until it's our turn to run.
>>>
>>> But where exactly does this step fit into the request processing
>>> pipeline? Does it happen before or after
>>> authentication/authorization? I'm leaning towards after, so that
>>> auth failures get filtered out before they enter the queue.
>>
>> What about the situation where you have a bad actor flooding with
>> badly authenticated requests?
>
> Yeah, good point. Filtering anything out just means that mclock can't
> do its job of providing fairness for the remaining requests.
>
>>
>>> The priority queue can use perf counters for introspection, and a
>>> config observer to apply changes to the per-client mclock options.
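[The "new virtual functions in class RGWOp" idea above might look
something like this sketch. `RGWOp` is real radosgw code, but the
virtuals, the `request_class` enum, and `RGWOpSketch` itself are
assumptions about the proposed design, not existing interfaces.]

```cpp
#include <cstdint>

// Assumed sketch: class RGWOp exists in radosgw, but these virtuals
// and their names are hypothetical, per the proposal above.
enum class request_class { auth, admin, data, metadata };

struct RGWOpSketch {  // stand-in for the real RGWOp base class
  virtual ~RGWOpSketch() = default;

  // Derived ops override these; the frontend would read them before
  // enqueueing the request into the mclock priority queue and doing
  // an async wait for its turn to run.
  virtual request_class get_request_class() const {
    return request_class::metadata;  // default: cheap metadata request
  }
  virtual uint32_t get_request_cost() const {
    return 1;  // flat cost for now, per the discussion above
  }
};

struct RGWGetObjSketch : RGWOpSketch {
  request_class get_request_class() const override {
    return request_class::data;  // object io
  }
};
```

[The point of the virtuals is that the queue only ever sees the base
class, so each derived op decides its own class/cost without the
frontend needing per-op knowledge.]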
>>>
>>> As future work, we could add some load balancer integration to:
>>> - enable custom scripts that look at incoming requests and assign
>>>   their own request class/cost
>>> - track distributed client stats across gateways, and feed that info
>>>   back into radosgw with each request (this is the d in dmclock)
>>>
>>> Thanks,
>>> Casey

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html