Re: mclock priority queue in radosgw

Casey Bodley <cbodley@xxxxxxxxxx> writes:

> Hi Kyle,
>
> On Thu, Mar 22, 2018 at 7:40 PM, Kyle Bader <kyle.bader@xxxxxxxxx> wrote:
>> From a capacity planning perspective, it would be fantastic to be able
>> to limit the request volume per bucket. In Amazon S3, they provide
>> roughly 300 PUT/LIST/DELETE per second or 800 GET per second. Taking
>> those values and translating them into sensible default weight seems
>> like a good start. The ability to scale the limits as the bucket is
>> sharded would further enhance fidelity with Amazon's behavior. When
>
> Okay, I could see that working with two request classes for each
> bucket instead of just data+metadata. I'm not sure how well the
> priority queue itself will handle a large number of different clients,
> but I could do some microbenchmarks to see.
>
> Aside from the ability to set limits, dmclock also supports
> reservations and weighting for fairness. Do you think those features
> are as interesting as the limits, on a per-bucket dimension?
>
> If not, maybe per-bucket limits (and per-user, as Robin points out
> later in the thread) would work better as separate steps underneath
> the dmclock priority queue. Done separately, it would be easier to
> optimize for a large number of buckets/users, and could support
> configuring different limits for specific buckets/users.
I guess this might be the way to go. Beyond the many-writes-to-a-single-bucket
problem, it should also help prevent a single user from using up too much of
the cluster's resources, and the ability to configure this at a user/bucket
level would be a huge win for cluster administrators who know their workloads
and want to whitelist or blacklist particular ones.
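
To make the idea concrete, something like the rough sketch below is what I
have in mind: a per-bucket/per-user limiter (a plain token bucket) sitting in
front of the dmclock queue. All of the names here (EntityLimiter, TokenBucket)
and the default numbers are made up purely for illustration, nothing of this
exists in radosgw today:

// Hypothetical sketch only; not actual radosgw code.
#include <algorithm>
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

// A simple token bucket: `rate` ops/sec sustained, up to `burst` ops at once.
struct TokenBucket {
  double rate = 300.0;   // placeholder default, e.g. PUT/LIST/DELETE per bucket
  double burst = 300.0;
  double tokens = 300.0;
  std::chrono::steady_clock::time_point last = std::chrono::steady_clock::now();

  bool try_consume(double cost) {
    auto now = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = now - last;
    last = now;
    tokens = std::min(burst, tokens + elapsed.count() * rate);
    if (tokens < cost) return false;   // caller would answer with 503 here
    tokens -= cost;
    return true;
  }
};

// Per-bucket (or per-user) limiter kept as a separate step in front of dmclock.
class EntityLimiter {
  std::mutex mtx;
  std::unordered_map<std::string, TokenBucket> buckets;
 public:
  bool allow(const std::string& entity, double cost = 1.0) {
    std::lock_guard<std::mutex> l(mtx);
    return buckets[entity].try_consume(cost);
  }
};

A request that fails try_consume() would be the natural place to reject with
503 instead of queuing, per the point below.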
> On the other hand, if reservations and weighting between buckets is
> important, should those values scale with shards as well? As the
> number of bucket shards in the cluster grows, the other non-bucket
> request classes (admin and auth) would get a smaller proportion and
> need adjusting.
>
>> you exceed the number of requests per second in Amazon, you get a 503:
>> "Slow down" error, we should probably do similar. All these things go
>
> Agreed! It makes sense to return 503 once you reach the limit, instead
> of queuing. The dmclock priority queue doesn't support this now, but
> I'm guessing that it could be made to.

Swift clients have similar logic, though Swift's dedicated error code is the
non-standard 498, on which clients start backing off. That said, even
returning a 503 would be enough for most HTTP clients to start some sort of
backoff.
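
Purely as an illustration of that mapping (ApiType and throttle_response are
invented names for this sketch, not anything that exists in radosgw):

// Illustrative only: translating "over limit" into the protocol-appropriate status.
#include <string>
#include <utility>

enum class ApiType { S3, Swift };

// Returns {status, reason}. S3 answers 503 Slow Down; Swift's rate limiting
// conventionally uses the non-standard 498.
std::pair<int, std::string> throttle_response(ApiType api) {
  if (api == ApiType::Swift)
    return {498, "Rate Limited"};
  return {503, "Slow Down"};
}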
> That would mean civetweb could take advantage of this too, which would
> be wonderful.
>
>> a long way in protecting the system from being abused as a k/v store,
>> misguided tenants can't sap the seeks from folks who are using the
>> system for appropriately sized objects.
>>
>> https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
>> https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html
>> On Thu, Mar 22, 2018 at 3:09 PM, Abhishek <abhishek@xxxxxxxx> wrote:
>>> On 2018-03-22 22:17, Yehuda Sadeh-Weinraub wrote:
>>>>
>>>> On Thu, Mar 22, 2018 at 12:09 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>>>>
>>>>> One of the benefits of the asynchronous beast frontend in radosgw is that
>>>>> it
>>>>> allows us to do things like request throttling and priority queuing that
>>>>> would otherwise block frontend threads - which are a scarce resource in
>>>>> civetweb's thread-per-connection model.
>>>>>
>>>>> The primary goal of this project is to prevent large object data
>>>>> workloads
>>>>> from starving out cheaper requests. After some discussion in the Ann
>>>>> Arbor
>>>>> office, our resident dmclock expert Eric Ivancich convinced us that
>>>>> mclock
>>>>> was a good fit. I've spent the week exploring a design for this, and
>>>>> wanted
>>>>> to get some early feedback:
>>>>>
>>>>> Each HTTP request would be assigned a request class (dmclock calls them
>>>>> clients) and a cost.
>>>>>
>>>>> The four initial request classes:
>>>>> - auth: requests for swift auth tokens, and eventually sts
>>>>> - admin: admin APIs for use by the dashboard and multisite sync
>>>>> - data: object io
>>>>> - metadata: everything else, such as bucket operations, object stat, etc.
>>>>>
>>>>> Calculating a cost is difficult, especially for the two major cases where
>>>>> we'd want it: object GET requests (because we have to check with RADOS
>>>>> before we know its actual size), and object PUT requests that use chunked
>>>>> transfer-encoding. I'd love to hear ideas for this, but for now I think
>>>>> it's
>>>>> good enough to assign everything a cost of 1 so that all of the units are
>>>>> in
>>>>> requests/sec. I believe this is what the osd is doing now as well?
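
As a strawman of what the flat cost of 1 could look like (all names invented
here, not actual radosgw code):

// Rough sketch: the four request classes and a flat cost of 1 so that all
// units stay in requests/sec.
#include <cstdint>

enum class ReqClass : uint8_t {
  Auth,      // swift auth tokens, eventually sts
  Admin,     // dashboard and multisite sync APIs
  Data,      // object io
  Metadata,  // bucket ops, object stat, everything else
};

// Cost is 1 for now; a later version could scale data requests by size once
// the object size (GET) or chunked upload length (PUT) is known.
inline double request_cost(ReqClass /*rc*/) {
  return 1.0;
}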
>>>>>
>>>>
>>>> That does sound like the simpler solution that should be good enough
>>>> starting point. What if we could integrate it in a much lower layer,
>>>> e.g., into librados?
>>>>
>>>>> New virtual functions in class RGWOp seem like a good way for the derived
>>>>> Ops to return their request class and cost. Once we know those, we can
>>>>> add
>>>>> ourselves to the mclock priority queue and do an async wait until its our
>>>>> turn to run.
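
One possible shape for those virtuals, purely a sketch with placeholder names
rather than the real RGWOp interface:

// Hypothetical sketch; the names here are not the actual RGWOp API.
#include <cstdint>

enum class ReqClass : uint8_t { Auth, Admin, Data, Metadata };

struct RGWOpSketch {
  virtual ~RGWOpSketch() = default;
  // Which dmclock "client" (request class) this op belongs to.
  virtual ReqClass dmclock_class() const { return ReqClass::Metadata; }
  // Cost in the same units as the queue (requests for now, maybe bytes later).
  virtual double dmclock_cost() const { return 1.0; }
};

// A data op would simply override the defaults.
struct RGWGetObjSketch : RGWOpSketch {
  ReqClass dmclock_class() const override { return ReqClass::Data; }
};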
>>>>>
>>>>> But where exactly does this step fit into the request processing
>>>>> pipeline?
>>>>> Does it happen before or after authentication/authorization? I'm leaning
>>>>> towards after, so that auth failures get filtered out before they enter
>>>>> the
>>>>> queue.
>>>>
>>>>
>>>> What about the situation where you have a bad actor flooding with
>>>> badly authenticated requests?
>>>
>>>
>>> For non admin requests, maybe we could use the user parameter to
>>> start increasing the cost associated with the user as more requests start to
>>> pile up (though this isn't strictly affected by before/after authentication
>>> as we
>>> populate the user info before that anyway)
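
Roughly what I had in mind there, with all names invented for illustration:

// Rough illustration: scale a user's request cost by how many of their
// requests are already piled up in the queue.
#include <string>
#include <unordered_map>

class UserCostScaler {
  std::unordered_map<std::string, unsigned> inflight;  // queued requests per user
 public:
  void on_enqueue(const std::string& user) { ++inflight[user]; }
  void on_dequeue(const std::string& user) {
    auto it = inflight.find(user);
    if (it != inflight.end() && it->second > 0) --it->second;
  }
  // Base cost 1, growing as more of this user's requests sit in the queue.
  double cost_for(const std::string& user) const {
    auto it = inflight.find(user);
    unsigned n = (it == inflight.end()) ? 0 : it->second;
    return 1.0 + n / 100.0;  // arbitrary scale factor for illustration
  }
};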
>>>
>>>>>
>>>>> The priority queue can use perf counters for introspection, and a config
>>>>> observer to apply changes to the per-client mclock options.
>>>>>
>>>>> As future work, we could add some load balancer integration to:
>>>>> - enable custom scripts that look at incoming requests and assign their
>>>>> own
>>>>> request class/cost
>>>>> - track distributed client stats across gateways, and feed that info back
>>>>> into radosgw with each request (this is the d in dmclock)
>>>>>
>>>>> Thanks,
>>>>> Casey
>
