Casey Bodley <cbodley@xxxxxxxxxx> writes:

> Hi Kyle,
>
> On Thu, Mar 22, 2018 at 7:40 PM, Kyle Bader <kyle.bader@xxxxxxxxx> wrote:
>> From a capacity planning perspective, it would be fantastic to be able
>> to limit the request volume per bucket. In Amazon S3, they provide
>> roughly 300 PUT/LIST/DELETE per second or 800 GET per second. Taking
>> those values and translating them into sensible default weights seems
>> like a good start. The ability to scale the limits as the bucket is
>> sharded would further enhance fidelity with Amazon's behavior. When
>
> Okay, I could see that working with two request classes for each
> bucket instead of just data+metadata. I'm not sure how well the
> priority queue itself will handle a large number of different clients,
> but I could do some microbenchmarks to see.
>
> Aside from the ability to set limits, dmclock also supports
> reservations and weighting for fairness. Do you think those features
> are as interesting as the limits, on a per-bucket dimension?
>
> If not, maybe per-bucket limits (and per-user, as Robin points out
> later in the thread) would work better as separate steps underneath
> the dmclock priority queue. Done separately, it would be easier to
> optimize for a large number of buckets/users, and could support
> configuring different limits for specific buckets/users.

I guess this might be the way to go. It should also help prevent a
single user from using up too much of the cluster's resources, which
is a separate problem from many writes landing in a single bucket, and
the ability to configure this at a user/bucket level would be a huge
win for cluster administrators who know their workloads and want to
whitelist or blacklist specific ones. (A rough sketch of such a
limiter follows further below.)

> On the other hand, if reservations and weighting between buckets is
> important, should those values scale with shards as well? As the
> number of bucket shards in the cluster grows, the other non-bucket
> request classes (admin and auth) would get a smaller proportion and
> need adjusting.
>
>> you exceed the number of requests per second in Amazon, you get a 503:
>> "Slow down" error, we should probably do similar. All these things go
>
> Agreed! It makes sense to return 503 once you reach the limit, instead
> of queuing. The dmclock priority queue doesn't support this now, but
> I'm guessing that it could be made to.

Swift clients have similar logic, though Swift's dedicated error code
is the non-standard 498, on which clients start backing off. That
said, even returning a 503 would be sufficient for most HTTP clients
to start doing some sort of backoff.
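As a minimal illustration of that client-side behavior (purely a
sketch, not taken from any particular client; the send callback stands
in for the client's actual HTTP call and returns the response status
code):

  #include <chrono>
  #include <functional>
  #include <random>
  #include <thread>

  // Retry on throttling responses (503, or swift's 498) with
  // exponential backoff plus jitter. Returns true on success.
  bool send_with_backoff(const std::function<int()>& send,
                         int max_retries = 5) {
    std::mt19937 rng{std::random_device{}()};
    std::chrono::milliseconds delay{100};

    for (int attempt = 0; attempt <= max_retries; ++attempt) {
      const int status = send();
      if (status != 503 && status != 498) {
        return status < 400;  // done: success or a non-throttling error
      }
      // Throttled: sleep for the current delay plus some jitter,
      // then double the delay for the next attempt.
      std::uniform_int_distribution<long long> jitter(0, delay.count());
      std::this_thread::sleep_for(
          delay + std::chrono::milliseconds(jitter(rng)));
      delay *= 2;
    }
    return false;  // still throttled after all retries
  }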
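And to make the per-bucket/per-user limits concrete, here is the rough
sketch I mentioned above: a token bucket per bucket and request class,
sitting underneath the dmclock queue. All of the names and the S3-like
per-shard defaults are hypothetical, just to show the shape of it:

  #include <algorithm>
  #include <chrono>
  #include <mutex>
  #include <string>
  #include <unordered_map>

  enum class ReqClass { Read, Write };  // GETs vs PUT/LIST/DELETE

  class BucketLimiter {
    struct TokenBucket {
      double tokens;
      std::chrono::steady_clock::time_point last_refill;
    };

    // S3-like defaults, per shard: ~800 GET/s, ~300 PUT/LIST/DELETE/s.
    // Placeholder numbers; real defaults would be configurable.
    static constexpr double read_rate_per_shard = 800.0;
    static constexpr double write_rate_per_shard = 300.0;

    std::mutex mutex;
    std::unordered_map<std::string, TokenBucket> buckets;

  public:
    // Returns false when the request should be answered with
    // 503 "Slow Down" instead of being queued.
    bool try_consume(const std::string& bucket, ReqClass rc,
                     int num_shards) {
      const double rate = (rc == ReqClass::Read ? read_rate_per_shard
                                                : write_rate_per_shard)
                          * std::max(num_shards, 1);
      const auto now = std::chrono::steady_clock::now();
      const std::string key =
          bucket + (rc == ReqClass::Read ? ":r" : ":w");

      std::lock_guard<std::mutex> lock(mutex);
      auto& tb =
          buckets.try_emplace(key, TokenBucket{rate, now}).first->second;

      // Refill based on elapsed time, capped at one second's worth.
      const std::chrono::duration<double> elapsed = now - tb.last_refill;
      tb.tokens = std::min(rate, tb.tokens + elapsed.count() * rate);
      tb.last_refill = now;

      if (tb.tokens < 1.0)
        return false;  // over the limit -> 503
      tb.tokens -= 1.0;
      return true;
    }
  };

A per-user limiter would look the same with the user as the key.
Scaling the rate with the shard count mirrors how S3's limits grow as
a bucket is resharded, and a false return maps naturally onto the 503
Slow Down response.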
> That would mean civetweb could take advantage of this too, which would
> be wonderful.
>
>> a long way in protecting the system from being abused as a k/v store,
>> misguided tenants can't sap the seeks from folks who are using the
>> system for appropriately sized objects.
>>
>> https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
>> https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html
>>
>> On Thu, Mar 22, 2018 at 3:09 PM, Abhishek <abhishek@xxxxxxxx> wrote:
>>> On 2018-03-22 22:17, Yehuda Sadeh-Weinraub wrote:
>>>> On Thu, Mar 22, 2018 at 12:09 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>>>> One of the benefits of the asynchronous beast frontend in radosgw
>>>>> is that it allows us to do things like request throttling and
>>>>> priority queuing that would otherwise block frontend threads -
>>>>> which are a scarce resource in civetweb's thread-per-connection
>>>>> model.
>>>>>
>>>>> The primary goal of this project is to prevent large object data
>>>>> workloads from starving out cheaper requests. After some
>>>>> discussion in the Ann Arbor office, our resident dmclock expert
>>>>> Eric Ivancich convinced us that mclock was a good fit. I've spent
>>>>> the week exploring a design for this, and wanted to get some early
>>>>> feedback:
>>>>>
>>>>> Each HTTP request would be assigned a request class (dmclock calls
>>>>> them clients) and a cost.
>>>>>
>>>>> The four initial request classes:
>>>>> - auth: requests for swift auth tokens, and eventually sts
>>>>> - admin: admin APIs for use by the dashboard and multisite sync
>>>>> - data: object io
>>>>> - metadata: everything else, such as bucket operations, object
>>>>>   stat, etc.
>>>>>
>>>>> Calculating a cost is difficult, especially for the two major
>>>>> cases where we'd want it: object GET requests (because we have to
>>>>> check with RADOS before we know its actual size), and object PUT
>>>>> requests that use chunked transfer-encoding. I'd love to hear
>>>>> ideas for this, but for now I think it's good enough to assign
>>>>> everything a cost of 1 so that all of the units are in
>>>>> requests/sec. I believe this is what the osd is doing now as well?
>>>>
>>>> That does sound like the simpler solution, and it should be a good
>>>> enough starting point. What if we could integrate it in a much
>>>> lower layer, e.g., into librados?
>>>>
>>>>> New virtual functions in class RGWOp seem like a good way for the
>>>>> derived Ops to return their request class and cost. Once we know
>>>>> those, we can add ourselves to the mclock priority queue and do an
>>>>> async wait until it's our turn to run.
>>>>>
>>>>> But where exactly does this step fit into the request processing
>>>>> pipeline? Does it happen before or after authentication and
>>>>> authorization? I'm leaning towards after, so that auth failures
>>>>> get filtered out before they enter the queue.
>>>>
>>>> What about the situation where you have a bad actor flooding with
>>>> badly authenticated requests?
>>>
>>> For non-admin requests, maybe we could use the user parameter to
>>> start increasing the cost associated with the user as more requests
>>> start to pile up (though this isn't strictly affected by
>>> before/after authentication, as we populate the user info before
>>> that anyway)
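To make that suggestion a bit more concrete, here's a hypothetical
sketch (none of this exists in radosgw today): track recent auth
failures per user, or per source address when no user is known yet,
and use the count as a cost multiplier when enqueuing into dmclock:

  #include <algorithm>
  #include <chrono>
  #include <mutex>
  #include <string>
  #include <unordered_map>

  // Hypothetical: scale a request's dmclock cost by the number of
  // recent authentication failures attributed to the same user (or
  // source IP, for requests that fail before a user is known).
  class AuthPenalty {
    struct Entry {
      int failures = 0;
      std::chrono::steady_clock::time_point last_failure;
    };
    std::mutex mutex;
    std::unordered_map<std::string, Entry> entries;
    static constexpr auto decay_window = std::chrono::seconds(30);

  public:
    void record_failure(const std::string& who) {
      std::lock_guard<std::mutex> lock(mutex);
      auto& e = entries[who];
      // Forget old failures so a user isn't penalized forever.
      const auto now = std::chrono::steady_clock::now();
      if (now - e.last_failure > decay_window) {
        e.failures = 0;
      }
      ++e.failures;
      e.last_failure = now;
    }

    // Cost multiplier for the dmclock queue: 1 when the user is
    // behaving, growing (up to a cap) as bad requests pile up.
    int cost_multiplier(const std::string& who) {
      std::lock_guard<std::mutex> lock(mutex);
      auto it = entries.find(who);
      if (it == entries.end()) return 1;
      return std::min(1 + it->second.failures, 16);  // arbitrary cap
    }
  };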
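And going back to the RGWOp idea above, the proposed virtuals could be
as small as the sketch below. RGWOp is stripped down to a stub here,
and the RGWGetObj override is only an example; this shows the shape of
the proposal, not actual radosgw code:

  #include <cstdint>

  enum class RGWRequestClass {
    Auth,      // swift auth tokens, eventually sts
    Admin,     // dashboard and multisite sync APIs
    Data,      // object io
    Metadata,  // everything else: bucket ops, object stat, ...
  };

  class RGWOp {
  public:
    virtual ~RGWOp() = default;
    // Defaults cover the common case; individual ops override as needed.
    virtual RGWRequestClass request_class() const {
      return RGWRequestClass::Metadata;
    }
    // A cost of 1 keeps all units in requests/sec until we can do better.
    virtual uint64_t request_cost() const { return 1; }
  };

  class RGWGetObj : public RGWOp {
  public:
    RGWRequestClass request_class() const override {
      return RGWRequestClass::Data;
    }
    // A future version could scale cost with object size, but we only
    // learn the size from RADOS after the request is already admitted.
  };

With metadata and a cost of 1 as the defaults, each derived op only
needs to override what differs.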
>>>>> The priority queue can use perf counters for introspection, and a
>>>>> config observer to apply changes to the per-client mclock options.
>>>>>
>>>>> As future work, we could add some load balancer integration to:
>>>>> - enable custom scripts that look at incoming requests and assign
>>>>>   their own request class/cost
>>>>> - track distributed client stats across gateways, and feed that
>>>>>   info back into radosgw with each request (this is the d in
>>>>>   dmclock)
>>>>>
>>>>> Thanks,
>>>>> Casey
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html