Dear list,
I'm planning to do some refactoring in the RGWPutObjProcessor stack as
part of the async request processing project, which involves replacing
any blocking waits on AioCompletion::wait_for_safe() with ones that
suspend/resume the coroutine from the beast frontend.
Most of this blocking happens in throttle_data(), down in
RGWPutObjProcessor_Aio, which gets called after each buffer is passed to
handle_data(). If handle_data() results in a write to rados, it returns
a 'void *handle' for the AioCompletion, which is then passed back to
throttle_data(), where it's registered as 'pending' and waited on if
necessary. See put_data_and_throttle() in rgw_op.h [1] for the canonical
example of this handle_data()/throttle_data() loop, which is duplicated
in several other places.
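To make the shape of that loop concrete, here's a rough sketch (the
signatures are paraphrased from memory, so see [1] for the real thing):

  // rough sketch of the handle_data()/throttle_data() loop; see
  // put_data_and_throttle() in rgw_op.h [1] for the actual code
  int put_data_and_throttle(RGWPutObjDataProcessor *processor,
                            bufferlist& data, off_t ofs,
                            bool need_to_wait)
  {
    bool again = false;
    do {
      void *handle = nullptr;
      rgw_raw_obj obj;
      const uint64_t size = data.length();

      // may or may not issue a rados write; if it does, 'handle'
      // refers to the resulting AioCompletion
      int r = processor->handle_data(data, ofs, &handle, &obj, &again);
      if (r < 0) {
        return r;
      }
      if (!handle) { // no rados write was issued for this buffer
        break;
      }
      // register the write as 'pending' and wait if necessary
      r = processor->throttle_data(handle, obj, size, need_to_wait);
      if (r < 0) {
        return r;
      }
      need_to_wait = false; // only applies to the first write
    } while (again); // the filter wants the same buffer again
    return 0;
  }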
This control flow became a bit more convoluted with the addition of the
PutObj filters (to support compression in jewel and encryption in
luminous), which are stacked on top of the PutObjProcessor. Now this
AioCompletion handle is passed all the way up the stack, and then back
down again for throttle_data().
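To illustrate what I mean by the handle traveling up and back down, a
filter in this stack looks roughly like the following (a hypothetical
pass-through filter, not real code; compression/encryption also
transform the data, of course):

  class PassThroughFilter : public RGWPutObjDataProcessor {
    RGWPutObjDataProcessor *next; // the filter/processor below us
   public:
    explicit PassThroughFilter(RGWPutObjDataProcessor *n) : next(n) {}

    int handle_data(bufferlist& bl, off_t ofs, void **phandle,
                    rgw_raw_obj *pobj, bool *again) override {
      // the handle for whatever write the lower layer issued comes
      // back up the stack through *phandle
      return next->handle_data(bl, ofs, phandle, pobj, again);
    }
    int throttle_data(void *handle, const rgw_raw_obj& obj,
                      uint64_t size, bool need_to_wait) override {
      // and gets handed back down until it reaches
      // RGWPutObjProcessor_Aio::throttle_data()
      return next->throttle_data(handle, obj, size, need_to_wait);
    }
  };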
This model has several issues:
* If one call to a filter's handle_data() function generates multiple
rados writes, only the handle for the final write is returned up the
stack and passed back to throttle_data(). This is generally avoided with
the 'bool *again' flag, but that requires the application logic, i.e.
put_data_and_throttle(), to keep passing the same buffer through
handle_data()/throttle_data() until again==false.
* Where compression is involved, the application is dealing with
uncompressed buffer sizes, but we want to throttle based on the
compressed size of rados writes instead.
* Throttling is based on the size of the last bufferlist passed to
handle_data() at the top of the stack. Some filters do internal
buffering, and RGWPutObjProcessor_Atomic itself will buffer up data from
multiple calls until we have rgw_max_chunk_size to write at once. So the
final call to handle_data() may be much smaller, yet that's the size
argument passed to throttle_data(). [2]
On the other hand, one potential advantage of this model is that the
application can do some extra work between the calls to handle_data()
and throttle_data(), for a small benefit to parallelism. The only thing
that currently does this is fetch_remote_obj() for opstate tracking, but
I believe that's obsolete and have a PR [3] to remove it.
I'd like to propose that we invert this control flow so that
throttle_data() is called by RGWPutObjProcessor_Aio immediately after
submitting each aio write to rados. That way any blocking happens at the
bottom of the stack before returning. Not only does that address the
issues listed above, but it also prevents the AioCompletion-based
implementation details from leaking into these interfaces. That in turn
will make it easier to plug in a different strategy for use with beast,
which will likely combine AioCompletion callbacks with asio-style
asynchronous waits. And if another case arises where we want to perform
some extra work before throttling, we could accomplish that by passing
some kind of callback interface into RGWPutObjProcessor_Aio.
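For illustration, the write path inside RGWPutObjProcessor_Aio could end
up looking something like this (the names here are hypothetical, just to
show the shape, not an actual patch):

  // hypothetical sketch of the inverted control flow: the Aio
  // processor throttles itself right after submitting each write, so
  // callers and filters never see the AioCompletion handle
  int RGWPutObjProcessor_Aio::write_and_throttle(const rgw_raw_obj& obj,
                                                 bufferlist& bl)
  {
    const uint64_t size = bl.length();
    void *handle = nullptr;
    int r = submit_aio_write(obj, bl, &handle); // issue the rados write
    if (r < 0) {
      return r;
    }
    // account for the compressed/encrypted size actually written, and
    // block here (or, under beast, suspend/resume the coroutine) until
    // the data in flight drops back under the window
    return throttle_data(handle, obj, size, /*need_to_wait=*/false);
  }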
Any feedback/objections/alternatives?
Thanks,
Casey
[1] https://github.com/ceph/ceph/blob/e03d228ab08049ba3b7fc64533d299868640cf17/src/rgw/rgw_op.h#L1859-L1886
[2] http://tracker.ceph.com/issues/24594
[3] https://github.com/ceph/ceph/pull/24059