Dear list,
I'm planning to do some refactoring in the RGWPutObjProcessor stack as
part of the async request processing project, which involves replacing
any blocking waits on AioCompletion::wait_for_safe() with ones that
suspend/resume the coroutine from the beast frontend.
Most of this blocking happens in throttle_data(), down in
RGWPutObjProcessor_Aio, which gets called after each buffer is passed to
handle_data(). If handle_data() results in a write to rados, it returns
a 'void *handle' for the AioCompletion, which is then passed back to
throttle_data(), where it's registered as 'pending' and waited on if
necessary. See put_data_and_throttle() in rgw_op.h [1] for the canonical
example of this handle_data()/throttle_data() loop, which is duplicated
in several other places.
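To make the shape of that loop concrete, here's a rough sketch (the
signatures are paraphrased from memory, so see [1] for the real thing):

  // rough sketch of the handle_data()/throttle_data() loop; see
  // put_data_and_throttle() in rgw_op.h [1] for the actual code
  int put_data_and_throttle(RGWPutObjDataProcessor *processor,
                            bufferlist& data, off_t ofs,
                            bool need_to_wait)
  {
    bool again = false;
    do {
      void *handle = nullptr;
      rgw_raw_obj obj;
      const uint64_t size = data.length();

      // may or may not issue a rados write; if it does, 'handle'
      // refers to the resulting AioCompletion
      int r = processor->handle_data(data, ofs, &handle, &obj, &again);
      if (r < 0) {
        return r;
      }
      if (!handle) { // no rados write was issued for this buffer
        break;
      }
      // register the write as 'pending' and wait if necessary
      r = processor->throttle_data(handle, obj, size, need_to_wait);
      if (r < 0) {
        return r;
      }
      need_to_wait = false; // only applies to the first write
    } while (again); // the filter wants the same buffer again
    return 0;
  }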
This control flow became a bit more convoluted with the addition of the
PutObj filters (to support compression in jewel and encryption in
luminous), which are stacked on top of the PutObjProcessor. Now this
AioCompletion handle is passed all the way up the stack, and then back
down again for throttle_data().
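To illustrate what I mean by the handle traveling up and back down, a
filter in this stack looks roughly like the following (a hypothetical
pass-through filter, not real code; compression/encryption also
transform the data, of course):

  class PassThroughFilter : public RGWPutObjDataProcessor {
    RGWPutObjDataProcessor *next; // the filter/processor below us
   public:
    explicit PassThroughFilter(RGWPutObjDataProcessor *n) : next(n) {}

    int handle_data(bufferlist& bl, off_t ofs, void **phandle,
                    rgw_raw_obj *pobj, bool *again) override {
      // the handle for whatever write the lower layer issued comes
      // back up the stack through *phandle
      return next->handle_data(bl, ofs, phandle, pobj, again);
    }
    int throttle_data(void *handle, const rgw_raw_obj& obj,
                      uint64_t size, bool need_to_wait) override {
      // and gets handed back down until it reaches
      // RGWPutObjProcessor_Aio::throttle_data()
      return next->throttle_data(handle, obj, size, need_to_wait);
    }
  };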
This model has several issues:
* If one call to a filter's handle_data() function generates multiple
rados writes, only the handle for the final write is returned up the
stack and passed back to throttle_data(). This is generally avoided with
the 'bool *again' flag, but that requires the application logic, i.e.
put_data_and_throttle(), to keep passing the same buffer through
handle_data()/throttle_data() until again==false.
* Where compression is involved, the application is dealing with
uncompressed buffer sizes, but we want to throttle based on the
compressed size of rados writes instead.
* Throttling is based on the size of the last bufferlist passed to
handle_data() at the top of the stack. Some filters do internal
buffering, and RGWPutObjProcessor_Atomic itself will buffer up data from
multiple calls until we have rgw_max_chunk_size to write at once. So the
final call to handle_data() may be much smaller, yet that's the size
argument passed to throttle_data(). [2]
On the other hand, one potential advantage of this model is that the
application can do some extra work between the calls to handle_data()
and throttle_data(), for a small benefit to parallelism. The only thing
that currently does this is fetch_remote_obj() for opstate tracking, but
I believe that's obsolete and have a PR [3] to remove it.
I'd like to propose that we invert this control flow so that
throttle_data() is called by RGWPutObjProcessor_Aio immediately after
submitting each aio write to rados. That way any blocking happens at the
bottom of the stack before returning. Not only does that address the
issues listed above, but it also prevents the AioCompletion-based
implementation details from leaking into these interfaces. That in turn
will make it easier to plug in a different strategy for use with beast,
which will likely combine AioCompletion callbacks with asio-style
asynchronous waits. And if another case arises where we want to perform
some extra work before throttling, we could accomplish that by passing
some kind of callback interface into RGWPutObjProcessor_Aio.
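For illustration, the write path inside RGWPutObjProcessor_Aio could end
up looking something like this (the names here are hypothetical, just to
show the shape, not an actual patch):

  // hypothetical sketch of the inverted control flow: the Aio
  // processor throttles itself right after submitting each write, so
  // callers and filters never see the AioCompletion handle
  int RGWPutObjProcessor_Aio::write_and_throttle(const rgw_raw_obj& obj,
                                                 bufferlist& bl)
  {
    const uint64_t size = bl.length();
    void *handle = nullptr;
    int r = submit_aio_write(obj, bl, &handle); // issue the rados write
    if (r < 0) {
      return r;
    }
    // account for the compressed/encrypted size actually written, and
    // block here (or, under beast, suspend/resume the coroutine) until
    // the data in flight drops back under the window
    return throttle_data(handle, obj, size, /*need_to_wait=*/false);
  }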
Any feedback/objections/alternatives?
Thanks,
Casey
[1] https://github.com/ceph/ceph/blob/e03d228ab08049ba3b7fc64533d299868640cf17/src/rgw/rgw_op.h#L1859-L1886
[2] http://tracker.ceph.com/issues/24594
[3] https://github.com/ceph/ceph/pull/24059