Re: [PATCH 00/20] rbd: support for object-map and fast-diff

On Sun, Jun 30, 2019 at 3:55 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>
>
>
> Hi Ilya,
>
> Nice work. Some comments/questions:
>
> 1) Generally having a state machine makes things easier to track as the
> code is less dispersed than before.
>
> 2) The running_list is used to keep track of in-flight requests when
> exclusive lock is needed, to support rbd_quiesce_lock() when releasing
> the lock. It would be great to generalize this list to track all
> in-flight requests, even when the lock is not required; it could then
> be used to support a generic flush (similar to the librbd work queue
> flush/drain). Having a generic in-flight request list would also make
> it easier to support other functions such as task aborts, request
> timeouts, and pre-snapshot flushing.

Hi Maged,

We could extend it in the future, but right now it is there just for
exclusive lock.  Likely, we would need a different list for tracking
all in-flight requests because if we start adding random requests
(e.g. those that don't require exclusive lock), deadlocks will occur.
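
For illustration, a generic in-flight list with a drain operation could
look roughly like the sketch below.  This is just to show the shape of
the idea; none of these identifiers exist in drivers/block/rbd.c.

  #include <linux/list.h>
  #include <linux/spinlock.h>
  #include <linux/wait.h>

  /* Hypothetical tracker for all in-flight image requests, with a
   * librbd-style drain.  Not actual rbd code.
   */
  struct inflight_tracker {
          spinlock_t lock;
          struct list_head requests;      /* all in-flight image requests */
          wait_queue_head_t drain_wq;     /* woken when a request completes */
  };

  static bool tracker_empty(struct inflight_tracker *t)
  {
          bool empty;

          spin_lock(&t->lock);
          empty = list_empty(&t->requests);
          spin_unlock(&t->lock);
          return empty;
  }

  static void tracker_add(struct inflight_tracker *t, struct list_head *node)
  {
          spin_lock(&t->lock);
          list_add_tail(node, &t->requests);
          spin_unlock(&t->lock);
  }

  static void tracker_del(struct inflight_tracker *t, struct list_head *node)
  {
          spin_lock(&t->lock);
          list_del(node);
          spin_unlock(&t->lock);
          wake_up(&t->drain_wq);
  }

  /* Block until every request added before this call has completed. */
  static void tracker_drain(struct inflight_tracker *t)
  {
          wait_event(t->drain_wq, tracker_empty(t));
  }

Something along these lines would have to be a new list, separate from
running_list, for the deadlock reason above.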

>
> 3) For the acquiring_list, my concern is that while the lock is pending
> to be acquired, requests are accepted without limit. If there is a delay
> acquiring the lock, for example if the primary of the object header is
> down (which could block for ~25 sec), or worse if the pool is inactive,
> the count could well exceed the max queue depth, and for write requests
> this can consume a lot of memory.

As you noted in a different email, it is limited by queue_depth.  No
different from regular I/O taking too long.

>
> 4) In rbd_img_exclusive_lock(), at the end, we queue a lock-acquire task
> for every request. I understand this is a single-threaded queue and that
> once the lock is acquired all acquire tasks are cancelled; however, I
> feel the queue could fill up a lot. Any chance we can schedule only one
> acquire task?

This is not new, it's done in existing code too.  queue_delayed_work()
bails if the work is already queued, so ->lock_dwork is never queued
more than once.
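
In other words (rough illustration only, the helper name below is made
up; ->lock_dwork is the real field):

  #include <linux/workqueue.h>

  /* queue_delayed_work() returns false and does nothing if the work is
   * already pending, so calling this once per request still leaves at
   * most one queued acquire task.
   */
  static void schedule_lock_acquire(struct workqueue_struct *wq,
                                    struct delayed_work *lock_dwork)
  {
          if (!queue_delayed_work(wq, lock_dwork, 0))
                  pr_debug("lock acquire already scheduled\n");
  }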

>
> 5) The state RBD_IMG_EXCLUSIVE_LOCK is used both when the image does not
> require exclusive lock and when the lock is acquired. Cosmetically, it
> may be better to separate them.

Not all requests require the exclusive lock even if the exclusive-lock
feature is enabled (e.g. reads when the object-map feature is disabled).
We switch out of RBD_IMG_EXCLUSIVE_LOCK early (i.e. without waiting) in
those cases, and the case of the exclusive-lock feature being disabled
simply falls into the same basket.
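
Heavily simplified (this is a sketch of the control flow, not the code
from the series; the helpers below are hypothetical):

  /* Returns true if the request can leave RBD_IMG_EXCLUSIVE_LOCK right
   * away, false if it has to wait for the lock.  "Exclusive-lock feature
   * disabled" is just one of the ways need_exclusive_lock() comes back
   * false.
   */
  static bool rbd_img_exclusive_lock_step(struct rbd_img_request *img_req)
  {
          if (!need_exclusive_lock(img_req))
                  return true;            /* no waiting */

          if (lock_already_held(img_req))
                  return true;            /* no waiting either */

          add_to_acquiring_list(img_req);
          schedule_lock_acquire_work(img_req);
          return false;                   /* wait until the lock is acquired */
  }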

>
> 6) Probably not an issue, but we are now creating a large number of
> mutexes, at least two for every request. Maybe we need to test at high
> IOPS/queue depths to make sure there is no overhead from this.

Yes, that would be useful.  If you do that, please share your results!

We can get that down to one by reusing image requests with the help of
blk-mq.  We would still need to allocate them for accessing the parent
image, but for regular image requests it should be pretty easy to do.

Reusing object requests is going to be harder, but we could probably
preallocate one per image request, as that is by far the most common
case.
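
For the blk-mq part, the usual trick is to size the per-request PDU so
the image request is preallocated by the block layer.  A rough sketch
(hypothetical helper; the existing tag_set setup is open-coded in
rbd_init_disk(), and only cmd_size plus the lookup in the queue_rq
handler would change):

  static int rbd_init_tag_set(struct rbd_device *rbd_dev)
  {
          struct blk_mq_tag_set *tag_set = &rbd_dev->tag_set;

          memset(tag_set, 0, sizeof(*tag_set));
          tag_set->ops = &rbd_mq_ops;
          tag_set->queue_depth = rbd_dev->opts->queue_depth;
          tag_set->numa_node = NUMA_NO_NODE;
          tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
          tag_set->nr_hw_queues = 1;
          /* embed the image request instead of allocating it per I/O */
          tag_set->cmd_size = sizeof(struct rbd_img_request);

          return blk_mq_alloc_tag_set(tag_set);
  }

  /* ... and in the queue_rq handler, blk_mq_rq_to_pdu(rq) hands back the
   * embedded struct rbd_img_request, ready to be initialized in place.
   */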

>
> 7) Is there any consideration to splitting the rbd module into multiple
> files? Looking at how big librbd is, fitting this into a single kernel
> file is challenging at best.

Dongsheng expressed this concern as well.  Personally, I don't mind
a large file as long as it is nicely structured.  Adding object-map and
fast-diff required just two forward declarations, one of which could be
avoided.

We are never going to fit the entire librbd into the kernel, no matter
how many files ;)  Breaking up into multiple files complicates blaming,
so I would rather keep it as is at this point.

Thanks,

                Ilya


