Re: [PATCH v2 00/39] Add the dm-vdo deduplication and compression device mapper target.

Mike Snitzer <snitzer@xxxxxxxxxx> · Thu, 27 Jul 2023 10:57:31 -0400

On Wed, Jul 26 2023 at  7:32P -0400,
Ken Raeburn <raeburn@xxxxxxxxxx> wrote:

> 
> An offline discussion suggested maybe I should've gone into a little
> more detail about how VDO uses its work queues.
> 
> VDO is sufficiently work-intensive that we found long ago that doing all
> the work in one thread wouldn't keep up.
> 
> Our multithreaded design started many years ago and grew out of our
> existing design for UDS (VDO's central deduplication index), which,
> somewhat akin to partitioning and sharding in databases, does scanning
> of the in-memory part of the "database" of values in some number (fixed
> at startup) of threads, with the data and work divided up based on
> certain bits of the hash value being looked up, and performs its I/O and
> callbacks from certain other threads. We aren't splitting work to
> multiple machines as database systems sometimes do, but to multiple
> threads and potentially multiple NUMA nodes.
> 
> We try to optimize for keeping the busy case fast, even if it means
> light usage loads don't perform quite as well as they could be made to.
> We try to reduce instances of contention between threads by avoiding
> locks when we can, preferring a fast queueing mechanism or loose
> synchronization between threads. (We haven't kept to it strictly, but
> we've mostly tried to.)
> 
> In VDO, at the first level, the work is split according to the
> collection of data structures to be updated (e.g., recovery journal vs
> disk block allocation vs block address mapping management).
> 
> For some data structures, we split the structures further based on
> values of relevant bit-strings for the data structure in question (block
> addresses, hash values). Currently we can split the work N ways for many
> small values of N but it's hard to change N without restarting. The
> processing of a read or write operation generally doesn't need to touch
> more than one "zone" in any of these sets (or two, in a certain write
> case).
> 
> Giving one thread exclusive access to the data structures means we can
> do away with the locking. Of course, with so many different threads
> owning data structures, we get a lot of queueing in exchange, but we
> depend on a fast, nearly-lock-free MPSC queueing mechanism to keep that
> reasonably efficient.
> 
> There's a little more to it in places where we need to preserve the
> order of processing of multiple VIOs in a couple different sections of
> the write path. So we do make some higher-level use of the fact that
> we're adding work to queues with certain behavior, and not just turning
> loose a bunch of threads to contend for a just-released mutex.
> 
> Some other bits of work like computing the hash value don't update any
> other data structures, and not only would be amenable to kernel
> workqueue conversion with concurrency greater than 1, but such a
> conversion might open up some interesting options, like hashing on the
> CPU or NUMA node where the data block is likely to reside in cache. But
> for now, using one work management mechanism has been easier than two.
> 
> The experiment I referred to in my earlier email with using kernel
> workqueues in VDO kept the same model of protecting data structures by
> making them exclusive to specific threads (or in this case,
> concurrency-1 workqueues) to serialize all access and using message
> passing; it didn't change everything over to using mutexes instead.
> 
> I hope some of this helps. I'm happy to answer further questions.
> 
> Ken
> 

Thanks for the extra context, but a _big_ elephant in the room for
this line of discussion is that: the Linux workqueue code has
basically always been only available for use by GPL'd code.  Given
VDO's historic non-GPL origins, it seems _to me_ that an alternative
to Linux's workqueues had to be created to allow VDO to drive its
work.  While understandable, I gave guidance 6 years ago that VDO
engineering should work to definitively reconcile if using Linux
workqueues viable now that VDO has been GPL'd.

But it appears there wasn't much in the way of serious effort put to
completely converting to using Linux workqueues.  That is a problem
because all of the work item strategy deployed by VDO is quite
bespoke.  I don't think the code lends itself to being properly
maintained by more than a 1 or 2 engineers (if we're lucky at this
point).

And while I appreciate that the prospect of _seriously_ converting
over to using Linux workqueues is itself destabilizing and challenging
effort: it seems that it needs to be done to legitimately position the
code to go upstream.

I would like to see a patch crafted that allows branching between the
use of Linux and VDO workqueues. Initially have a dm-vdo modparam
(e.g. use_vdo_wq or vice-versa: use_linux_wq).  And have a wrapping
interface and associated data struct(s) that can bridge between work
being driven/coordinated by either (depending on disposition of
modparam).

This work isn't trivial, I get that. But it serves to clearly showcase
shortcomings, areas for improvement, while pivoting to more standard
Linux interfaces that really should've been used from VDO's inception.

Is this work that you feel you could focus on with urgency?

Thanks,
Mike