We use a sort of message-passing arrangement where a worker thread is
responsible for updating certain data structures as needed for the I/Os
in progress, rather than having the processing of each I/O contend for
locks on the data structures. It gives us some good throughput under
load, but it does mean upwards of a dozen handoffs per 4kB write,
depending on compressibility, whether the block is a duplicate, and
various other factors. So processing 1 GB/s means handling over 3M
messages per second, though each step of processing is generally
lightweight.
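
As a rough sanity check on that figure, here is the arithmetic spelled
out. It assumes binary units (1 GiB/s of 4 KiB writes) and exactly
twelve handoffs per write; both are my assumptions, since the real count
varies per block.

```python
# Back-of-the-envelope message rate, under the assumptions stated above.
BLOCK_SIZE = 4 * 1024        # bytes per write (assumes "4kB" means 4 KiB)
THROUGHPUT = 1024 ** 3       # bytes per second (assumes "1 GB/s" means 1 GiB/s)
HANDOFFS_PER_WRITE = 12      # "upwards of a dozen"; the real count varies

writes_per_second = THROUGHPUT // BLOCK_SIZE
messages_per_second = writes_per_second * HANDOFFS_PER_WRITE

print(writes_per_second, messages_per_second)
```

With decimal units the result lands at exactly 3M, so the estimate is
not sensitive to that choice.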

There seems to be a natural duality between
work items passing between threads, each exclusively owning a structure,
vs structures passing between threads, each exclusively owning a work
item. In the first, the threads are grabbing a notional 'lock' on each
item in turn to deal with their structure, as VDO does now; in the
second, the threads are grabbing locks on each structure in turn to deal
with their item.
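
To make the duality concrete, here is a toy sketch in Python rather
than kernel C. The structures "A", "B", "C", the function names, and
the worker count are all hypothetical; nothing here reflects VDO's
actual layout. Both variants apply one update per structure per item
and end in the same state.

```python
import queue
import threading

# Hypothetical stand-ins for shared structures; not VDO's real layout.
STRUCTURES = ["A", "B", "C"]


def run_message_passing(items):
    """Each structure is owned by one thread; items hop between queues."""
    updates = {name: 0 for name in STRUCTURES}
    inboxes = [queue.Queue() for _ in STRUCTURES]

    def owner(index):
        name = STRUCTURES[index]
        is_last = index == len(STRUCTURES) - 1
        while True:
            item = inboxes[index].get()
            if item is not None:
                updates[name] += 1  # no lock: only this thread touches it
            if not is_last:
                inboxes[index + 1].put(item)  # hand off (or forward sentinel)
            if item is None:
                return

    threads = [threading.Thread(target=owner, args=(i,))
               for i in range(len(STRUCTURES))]
    for thread in threads:
        thread.start()
    for item in items:
        inboxes[0].put(item)
    inboxes[0].put(None)  # sentinel: shuts the pipeline down in order
    for thread in threads:
        thread.join()
    return updates


def run_structure_locking(items, workers=4):
    """Each item is owned by one thread; structures are locked in turn."""
    updates = {name: 0 for name in STRUCTURES}
    locks = {name: threading.Lock() for name in STRUCTURES}
    pending = queue.Queue()
    for item in items:
        pending.put(item)

    def worker():
        while True:
            try:
                pending.get_nowait()  # claim one item for this thread
            except queue.Empty:
                return
            for name in STRUCTURES:
                with locks[name]:  # hold the structure only for this step
                    updates[name] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return updates
```

In the first variant every item costs one queue operation per
structure; in the second it costs one queue operation in total plus one
lock acquisition per structure.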

If kernel workqueues have higher overhead per item for the lightweight
work VDO currently does in each step, perhaps the dual of the current
scheme would let more work get done per fixed queuing overhead, and thus
perform better? VIOs could take locks on sections of structures, and
operate on multiple structures before requeueing.
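
A crude cost model of that trade-off, with entirely made-up numbers:
the function name, the cost units, and every value below are
assumptions for illustration, not measurements of VDO or of kernel
workqueues.

```python
def per_write_overhead(steps, steps_per_queueing, queue_cost, step_cost,
                       lock_cost=0):
    """Total cost of one write in arbitrary units (hypothetical model)."""
    # Ceiling division: how many times the item has to be queued.
    queueings = -(-steps // steps_per_queueing)
    return queueings * queue_cost + steps * (step_cost + lock_cost)


# Current scheme: one queueing per step, no locks.
message_passing = per_write_overhead(12, 1, queue_cost=100, step_cost=20)

# Dual scheme: four steps per queueing, but each step pays for a lock.
structure_locking = per_write_overhead(12, 4, queue_cost=100, step_cost=20,
                                       lock_cost=10)

print(message_passing, structure_locking)
```

The dual only wins in this model while the lock cost per step stays
well below the queueing cost it replaces; under contention that may not
hold.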

This might also enable more fine-grained locking of structures than the
chunks uniquely owned by threads at the moment. It would also be
attractive to let the kernel workqueues deal with concurrency
management instead of configuring the number of threads for each of a
bunch of different structures at start time.

On the other hand, a number of years ago I played around, for fun on the
side, with switching VDO from message passing to structure locking, just
extremely naively replacing each message handoff with releasing a mutex
on the current set of structures and (trying to) taking a mutex on the
next set of structures, and ran into some complexity around certain
ordering requirements. I think they were around recovery journal entries
going into the slab journal and the block map in the same order, and
also around the use of different priorities for different items. I
don't have that code anymore, unfortunately, so I don't know how hard it
would be to try that experiment again.

Sweet Tea