If kernel workqueues have higher overhead per item for the lightweight
work VDO currently does in each step, perhaps the dual of the current
scheme would let more work get done per fixed queuing overhead, and
thus perform better? VIOs could take locks on sections of structures,
and operate on multiple structures before requeueing.
Can you suggest a little more specifically what the "dual" is you're
picturing?
It sounds like your experiment consisted of one kernel workqueue per
existing thread, with VIOs queueing on each thread in turn precisely as
they do at present, so that when the VIO work item is running it's
guaranteed to be the unique actor on a particular set of structures
(e.g. for a physical thread the physical zone and slabs).
I am thinking of an alternate scheme where e.g. each slab, each block
map zone, each packer would be protected by a lock instead of owned by a
thread. There would be one workqueue with concurrency allowed where all
VIOs would operate.
VIOs would do an initial queuing on a kernel workqueue, and then when
the VIO work item runs, they'd take and hold the appropriate locks
while they operated on each structure. So they'd take and release slab
locks until they found a free block; send off to UDS and get requeued
when it came back or the timer expired; try to compress and take/release
a lock on the packer while adding itself to a bin and get requeued if
appropriate when the packer released it; write and requeue when the
write finishes if relevant. Then I think the 'make whatever modification
to structures is relevant' part can be done without any requeue: take
and release the recovery journal lock; ditto on the relevant slab; again
the journal; again the other slab; then the part of the block map; etc.
Yes, there are some intriguing ordering requirements to work through, but
maybe as an initial performance experiment the ordering can be ignored
to get an idea of whether this scheme could provide acceptable performance.
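To make the take-and-release pattern concrete, here is a minimal userspace
sketch using pthreads rather than kernel workqueues. Everything in it is
hypothetical: the struct slab and struct journal stand-ins, the counts, and
the function names are made up for illustration and are not VDO code. Each
"VIO" runs straight through on a concurrent worker, holds at most one lock
at a time, and ignores ordering entirely, as proposed for the initial
experiment.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_SLABS 4
#define BLOCKS_PER_SLAB 2
#define NUM_VIOS 6

/* Hypothetical stand-ins for VDO structures; each carries its own lock. */
struct slab {
	pthread_mutex_t lock;
	int free_blocks;
};

struct journal {
	pthread_mutex_t lock;
	int entries;
};

static struct slab slabs[NUM_SLABS];
static struct journal journal;

/* Take and release slab locks in turn until one yields a free block. */
static int allocate_block(void)
{
	for (int i = 0; i < NUM_SLABS; i++) {
		int found = 0;

		pthread_mutex_lock(&slabs[i].lock);
		if (slabs[i].free_blocks > 0) {
			slabs[i].free_blocks--;
			found = 1;
		}
		pthread_mutex_unlock(&slabs[i].lock);
		if (found)
			return i;
	}
	return -1;
}

/* One "VIO": runs without requeueing, holding at most one lock at a time. */
static void *vio_work(void *arg)
{
	(void)arg;
	if (allocate_block() >= 0) {
		pthread_mutex_lock(&journal.lock);
		journal.entries++;
		pthread_mutex_unlock(&journal.lock);
	}
	return NULL;
}

/* Run all VIOs concurrently; returns the number of journal entries made. */
int run_vios(void)
{
	pthread_t threads[NUM_VIOS];

	pthread_mutex_init(&journal.lock, NULL);
	journal.entries = 0;
	for (int i = 0; i < NUM_SLABS; i++) {
		pthread_mutex_init(&slabs[i].lock, NULL);
		slabs[i].free_blocks = BLOCKS_PER_SLAB;
	}
	for (int i = 0; i < NUM_VIOS; i++) {
		int rc = pthread_create(&threads[i], NULL, vio_work, NULL);

		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < NUM_VIOS; i++)
		pthread_join(threads[i], NULL);
	for (int i = 0; i < NUM_SLABS; i++)
		pthread_mutex_destroy(&slabs[i].lock);
	pthread_mutex_destroy(&journal.lock);
	return journal.entries;
}

/* Total free blocks left across all slabs (call after run_vios()). */
int total_free_blocks(void)
{
	int total = 0;

	for (int i = 0; i < NUM_SLABS; i++)
		total += slabs[i].free_blocks;
	return total;
}
```

Because no path ever holds two locks at once, this sketch cannot deadlock,
but it also gives no ordering guarantee between the slab update and the
journal entry, which is exactly the part a real design would have to work
through.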
There are also occasionally non-VIO objects which get queued to invoke
actions on various threads, which I expect might further complicate the
experiment.
I think that's the easy part -- queueing a work item to grab a lock and
Do Something seems to me a pretty common thing in the kernel code.
Unless there are ordering requirements among the non-VIOs that I'm not
calling to mind.
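As a rough illustration of what I mean by "grab a lock and Do Something",
here is a userspace sketch with pthreads. It is only an analogy for a kernel
work item: struct work_item, struct packer, flush_packer() and
run_admin_items() are all hypothetical names invented for this example, and
a thread per item stands in for a concurrent workqueue. It assumes the
non-VIO actions have no ordering requirements among themselves.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_ITEMS 16

/* Hypothetical structure that used to be owned by a thread. */
struct packer {
	pthread_mutex_t lock;
	int flush_requests;
};

/* Hypothetical non-VIO work item: a callback plus the structure it targets. */
struct work_item {
	void (*fn)(struct work_item *item);
	struct packer *target;
};

/* The "Do Something" part: runs entirely under the target's lock. */
static void flush_packer(struct work_item *item)
{
	pthread_mutex_lock(&item->target->lock);
	item->target->flush_requests++;
	pthread_mutex_unlock(&item->target->lock);
}

/* Worker entry point: just invokes the item's callback. */
static void *run_item(void *arg)
{
	struct work_item *item = arg;

	item->fn(item);
	return NULL;
}

/* Queue `count` items concurrently; returns how many ran, or -1 if invalid. */
int run_admin_items(int count)
{
	struct packer packer = { .flush_requests = 0 };
	struct work_item items[MAX_ITEMS];
	pthread_t threads[MAX_ITEMS];

	if (count < 0 || count > MAX_ITEMS)
		return -1;
	pthread_mutex_init(&packer.lock, NULL);
	for (int i = 0; i < count; i++) {
		int rc;

		items[i].fn = flush_packer;
		items[i].target = &packer;
		rc = pthread_create(&threads[i], NULL, run_item, &items[i]);
		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < count; i++)
		pthread_join(threads[i], NULL);
	pthread_mutex_destroy(&packer.lock);
	return packer.flush_requests;
}
```

The items may run in any order, so this only covers the case where the
non-VIO actions are independent of one another.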