Re: [PATCH 01/20] block, blk_filter: enable block device filters

Sergei Shtepa <sergei.shtepa@xxxxxxxxx> · Fri, 8 Jul 2022 12:45:33 +0200

On 7/7/22 19:26, Christoph Hellwig wrote:
> 
> On Thu, Jul 07, 2022 at 10:26:55AM +0200, Sergei Shtepa wrote:
>> Thank you, Christoph, for your attention to the patch.
>>
>> I am preparing the next version of the patch. In it, I planned to
>> simplify the bdev_filer code.
>> I will make changes in it, in accordance with your comments, and
>> will add your code and check it on my test labs.
>>
>> But I'm not sure if using the blk_mq_freeze_queue() is appropriate.
>> If I understood the code correctly, it is based on the expectation
>> that the counter q->q_usage_counter will decrease to zero.
>> To increase it, a blk_queue_enter() is used. And at the time of
>> calling the filter_bio() in the submit_bio_noacct(), this counter
>> has not yet been increased. I will double check this and try to
>> get rid of the bdev->bd_filter_lock.
> Indeed.  For this to work we'd need to call the filter driver
> later.  Which is brings up another question:  Is there a real
> need to attach the filter driver to the bdev and thus potentially
> partition?  The rest of the block layer operates on the whole disk
> after the intial partition remapping, and besides allowing the
> filter driver to be called under q_usage_counter, this would
> also clean up some concepts.  It would probably also allow to
> remove the repeat return value over just using submit_bio_noacct
> similar to how normal stacking drivers reinject bios.
> 

Thank you Christoph.
This is the most crucial question for the entire patch.
The filtering location sets restrictions for the filter code and
determines its main algorithm.

1. Work at the partition or disk level?
At the user level, programs operate with block devices.
In fact, the "disk" entity makes sense only for the kernel level. 
When the user chooses which block devices to backup and which not,
he operates with mounting points, which are converted into block
devices, partitions. Therefore, it is better to handle bio before
remapping to disk.
If the filtering is performed after remapping, then we will be
forced to apply a filter to the entire disk, or complicate the
filtering algorithm by calculating which range of sectors bio is
addressed to. And if bio is addressed to the partition boundary...
Filtering at the block device level seems to me a simpler solution.
But this is not the biggest problem.

2. Can the filter sleep or postpone bio processing to the worker thread?
The problem is in the implementation of the COW algorithm.
If I send a bio to read a chunk (one bio), and then pass a write bio,
then with some probability I am reading partially overwritten data.
Writing overtakes reading. And flags REQ_SYNC and REQ_PREFLUSH don't help.
Maybe it's a disk driver issue, or a hypervisor, or a NAS, or a RAID,
or maybe normal behavior. I don't know. Although, maybe I'm not working
correctly with flags. I have seen the comments on patch 11/20, but I am
not sure that the fixes will solve this problem.
But because of this, I have to postpone the write until the read completes.

2.1 The easiest way to solve the problem is to block the writer's thread
with a semaphore. And for bio with a flag REQ_NOWAIT, complete processing
with bio_wouldblock_error(). This is the solution currently being used.

2.2 Another solution is possible without putting the thread into a sleep
state, but with placing a write bio in a queue to another thread.
This solution is used in the veeamsnap out-of-tree module and it has
performance issues. I don't like. But when handling make_request_fn,
which was on kernels before 5.10, there was no choice.

The current implementation, when the filtering is performed before
remapping, allows to handle the bio to the partition, and allows to
switch the writer's thread to the sleep state.
I had to pay for it with a reference counter on the filter and a spinlock.
It may be possible to do better with RCU. I haven't tried it yet.

If I am blocked by the q->q_usage_counter counter, then I will not
be able to execute COW in the context of the current thread due to deadlocks.
I will have to use a scheme with an additional worker thread.
Bio filtering will become much more complicated.

>From an architectural point of view, I see the filter as an intermediate
layer between the file system and the block layer. If we lower the filter
deep into the block layer, then restrictions will be imposed on its use.