On Tue, Mar 18, 2025 at 04:31:35PM +0100, Mikulas Patocka wrote:
>
>
> On Tue, 18 Mar 2025, Ming Lei wrote:
>
> > On Tue, Mar 18, 2025 at 03:26:10PM +0100, Mikulas Patocka wrote:
> > > The block limits may be read while they are being modified. The statement
> >
> > It is supposed not to be so for the I/O path; that is why the queue is
> > usually down or frozen when updating the limits.
>
> The limits are read at some points when constructing a bio - for example
> bio_integrity_add_page, bvec_try_merge_hw_page, bio_integrity_map_user.

For the request-based code path there is no such issue, because the queue
usage counter is grabbed.

It should be a device-mapper-specific issue, because the above interfaces
may not be called with the queue usage counter grabbed from
dm_submit_bio().

One fix is to make sure that the queue usage counter is grabbed in dm's
bio/clone submission code path.

>
> > For other cases, the limit lock can be held to synchronize the
> > read/write.
> >
> > Or do you have cases not covered by either queue freeze or the limit
> > lock?
>
> For example, device mapper reads the limits of the underlying devices
> without holding any lock (dm_set_device_limits,

dm_set_device_limits() needs to be fixed by holding the limit lock.

> __process_abnormal_io,
> __max_io_len).

Those two are called with the queue usage counter grabbed, so they should
be fine.

> It also writes the limits in the I/O path -
> disable_discard, disable_write_zeroes - you couldn't easily lock it here
> because it happens in interrupt context.

IMO that is a bad implementation - why does device mapper have to clear
the limit in bio->bi_end_io() or in the request's blk_mq_ops->complete()?

>
> I'm not sure how many other kernel subsystems do it and whether they could
> all be converted to locking.

Most request-based drivers should have been converted to the new API
already; I guess only device mapper, raid and the other bio-based drivers
still have this kind of risk.

>
> > > "q->limits = *lim" is not really atomic. The compiler may turn it into
> > > memcpy (clang does).
> > >
> > > On x86-64, the kernel uses the "rep movsb" instruction for memcpy - it is
> > > optimized on modern CPUs, but it is not atomic; it may be interrupted at
> > > any byte boundary - and if it is interrupted, the readers may read
> > > garbage.
> > >
> > > On sparc64, there's an instruction that zeroes a cache line without
> > > reading it from memory. The kernel memcpy implementation uses it (see
> > > b3a04ed507bf) to avoid loading the destination buffer from memory. The
> > > problem is that if we copy a block of data to q->limits and someone reads
> > > it at the same time, the reader may read zeros.
> > >
> > > This commit changes it to use WRITE_ONCE, so that individual words are
> > > updated atomically.
> >
> > It isn't necessary for this particular problem, and it is also fragile to
> > provide atomic word updates in such a low-level way - for example, what if
> > sizeof(struct queue_limits) isn't 8-byte aligned?
>
> struct queue_limits contains two "unsigned long" fields, so it must be
> aligned on an "unsigned long" boundary.
>
> In order to make it future-proof, we could use __alignof__(struct
> queue_limits) to determine the size of the update step.

Yeah, that looks fine, but I feel it is still fragile, and I am not sure it
is an acceptable solution.


Thanks,
Ming
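
For illustration only, a minimal sketch of the word-wise update being
discussed could look like the code below. queue_limits_store() is a made-up
name for this example (it is not the posted patch), and it simply assumes
that sizeof(struct queue_limits) stays a multiple of sizeof(unsigned long),
which holds today because the structure contains unsigned long members:

#include <linux/blkdev.h>	/* struct queue_limits */
#include <linux/build_bug.h>
#include <linux/compiler.h>	/* WRITE_ONCE() */

/*
 * Hypothetical helper: copy the new limits into place one machine word
 * at a time with WRITE_ONCE(), so a lockless reader never observes a
 * torn word from a plain memcpy()/"rep movsb" or a cache-line-zeroing
 * store.  Readers can still see a mix of old and new words, which is
 * part of why this approach is considered fragile above.
 */
static void queue_limits_store(struct queue_limits *dst,
			       const struct queue_limits *src)
{
	unsigned long *d = (unsigned long *)dst;
	const unsigned long *s = (const unsigned long *)src;
	size_t i;

	/* Holds today because the struct has unsigned long members. */
	BUILD_BUG_ON(sizeof(struct queue_limits) % sizeof(unsigned long));

	for (i = 0; i < sizeof(struct queue_limits) / sizeof(unsigned long); i++)
		WRITE_ONCE(d[i], s[i]);
}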
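
Similarly, as a rough sketch of the "hold the limit lock" direction for
dm_set_device_limits(), the existing queue_limits_start_update() /
queue_limits_cancel_update() pair could in principle be used as a locked
reader to take a stable snapshot of the underlying queue's limits.
snapshot_bdev_limits() is just a name for this sketch, not an existing
helper, and whether this is the right fix is exactly what is being debated
above:

#include <linux/blkdev.h>

/*
 * Sketch only: take a consistent snapshot of an underlying device's
 * queue_limits under q->limits_lock by starting a limits "update" and
 * immediately cancelling it, so the copy cannot race with a writer.
 */
static struct queue_limits snapshot_bdev_limits(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);
	struct queue_limits lim;

	lim = queue_limits_start_update(q);	/* takes q->limits_lock */
	queue_limits_cancel_update(q);		/* drops the lock, no commit */

	return lim;
}

The snapshot is taken under q->limits_lock, so it cannot observe a
half-written q->limits, at the cost of serializing against limit writers.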