On Tue, Mar 18, 2025 at 04:31:35PM +0100, Mikulas Patocka wrote:
>
>
> On Tue, 18 Mar 2025, Ming Lei wrote:
>
> > On Tue, Mar 18, 2025 at 03:26:10PM +0100, Mikulas Patocka wrote:
> > > The block limits may be read while they are being modified. The statement
> >
> > It is supposed not to be so for the I/O path; that is why the queue is
> > usually down or frozen when updating the limits.
>
> The limits are read at some points when constructing a bio - for example
> bio_integrity_add_page, bvec_try_merge_hw_page, bio_integrity_map_user.

For the request-based code path there is no such issue, because the queue
usage counter is grabbed.

It should be a device-mapper-specific issue, because the above interfaces
may not be called with the queue usage counter grabbed from
dm_submit_bio().

One fix is to make sure that the queue usage counter is grabbed in dm's
bio/clone submission code path.

>
> > For other cases, the limit lock can be held to synchronize the
> > read/write.
> >
> > Or do you have cases not covered by either queue freeze or the limit
> > lock?
>
> For example, device mapper reads the limits of the underlying devices
> without holding any lock (dm_set_device_limits,

dm_set_device_limits() needs to be fixed by holding the limit lock.

> __process_abnormal_io,
> __max_io_len).

Those two are called with the queue usage counter grabbed, so they should
be fine.

> It also writes the limits in the I/O path -
> disable_discard, disable_write_zeroes - you couldn't easily lock it here
> because it happens in interrupt context.

IMO that is a bad implementation - why does device mapper have to clear
the limit in bio->bi_end_io() or in the request's blk_mq_ops->complete()?

>
> I'm not sure how many other kernel subsystems do it and whether they could
> all be converted to locking.

Most request-based drivers should have been converted to the new API
already; I guess only device mapper, raid and the other bio-based drivers
still have this kind of risk.

>
> > > "q->limits = *lim" is not really atomic. The compiler may turn it into
> > > memcpy (clang does).
> > >
> > > On x86-64, the kernel uses the "rep movsb" instruction for memcpy - it is
> > > optimized on modern CPUs, but it is not atomic; it may be interrupted at
> > > any byte boundary - and if it is interrupted, the readers may read
> > > garbage.
> > >
> > > On sparc64, there's an instruction that zeroes a cache line without
> > > reading it from memory. The kernel memcpy implementation uses it (see
> > > b3a04ed507bf) to avoid loading the destination buffer from memory. The
> > > problem is that if we copy a block of data to q->limits and someone reads
> > > it at the same time, the reader may read zeros.
> > >
> > > This commit changes it to use WRITE_ONCE, so that individual words are
> > > updated atomically.
> >
> > It isn't necessary for this particular problem, and it is also fragile to
> > provide atomic word updates in such a low-level way - for example, what if
> > sizeof(struct queue_limits) isn't 8-byte aligned?
>
> struct queue_limits contains two "unsigned long" fields, so it must be
> aligned on an "unsigned long" boundary.
>
> In order to make it future-proof, we could use __alignof__(struct
> queue_limits) to determine the size of the update step.

Yeah, that looks fine, but I feel it is still fragile, and I am not sure it
is an acceptable solution.


Thanks,
Ming
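
For illustration only, a minimal sketch of the word-wise update being
discussed could look like the code below. queue_limits_store() is a made-up
name for this example (it is not the posted patch), and it simply assumes
that sizeof(struct queue_limits) stays a multiple of sizeof(unsigned long),
which holds today because the structure contains unsigned long members:

#include <linux/blkdev.h>	/* struct queue_limits */
#include <linux/build_bug.h>
#include <linux/compiler.h>	/* WRITE_ONCE() */

/*
 * Hypothetical helper: copy the new limits into place one machine word
 * at a time with WRITE_ONCE(), so a lockless reader never observes a
 * torn word from a plain memcpy()/"rep movsb" or a cache-line-zeroing
 * store.  Readers can still see a mix of old and new words, which is
 * part of why this approach is considered fragile above.
 */
static void queue_limits_store(struct queue_limits *dst,
			       const struct queue_limits *src)
{
	unsigned long *d = (unsigned long *)dst;
	const unsigned long *s = (const unsigned long *)src;
	size_t i;

	/* Holds today because the struct has unsigned long members. */
	BUILD_BUG_ON(sizeof(struct queue_limits) % sizeof(unsigned long));

	for (i = 0; i < sizeof(struct queue_limits) / sizeof(unsigned long); i++)
		WRITE_ONCE(d[i], s[i]);
}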
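
Similarly, as a rough sketch of the "hold the limit lock" direction for
dm_set_device_limits(), the existing queue_limits_start_update() /
queue_limits_cancel_update() pair could in principle be used as a locked
reader to take a stable snapshot of the underlying queue's limits.
snapshot_bdev_limits() is just a name for this sketch, not an existing
helper, and whether this is the right fix is exactly what is being debated
above:

#include <linux/blkdev.h>

/*
 * Sketch only: take a consistent snapshot of an underlying device's
 * queue_limits under q->limits_lock by starting a limits "update" and
 * immediately cancelling it, so the copy cannot race with a writer.
 */
static struct queue_limits snapshot_bdev_limits(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);
	struct queue_limits lim;

	lim = queue_limits_start_update(q);	/* takes q->limits_lock */
	queue_limits_cancel_update(q);		/* drops the lock, no commit */

	return lim;
}

The snapshot is taken under q->limits_lock, so it cannot observe a
half-written q->limits, at the cost of serializing against limit writers.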