Hi Peter,

On 7/15/2021 4:13 PM, Peter Zijlstra wrote:
> On Thu, Jul 15, 2021 at 09:01:48AM +0200, Christoph Hellwig wrote:
>> On Thu, Jul 01, 2021 at 07:35:37PM +0800, Hou Tao wrote:
>>> When running stress test on null_blk under linux-4.19.y, the following
>>> warning is reported:
>>>
>>> percpu_ref_switch_to_atomic_rcu: percpu ref (css_release) <= 0 (-3) after switching to atomic
>>> snip
>>> In __blkdev_direct_IO(), the memory order between dio->waiter and
>>> dio->bio.bi_status is not guaranteed neither. Until now it is unable to
>>> reproduce it, maybe because dio->waiter and dio->bio.bi_status are
>>> in the same cache-line. But it is better to add guarantee for memory
>>> order.
>
> Cachelines don't guarantee anything, you can get partial forwards.

Could you please point me to a reference? I cannot find anything about
memory ordering when searching for "partial forwards".

>
>>> Fixing it by using smp_load_acquire() & smp_store_release() to guarantee
>>> the order between {bio->bi_private|dio->waiter} and {bi_status|bi_blkg}.
>>>
>>> Fixes: 189ce2b9dcc3 ("block: fast-path for small and simple direct I/O requests")
>>
>> This obviously does not look broken, but smp_load_acquire /
>> smp_store_release is way beyond my paygrade. Adding some CCs.
>
> This block stuff is a bit beyond me, lets see if we can make sense of
> it.
>
>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>> ---
>>>  fs/block_dev.c | 19 +++++++++++++++----
>>>  1 file changed, 15 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index eb34f5c357cf..a602c6315b0b 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -224,7 +224,11 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>>  {
>>>  	struct task_struct *waiter = bio->bi_private;
>>>
>>> -	WRITE_ONCE(bio->bi_private, NULL);
>>> +	/*
>>> +	 * Paired with smp_load_acquire in __blkdev_direct_IO_simple()
>>> +	 * to ensure the order between bi_private and bi_xxx
>>> +	 */
>
> This comment doesn't help me; where are the other stores? Presumably
> somewhere before this is called, but how does one go about finding them?

Yes, the changelog is vague and will be corrected. The other stores happen
in req_bio_endio() and its callees when the request completes.

> The Changelog seems to suggest you only care about bi_css, not bi_xxx in
> general. In specific you can only care about stores that happen before
> this; is all of bi_xxx written before here? If not, you have to be more
> specific.

Actually we care about all of the bi_xxx fields that are written in
req_bio_endio(), and all of those writes happen before
blkdev_bio_end_io_simple() is called. I only used bi_status as one example.

> Also, this being a clear, this very much isn't the typical publish
> pattern.
>
> On top of all that, smp_wmb() would be sufficient here and would be
> cheaper on some platforms (ARM comes to mind).

Thanks for the explanation. I will use smp_wmb() instead of
smp_store_release().
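
To make sure I understood the suggestion correctly, below is a rough and
untested sketch of how the end_io side could look with smp_wmb() (based on
the hunk above; the comment wording is only a first attempt):

static void blkdev_bio_end_io_simple(struct bio *bio)
{
	struct task_struct *waiter = bio->bi_private;

	/*
	 * Order the stores to bio->bi_status and the other bi_xxx fields
	 * done during completion (req_bio_endio() and its callees) before
	 * the store that clears bi_private and publishes the completion
	 * to __blkdev_direct_IO_simple(). Pairs with the smp_rmb() on the
	 * waiting side.
	 */
	smp_wmb();
	WRITE_ONCE(bio->bi_private, NULL);
	blk_wake_io_task(waiter);
}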
>
>>> +	smp_store_release(&bio->bi_private, NULL);
>>>  	blk_wake_io_task(waiter);
>>>  }
>>>
>>> @@ -283,7 +287,8 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>>>  	qc = submit_bio(&bio);
>>>  	for (;;) {
>>>  		set_current_state(TASK_UNINTERRUPTIBLE);
>>> -		if (!READ_ONCE(bio.bi_private))
>>> +		/* Refer to comments in blkdev_bio_end_io_simple() */
>>> +		if (!smp_load_acquire(&bio.bi_private))
>>>  			break;
>>>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
>>>  		    !blk_poll(bdev_get_queue(bdev), qc, true))
>
> That comment there doesn't help me find any relevant later loads and is
> thus again inadequate.
>
> Here the purpose seems to be to ensure the bi_css load happens after the
> bi_private load, and this again is cheaper done using smp_rmb().

Yes, and thanks again. I will switch the waiting side to smp_rmb() as well
(see the sketch after my signature).

>
> Also, the implication seems to be -- but is not spelled out anywhere --
> that if bi_private is !NULL, it is stable.

What does "it is stable" mean? Do you mean that if bi_private is NULL, the
values of the bi_xxx fields must already be visible?

>>> @@ -353,7 +358,12 @@ static void blkdev_bio_end_io(struct bio *bio)
>>>  		} else {
>>>  			struct task_struct *waiter = dio->waiter;
>>>
>>> -			WRITE_ONCE(dio->waiter, NULL);
>>> +			/*
>>> +			 * Paired with smp_load_acquire() in
>>> +			 * __blkdev_direct_IO() to ensure the order between
>>> +			 * dio->waiter and bio->bi_xxx
>>> +			 */
>>> +			smp_store_release(&dio->waiter, NULL);
>>>  			blk_wake_io_task(waiter);
>>>  		}
>>>  	}
>>> @@ -478,7 +488,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>>
>>>  	for (;;) {
>>>  		set_current_state(TASK_UNINTERRUPTIBLE);
>>> -		if (!READ_ONCE(dio->waiter))
>>> +		/* Refer to comments in blkdev_bio_end_io */
>>> +		if (!smp_load_acquire(&dio->waiter))
>>>  			break;
>>>
>>>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
>
> Idem for these...

Thanks.

Regards,
Tao
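
P.S. For completeness, here is the corresponding untested sketch of the
waiting side in __blkdev_direct_IO_simple() with smp_rmb(); the dio->waiter
loop in __blkdev_direct_IO() would be changed in the same way:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!READ_ONCE(bio.bi_private)) {
			/*
			 * Pairs with the smp_wmb() in
			 * blkdev_bio_end_io_simple(): the reads of
			 * bio.bi_status and the other bi_xxx fields after
			 * the loop must not be reordered before the read
			 * that observed bi_private == NULL.
			 */
			smp_rmb();
			break;
		}
		if (!(iocb->ki_flags & IOCB_HIPRI) ||
		    !blk_poll(bdev_get_queue(bdev), qc, true))
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);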