On Thu, Jul 15, 2021 at 09:01:48AM +0200, Christoph Hellwig wrote: > On Thu, Jul 01, 2021 at 07:35:37PM +0800, Hou Tao wrote: > > When running stress test on null_blk under linux-4.19.y, the following > > warning is reported: > > > > percpu_ref_switch_to_atomic_rcu: percpu ref (css_release) <= 0 (-3) after switching to atomic > > > > The cause is that css_put() is invoked twice on the same bio as shown below: > > > > CPU 1: CPU 2: > > > > // IO completion kworker // IO submit thread > > __blkdev_direct_IO_simple > > submit_bio > > > > bio_endio > > bio_uninit(bio) > > css_put(bi_css) > > bi_css = NULL > > set_current_state(TASK_UNINTERRUPTIBLE) > > bio->bi_end_io > > blkdev_bio_end_io_simple > > bio->bi_private = NULL > > // bi_private is NULL > > READ_ONCE(bio->bi_private) > > wake_up_process > > smp_mb__after_spinlock > > > > bio_unint(bio) > > // read bi_css as no-NULL > > // so call css_put() again > > css_put(bi_css) > > > > Because there is no memory barriers between the reading and the writing of > > bi_private and bi_css, so reading bi_private as NULL can not guarantee > > bi_css will also be NULL on weak-memory model host (e.g, ARM64). > > > > For the latest kernel source, css_put() has been removed from bio_unint(), > > but the memory-order problem still exists, because the order between > > bio->bi_private and {bi_status|bi_blkg} is also assumed in > > __blkdev_direct_IO_simple(). It is reproducible that > > __blkdev_direct_IO_simple() may read bi_status as 0 event if > > bi_status is set as an errno in req_bio_endio(). > > > > In __blkdev_direct_IO(), the memory order between dio->waiter and > > dio->bio.bi_status is not guaranteed neither. Until now it is unable to > > reproduce it, maybe because dio->waiter and dio->bio.bi_status are > > in the same cache-line. But it is better to add guarantee for memory > > order. Cachelines don't guarantee anything, you can get partial forwards. > > Fixing it by using smp_load_acquire() & smp_store_release() to guarantee > > the order between {bio->bi_private|dio->waiter} and {bi_status|bi_blkg}. > > > > Fixes: 189ce2b9dcc3 ("block: fast-path for small and simple direct I/O requests") > > This obviously does not look broken, but smp_load_acquire / > smp_store_release is way beyond my paygrade. Adding some CCs. This block stuff is a bit beyond me, lets see if we can make sense of it. > > Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx> > > --- > > fs/block_dev.c | 19 +++++++++++++++---- > > 1 file changed, 15 insertions(+), 4 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index eb34f5c357cf..a602c6315b0b 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -224,7 +224,11 @@ static void blkdev_bio_end_io_simple(struct bio *bio) > > { > > struct task_struct *waiter = bio->bi_private; > > > > - WRITE_ONCE(bio->bi_private, NULL); > > + /* > > + * Paired with smp_load_acquire in __blkdev_direct_IO_simple() > > + * to ensure the order between bi_private and bi_xxx > > + */ This comment doesn't help me; where are the other stores? Presumably somewhere before this is called, but how does one go about finding them? The Changelog seems to suggest you only care about bi_css, not bi_xxx in general. In specific you can only care about stores that happen before this; is all of bi_xxx written before here? If not, you have to be more specific. Also, this being a clear, this very much isn't the typical publish pattern. On top of all that, smp_wmb() would be sufficient here and would be cheaper on some platforms (ARM comes to mind). > > + smp_store_release(&bio->bi_private, NULL); > > blk_wake_io_task(waiter); > > } > > > > @@ -283,7 +287,8 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > > qc = submit_bio(&bio); > > for (;;) { > > set_current_state(TASK_UNINTERRUPTIBLE); > > - if (!READ_ONCE(bio.bi_private)) > > + /* Refer to comments in blkdev_bio_end_io_simple() */ > > + if (!smp_load_acquire(&bio.bi_private)) > > break; > > if (!(iocb->ki_flags & IOCB_HIPRI) || > > !blk_poll(bdev_get_queue(bdev), qc, true)) That comment there doesn't help me find any relevant later loads and is thus again inadequate. Here the purpose seems to be to ensure the bi_css load happens after the bi_private load, and this again is cheaper done using smp_rmb(). Also, the implication seems to be -- but is not spelled out anywhere -- that if bi_private is !NULL, it is stable. > > @@ -353,7 +358,12 @@ static void blkdev_bio_end_io(struct bio *bio) > > } else { > > struct task_struct *waiter = dio->waiter; > > > > - WRITE_ONCE(dio->waiter, NULL); > > + /* > > + * Paired with smp_load_acquire() in > > + * __blkdev_direct_IO() to ensure the order between > > + * dio->waiter and bio->bi_xxx > > + */ > > + smp_store_release(&dio->waiter, NULL); > > blk_wake_io_task(waiter); > > } > > } > > @@ -478,7 +488,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, > > > > for (;;) { > > set_current_state(TASK_UNINTERRUPTIBLE); > > - if (!READ_ONCE(dio->waiter)) > > + /* Refer to comments in blkdev_bio_end_io */ > > + if (!smp_load_acquire(&dio->waiter)) > > break; > > > > if (!(iocb->ki_flags & IOCB_HIPRI) || Idem for these...