On Wed, Apr 13, 2022 at 09:51:13AM -0700, Christoph Hellwig wrote: > On Wed, Apr 13, 2022 at 04:48:05PM +0800, Ming Lei wrote: > > If the bio is marked as POLLED after submission, > > Umm, a bio should never be marked POLLED after submission. I meant its POLLED flag is still kept after submission since either bio split or the submission check code may clear it. > > > bio_poll() should be > > called for reaping io since there isn't irq for completing the request, > > so we can't call into blk_io_schedule() in case that bio_poll() returns > > zero. > > > > Also for calling bio_poll(), current->plug has to be flushed out, > > otherwise bio may not be issued to driver, and ->bi_cookie won't > > be set. > > I think we just need to bypass the plug to start with for synchronous > polled I/O. That should work, something like the following. But I guess it may hurt performance on io_uring workload, which does flush plug explicitly. diff --git a/block/blk-merge.c b/block/blk-merge.c index 7771dacc99cb..f15dee40ed02 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -1040,6 +1040,9 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, struct blk_plug *plug; struct request *rq; + if (blk_bypass_plug(bio)) + return false; + plug = blk_mq_plug(q, bio); if (!plug || rq_list_empty(plug->mq_list)) return false; diff --git a/block/blk-mq.c b/block/blk-mq.c index ed3ed86f7dd2..aa6b1e6b6d8c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2851,7 +2851,7 @@ void blk_mq_submit_bio(struct bio *bio) return; } - if (plug) + if (plug && !blk_bypass_plug(bio)) blk_add_rq_to_plug(plug, rq); else if ((rq->rq_flags & RQF_ELV) || (rq->mq_hctx->dispatch_busy && diff --git a/block/blk.h b/block/blk.h index 8ccbc6e07636..65b3e0bac322 100644 --- a/block/blk.h +++ b/block/blk.h @@ -507,4 +507,12 @@ static inline int req_ref_read(struct request *req) return atomic_read(&req->ref); } +static inline bool blk_bypass_plug(struct bio *bio) +{ + if (op_is_sync(bio->bi_opf) && (bio->bi_opf & REQ_POLLED)) + return true; + + return false; +} + #endif /* BLK_INTERNAL_H */ > > Do you have a reproducer? We'd also need to sort all this out for > polled file system I/O as well. Yeah, we should cover swap_readpage()/__iomap_dio_rw(), and the issue can be reproduced pretty easily, so not sure if there is real users of sync polled dio, and the issue is found by Changhui in RH lab test. fio --bs=4k --ioengine=pvsync2 --norandommap --hipri=1 \ --filename=$DEV --direct=1 --runtime=10 --numjobs=1 --rw=rw \ --name=test --group_reporting $DEV can be nvme/null_blk, both can be triggered reliably & easily. I'd suggest to fix it in __blkdev_direct_IO_simple()/swap_readpage()/__iomap_dio_rw() for avoiding to hurt io_uring. Thanks, Ming