On 11/26/20 12:04 AM, Christoph Hellwig wrote: > > On Wed, Nov 25, 2020 at 09:43:35PM +0800, Jeffle Xu wrote: >> iopoll is initially for small size, latency sensitive IO. It doesn't >> work well for big IO, especially when it needs to be split to multiple >> bios. In this case, the returned cookie of __submit_bio_noacct_mq() is >> indeed the cookie of the last split bio. The completion of *this* last >> split bio done by iopoll doesn't mean the whole original bio has >> completed. Callers of iopoll still need to wait for completion of other >> split bios. >> >> Besides bio splitting may cause more trouble for iopoll which isn't >> supposed to be used in case of big IO. >> >> iopoll for split bio may cause potential race if CPU migration happens >> during bio submission. Since the returned cookie is that of the last >> split bio, polling on the corresponding hardware queue doesn't help >> complete other split bios, if these split bios are enqueued into >> different hardware queues. Since interrupts are disabled for polling >> queues, the completion of these other split bios depends on timeout >> mechanism, thus causing a potential hang. >> >> iopoll for split bio may also cause hang for sync polling. Currently >> both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling >> in direct IO routine. These routines will submit bio without REQ_NOWAIT >> flag set, and then start sync polling in current process context. The >> process may hang in blk_mq_get_tag() if the submitted bio has to be >> split into multiple bios and can rapidly exhaust the queue depth. The >> process are waiting for the completion of the previously allocated >> requests, which should be reaped by the following polling, and thus >> causing a deadlock. >> >> To avoid these subtle trouble described above, just disable iopoll for >> split bio and return BLK_QC_T_NONE in this case. The side effect is that >> non-HIPRI IO also returns BLK_QC_T_NONE now. It should be acceptable >> since the returned cookie is never used for non-HIPRI IO. >> >> Suggested-by: Ming Lei <ming.lei@xxxxxxxxxx> >> Signed-off-by: Jeffle Xu <jefflexu@xxxxxxxxxxxxxxxxx> >> --- >> block/blk-merge.c | 8 ++++++++ >> block/blk-mq.c | 6 +++++- >> 2 files changed, 13 insertions(+), 1 deletion(-) >> >> diff --git a/block/blk-merge.c b/block/blk-merge.c >> index bcf5e4580603..8a2f1fb7bb16 100644 >> --- a/block/blk-merge.c >> +++ b/block/blk-merge.c >> @@ -279,6 +279,14 @@ static struct bio *blk_bio_segment_split(struct request_queue *q, >> return NULL; >> split: >> *segs = nsegs; >> + >> + /* >> + * Bio splitting may cause subtle trouble such as hang when doing sync >> + * iopoll in direct IO routine. Given performance gain of iopoll for >> + * big IO can be trival, disable iopoll when split needed. >> + */ >> + bio->bi_opf &= ~REQ_HIPRI; >> + >> return bio_split(bio, sectors, GFP_NOIO, bs); >> } >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 55bcee5dc032..4acd702575cd 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -2265,7 +2265,11 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio) >> blk_mq_sched_insert_request(rq, false, true, true); >> } >> >> - return cookie; >> + if (bio->bi_opf & REQ_HIPRI) >> + return cookie; >> + >> + return BLK_QC_T_NONE; > > This can cause a use after free if the I/O completed very quickly. I > think we need to check the flag after the split, but before submitting > the request to the hardware. > Oops... Thanks, I'll send a new version later. -- Thanks, Jeffle