On Wed, Oct 14, 2020 at 04:31:49PM +0800, JeffleXu wrote:
> What about just disabling HIPRI in preadv2(2)/pwritev2(2)? Christoph
> Hellwig disabled HIPRI for libaio in

Then people will complain that poll can't be used for sync dio, and it
is a regression.

> commit 154989e45fd8de9bfb52bbd6e5ea763e437e54c5 ("aio: clear IOCB_HIPRI").
> What do you think, @Christoph?
>
> (cc Christoph Hellwig)
>
> > static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
> > {
> > -	bio->bi_opf |= REQ_HIPRI;
> > -	if (!is_sync_kiocb(kiocb))
> > -		bio->bi_opf |= REQ_NOWAIT;
> > +	bio->bi_opf |= REQ_HIPRI | REQ_NOWAIT;
> > }
>
> The original patch indeed could not fix the problem. Though it could
> fix the potential deadlock, the VFS read(2)/write(2) code is not ready
> to handle the returned -EAGAIN gracefully. Currently read(2)/write(2)
> will just return -EAGAIN to user space.
>
> On 10/13/20 8:09 PM, Ming Lei wrote:
> > On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
> > > Sync polling also needs the REQ_NOWAIT flag. One sync read/write
> > > may be split into several bios (and thus several requests), and
> > > can use up the queue depth sometimes. Thus the following bio in
> > > the same sync read/write will wait for a usable request if the
> > > REQ_NOWAIT flag is not set, in which case the following sync
> > > polling will cause a deadlock.
> > >
> > > One case (maybe the only case) for the above situation is
> > > preadv2/pwritev2 + direct + highpri. Two conditions need to be
> > > satisfied to trigger the deadlock.
> > >
> > > 1. HIPRI IO in the sync routine. Normal
> > > read(2)/pread(2)/readv(2)/preadv(2) and the corresponding write
> > > family syscalls don't support high-priority IO and thus won't
> > > trigger the polling routine. Only preadv2(2)/pwritev2(2) supports
> > > high-priority IO, via the RWF_HIPRI flag of the @flags parameter.
> > >
> > > 2. Polling support in the sync routine. Currently both the blkdev
> > > and iomap-based fs (ext4/xfs, etc.) support polling in the direct
> > > IO routine. The general routine is described as follows.
> > >
> > >     submit_bio
> > >         wait in blk_mq_get_tag() for request completion, which
> > >         should be done by the following polling, thus causing a
> > >         deadlock.
> >
> > Another blocking point is rq_qos_throttle(),
>
> What is the issue here in rq_qos_throttle()? More details?

Like request allocation, rq_qos_throttle() may wait until rq completion
is done; see wbt_wait() and wbt_done().

> > so I guess falling back to REQ_NOWAIT may not fix the issue
> > completely.
> >
> > Given iopoll isn't supposed to be used in case of big IO, another
> > solution may be to disable iopoll when bio splitting is needed,
> > something like the following change:
> >
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bcf5e4580603..8e762215660b 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -279,6 +279,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> >  		return NULL;
> >  split:
> >  	*segs = nsegs;
> > +
> > +	/*
> > +	 * bio splitting may cause more trouble for iopoll which isn't
> > +	 * supposed to be used in case of big IO
> > +	 */
> > +	bio->bi_opf &= ~REQ_HIPRI;
> >  	return bio_split(bio, sectors, GFP_NOIO, bs);
> >  }
>
> Actually splitting does not only come from
> blk_mq_submit_bio->__blk_queue_split. In __blkdev_direct_IO, one input
> iov_iter can be split into several bios.
>
> ```
> __blkdev_direct_IO:
>
> for (;;) {
> 	ret = bio_iov_iter_get_pages(bio, iter);
> 	submit_bio(bio);
> }
>
> for (;;) {
> 	blk_poll()
> 	...
> }
> ```
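For concreteness, a minimal userspace sketch of the multi-bio case being
discussed: one preadv2(2) call with RWF_HIPRI on an O_DIRECT block device
fd, with more iovecs than BIO_MAX_PAGES so that __blkdev_direct_IO() has
to build more than one bio before the polling loop runs. The device path,
iovec count, and buffer size below are illustrative assumptions, not
values taken from the thread:

```
/*
 * Sketch only: drive the preadv2(2) + RWF_HIPRI + O_DIRECT path with
 * more than BIO_MAX_PAGES (256) one-page iovecs, so __blkdev_direct_IO()
 * should build more than one bio for a single syscall.
 * "/dev/nvme0n1", NR_IOVS and BUF_SIZE are illustrative assumptions.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define NR_IOVS  512	/* > BIO_MAX_PAGES, so several bios are needed */
#define BUF_SIZE 4096	/* one page per iovec, O_DIRECT-aligned */

int main(void)
{
	struct iovec iov[NR_IOVS];
	ssize_t ret;
	int fd, i;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < NR_IOVS; i++) {
		if (posix_memalign(&iov[i].iov_base, BUF_SIZE, BUF_SIZE))
			return 1;
		iov[i].iov_len = BUF_SIZE;
	}

	/* RWF_HIPRI sends the sync dio path into the blk_poll() loop */
	ret = preadv2(fd, iov, NR_IOVS, 0, RWF_HIPRI);
	printf("preadv2 returned %zd\n", ret);

	close(fd);
	return 0;
}
```

With 512 one-page iovecs and BIO_MAX_PAGES at 256, this should end up as
two bios, enough to exercise the multi-bio polling path, though not
enough on its own to exhaust a queue depth of 4.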
> Since a single bio can contain at most BIO_MAX_PAGES, i.e. 256
> bio_vecs in @bio->bi_io_vec,

As Jens mentioned, it is weird to poll on multiple bios, so we can
simply disable io poll in __blkdev_direct_IO(). And it is reasonable
from the user's viewpoint not to poll on a big bio, because io poll is
often used in latency-sensitive cases.

> if the @iovcnt parameter of preadv2(2)/pwritev2(2) is larger than 256,
> then one call of preadv2(2)/pwritev2(2) can be split into several
> bios. These bios are submitted at once, and then sync polling starts
> in the process context.
>
> If the number of bios split from one call of preadv2(2)/pwritev2(2) is
> larger than the queue depth, bios from a single preadv2(2)/pwritev2(2)
> call can exhaust the queue depth and thus cause a deadlock.
> Fortunately the maximum of the @iovcnt parameter of
> preadv2(2)/pwritev2(2) is UIO_MAXIOV, i.e. 1024, and the minimum queue
> depth is BLKDEV_MIN_RQ, i.e. 4. That means one preadv2(2)/pwritev2(2)
> call can submit at most 4 bios, which will fill up the queue depth
> exactly, so there's no deadlock in this case. I'm not sure if the
> setting of UIO_MAXIOV/BIO_MAX_PAGES/BLKDEV_MIN_RQ is coincidental or
> deliberately tuned. At least it will not cause a deadlock currently,
> though the constraint may be a little fragile.
>
> By the way, this patch could fix the potential hang I mentioned in
> https://patchwork.kernel.org/project/linux-block/patch/20200911032958.125068-1-jefflexu@xxxxxxxxxxxxxxxxx/

Right, I remember that race.

Thanks,
Ming
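A quick standalone check of the arithmetic in the quoted paragraph
above; the constant values are copied from the thread rather than from
kernel headers, so treat them as assumptions of this sketch:

```
/*
 * Check of the "at most 4 bios per syscall" reasoning: UIO_MAXIOV
 * one-page iovecs packed into bios of BIO_MAX_PAGES bio_vecs each,
 * compared against the minimum queue depth. Constants are the values
 * quoted in the thread, not read from kernel headers.
 */
#include <stdio.h>

#define UIO_MAXIOV	1024	/* max @iovcnt for preadv2(2)/pwritev2(2) */
#define BIO_MAX_PAGES	256	/* max bio_vecs in a single bio */
#define BLKDEV_MIN_RQ	4	/* minimum queue depth */

int main(void)
{
	int max_bios = UIO_MAXIOV / BIO_MAX_PAGES;	/* 1024 / 256 = 4 */

	printf("max bios from one syscall: %d\n", max_bios);
	/* a deadlock needs a bio left waiting for a tag while we poll */
	printf("can exceed minimum queue depth: %s\n",
	       max_bios > BLKDEV_MIN_RQ ? "yes" : "no");
	return 0;
}
```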