On Fri, Jul 21, 2023 at 09:10:57PM -0600, Jens Axboe wrote:
> On 7/21/23 3:43 PM, Dave Chinner wrote:
> > On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
> >> Polled IO is only allowed for conditions where task completion is safe
> >> anyway, so we can always complete it inline. This cannot easily be
> >> checked with a submission side flag, as the block layer may clear the
> >> polled flag and turn it into a regular IO instead. Hence we need to
> >> check this at completion time. If REQ_POLLED is still set, then we know
> >> that this IO was successfully polled, and is completing in task context.
> >>
> >> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> >> ---
> >>  fs/iomap/direct-io.c | 14 ++++++++++++--
> >>  1 file changed, 12 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> >> index 9f97d0d03724..c3ea1839628f 100644
> >> --- a/fs/iomap/direct-io.c
> >> +++ b/fs/iomap/direct-io.c
> >> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
> >>  	}
> >>
> >>  	/*
> >> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> >> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
> >> +	 * Ditto for polled requests - if the flag is still set at completion
> >> +	 * time, then we know the request was actually polled and completion
> >> +	 * is called from the task itself. This is why we need to check it
> >> +	 * here rather than flag it at issue time.
> >>  	 */
> >> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
> >> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> >
> > This still smells wrong to me. Let me see if I can work out why...
> >
> > <spelunk!>
> >
> > When we set up the IO in iomap_dio_bio_iter(), we do this:
> >
> > 	/*
> > 	 * We can only poll for single bio I/Os.
> > 	 */
> > 	if (need_zeroout ||
> > 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> > 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> >
> > The "need_zeroout" covers writes into unwritten regions that require
> > conversion at IO completion, and the latter check is for writes
> > extending EOF, i.e. this covers the cases where we have dirty
> > metadata for this specific write and so may need transactions or
> > journal/metadata IO during IO completion.
> >
> > The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
> > that may require a call to generic_write_sync() in completion. That
> > is, if we aren't using FUA, we will not have IOMAP_DIO_INLINE_COMP
> > set, but will still do polled IO.
> >
> > I think this is a bug. We don't want to be issuing more IO in
> > REQ_POLLED task context during IO completion, and O_DSYNC IO
> > completion for non-FUA IO requires a journal flush, and that can
> > issue lots of journal IO and wait on it in the completion process.
> >
> > Hence I think we should only be setting REQ_POLLED in the cases
> > where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set. If
> > IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
> > context we are in at completion time or whether REQ_POLLED was set
> > or cleared during the IO....
> >
> > That means the above check should be:
> >
> > 	/*
> > 	 * We can only poll for single bio I/Os that can run inline
> > 	 * completion.
> > 	 */
> > 	if (need_zeroout ||
> > 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
> > 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> > 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>
> Looks like you are right, it would not be a great idea to handle that
> off polled IO completion. It'd work just fine, but anything generating
> more IO should go to a helper. I'll make that change.
>
> > or, if we change the logic such that we calculate IOMAP_DIO_INLINE_COMP
> > first:
> >
> > 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
> > 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> >
> > Then we don't need to care about polled IO on the completion side at
> > all at the iomap layer because it doesn't change the completion
> > requirements at all...
>
> That still isn't true, because you can still happily issue as polled IO
> and get it cleared and now have an IRQ based completion. This would work
> for most cases, but e.g. the xfs dio ->end_io handler will grab:
>
> 	spin_lock(&ip->i_flags_lock);
>
> if the inode got truncated. Maybe that can't happen because we did
> inode_dio_begin() higher up?

Yes. Truncate, hole punch, etc. block on inode_dio_wait() with the
i_rwsem held, which means they block new DIO submissions and wait for
all DIO that was in flight before the truncate operation starts to
complete. inode_dio_end() does not get called until after the
filesystem ->end_io completion has run, so there's no possibility of
truncate-like operations actually racing with DIO completion at
all...
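To make that ordering concrete, it looks roughly like this (a sketch
based on the in-tree inode_dio_wait()/inode_dio_begin()/inode_dio_end()
helpers, not an exact quote of the code):

	/* truncate/hole punch side, i_rwsem held exclusively: */
	inode_lock(inode);
	inode_dio_wait(inode);		/* sleeps until i_dio_count == 0 */
	/* ...only now is it safe to remove extents or shrink i_size... */
	inode_unlock(inode);

	/* DIO side: */
	inode_dio_begin(inode);		/* __iomap_dio_rw(), i_rwsem held */
	...
	/* iomap_dio_complete(): */
	if (dops && dops->end_io)
		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
	...
	inode_dio_end(inode);		/* drop i_dio_count, wake waiters */

i.e. any inode the ->end_io handler can see is pinned against truncate
until inode_dio_end() runs.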
> Still seems saner to check for the polled flag at completion to me...

I disagree. If truncate (or anything else that removes extents or
reduces the inode size) is running whilst DIO to that range is still
in progress, we have a use-after-free situation that will cause data
and/or filesystem corruption. It's just not a safe thing to allow, so
we prevent it from occurring at a high level in the filesystem, and
the result is that the low level IO code just doesn't need to care
about races with layout/size changing operations...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx