On Fri 14-07-17 17:40:23, Lukas Czerner wrote: > Currently when mixing buffered reads and asynchronous direct writes it > is possible to end up with the situation where we have stale data in the > page cache while the new data is already written to disk. This is > permanent until the affected pages are flushed away. Despite the fact > that mixing buffered and direct IO is ill-advised it does pose a thread > for a data integrity, is unexpected and should be fixed. > > Fix this by deferring completion of asynchronous direct writes to a > process context in the case that there are mapped pages to be found in > the inode. Later before the completion in dio_complete() invalidate > the pages in question. This ensures that after the completion the pages > in the written area are either unmapped, or populated with up-to-date > data. Also do the same for the iomap case which uses > iomap_dio_complete() instead. > > This has a side effect of deferring the completion to a process context > for every AIO DIO that happens on inode that has pages mapped. However > since the consensus is that this is ill-advised practice the performance > implication should not be a problem. > > This was based on proposal from Jeff Moyer, thanks! > > Signed-off-by: Lukas Czerner <lczerner@xxxxxxxxxx> > Cc: Jeff Moyer <jmoyer@xxxxxxxxxx> OK, this looks like it could work. Some comments below. > fs/direct-io.c | 31 ++++++++++++++++++++++++++----- > fs/iomap.c | 7 +++++++ > 2 files changed, 33 insertions(+), 5 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 08cf278..2db9ada 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async) > if (ret == 0) > ret = transferred; > > + if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages)) Superfluous braces here... Also you should not call invalidate_inode_pages2_range() in case of error I suppose. > + invalidate_inode_pages2_range(dio->inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + ret - 1) >> PAGE_SHIFT); > + > if (dio->end_io) { > int err; > > @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio) > struct dio *dio = bio->bi_private; > unsigned long remaining; > unsigned long flags; > + bool defer_completion = false; > > /* cleanup the bio */ > dio_bio_complete(dio, bio); > @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio) > spin_unlock_irqrestore(&dio->bio_lock, flags); > > if (remaining == 0) { > - if (dio->result && dio->defer_completion) { > + /* > + * Defer completion when defer_completion is set or > + * when the inode has pages mapped and this is AIO write. > + * We need to invalidate those pages because there is a > + * chance they contain stale data in the case buffered IO > + * went in between AIO submission and completion into the > + * same region. > + */ > + if (dio->result) > + defer_completion = dio->defer_completion || > + (dio->op == REQ_OP_WRITE && > + dio->inode->i_mapping->nrpages); > + if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > * For AIO O_(D)SYNC writes we need to defer completions to a workqueue > * so that we can call ->fsync. > */ > - if (dio->is_async && iov_iter_rw(iter) == WRITE && > - ((iocb->ki_filp->f_flags & O_DSYNC) || > - IS_SYNC(iocb->ki_filp->f_mapping->host))) { > - retval = dio_set_defer_completion(dio); > + if (dio->is_async && iov_iter_rw(iter) == WRITE) { > + retval = 0; > + if ((iocb->ki_filp->f_flags & O_DSYNC) || > + IS_SYNC(iocb->ki_filp->f_mapping->host)) > + retval = dio_set_defer_completion(dio); > + else if (!dio->inode->i_sb->s_dio_done_wq) > + retval = sb_init_dio_done_wq(dio->inode->i_sb); Please add a comment explaining why sb_init_dio_done_wq() is needed here. > if (retval) { > /* > * We grab i_mutex only for reads so we don't have > diff --git a/fs/iomap.c b/fs/iomap.c > index 1732228..3baeed2 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -713,8 +713,15 @@ struct iomap_dio { > static ssize_t iomap_dio_complete(struct iomap_dio *dio) > { > struct kiocb *iocb = dio->iocb; > + loff_t offset = iocb->ki_pos; > + struct inode *inode = file_inode(iocb->ki_filp); > ssize_t ret; > > + if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) Again I don't think you want to invalidate pages in case DIO failed with an error... > + invalidate_inode_pages2_range(inode->i_mapping, > + offset >> PAGE_SHIFT, > + (offset + dio->size - 1) >> PAGE_SHIFT); > + > if (dio->end_io) { > ret = dio->end_io(iocb, > dio->error ? dio->error : dio->size, Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR