On Mon, Jul 17, 2017 at 05:12:28PM +0200, Jan Kara wrote:
> On Fri 14-07-17 17:40:23, Lukas Czerner wrote:
> > Currently when mixing buffered reads and asynchronous direct writes it
> > is possible to end up with the situation where we have stale data in the
> > page cache while the new data is already written to disk. This is
> > permanent until the affected pages are flushed away. Despite the fact
> > that mixing buffered and direct IO is ill-advised, it does pose a threat
> > to data integrity, is unexpected and should be fixed.
> >
> > Fix this by deferring completion of asynchronous direct writes to a
> > process context in the case that there are mapped pages to be found in
> > the inode. Later, before the completion in dio_complete(), invalidate
> > the pages in question. This ensures that after the completion the pages
> > in the written area are either unmapped, or populated with up-to-date
> > data. Also do the same for the iomap case, which uses
> > iomap_dio_complete() instead.
> >
> > This has a side effect of deferring the completion to a process context
> > for every AIO DIO that happens on an inode that has pages mapped.
> > However, since the consensus is that this is an ill-advised practice,
> > the performance implications should not be a problem.
> >
> > This was based on a proposal from Jeff Moyer, thanks!
> >
> > Signed-off-by: Lukas Czerner <lczerner@xxxxxxxxxx>
> > Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
>
> OK, this looks like it could work. Some comments below.
>
> >  fs/direct-io.c | 31 ++++++++++++++++++++++++++-----
> >  fs/iomap.c     |  7 +++++++
> >  2 files changed, 33 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 08cf278..2db9ada 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
> >  	if (ret == 0)
> >  		ret = transferred;
> >
> > +	if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
>
> Superfluous braces here... Also you should not call
> invalidate_inode_pages2_range() in case of error I suppose.

Sure, I'll fix the braces. About the error case, is it not possible that
some data has already been written to the disk despite the error? In case
we do want to skip the invalidation on error, I've put a sketch of what I
have in mind at the bottom of this mail.

Thanks!
-Lukas

> > +		invalidate_inode_pages2_range(dio->inode->i_mapping,
> > +				offset >> PAGE_SHIFT,
> > +				(offset + ret - 1) >> PAGE_SHIFT);
> > +
> >  	if (dio->end_io) {
> >  		int err;
> >
> > @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio)
> >  	struct dio *dio = bio->bi_private;
> >  	unsigned long remaining;
> >  	unsigned long flags;
> > +	bool defer_completion = false;
> >
> >  	/* cleanup the bio */
> >  	dio_bio_complete(dio, bio);
> > @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio)
> >  	spin_unlock_irqrestore(&dio->bio_lock, flags);
> >
> >  	if (remaining == 0) {
> > -		if (dio->result && dio->defer_completion) {
> > +		/*
> > +		 * Defer completion when defer_completion is set or
> > +		 * when the inode has pages mapped and this is an AIO write.
> > +		 * We need to invalidate those pages because there is a
> > +		 * chance they contain stale data in the case buffered IO
> > +		 * went in between AIO submission and completion into the
> > +		 * same region.
> > +		 */
> > +		if (dio->result)
> > +			defer_completion = dio->defer_completion ||
> > +					(dio->op == REQ_OP_WRITE &&
> > +					 dio->inode->i_mapping->nrpages);
> > +		if (defer_completion) {
> >  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
> >  			queue_work(dio->inode->i_sb->s_dio_done_wq,
> >  				   &dio->complete_work);
> > @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> >  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
> >  	 * so that we can call ->fsync.
> >  	 */
> > -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> > -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> > -		retval = dio_set_defer_completion(dio);
> > +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> > +		retval = 0;
> > +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> > +			retval = dio_set_defer_completion(dio);
> > +		else if (!dio->inode->i_sb->s_dio_done_wq)
> > +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
>
> Please add a comment explaining why sb_init_dio_done_wq() is needed here.

ok, thanks. I've put a draft of the comment at the bottom of this mail.

> >  		if (retval) {
> >  			/*
> >  			 * We grab i_mutex only for reads so we don't have
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 1732228..3baeed2 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -713,8 +713,15 @@ struct iomap_dio {
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  {
> >  	struct kiocb *iocb = dio->iocb;
> > +	loff_t offset = iocb->ki_pos;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> >  	ssize_t ret;
> >
> > +	if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
>
> Again I don't think you want to invalidate pages in case DIO failed with
> an error...
>
> > +		invalidate_inode_pages2_range(inode->i_mapping,
> > +				offset >> PAGE_SHIFT,
> > +				(offset + dio->size - 1) >> PAGE_SHIFT);
> > +
> >  	if (dio->end_io) {
> >  		ret = dio->end_io(iocb,
> >  				dio->error ? dio->error : dio->size,
>
> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
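
P.S. For the error case above: if we do decide the invalidation should be
skipped on error, here is a rough, untested sketch of the guard I have in
mind for dio_complete() (assuming ret at this point holds either the
negative error code or the number of bytes transferred, as in the patch):

	/*
	 * Try to invalidate cache pages for the range we are writing.
	 * Skip the invalidation when the DIO failed (ret < 0) or was
	 * empty (ret == 0); on a short write ret still tells us how
	 * many bytes actually made it to disk, so the range below only
	 * covers data that really reached the disk.
	 */
	if (ret > 0 && dio->op == REQ_OP_WRITE &&
	    dio->inode->i_mapping->nrpages)
		invalidate_inode_pages2_range(dio->inode->i_mapping,
				offset >> PAGE_SHIFT,
				(offset + ret - 1) >> PAGE_SHIFT);

The equivalent check in iomap_dio_complete() would presumably be
!dio->error && dio->size > 0 in addition to the IOMAP_DIO_WRITE test.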
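
And a first draft of the comment for the sb_init_dio_done_wq() call,
wording suggestions welcome:

	/*
	 * The decision whether to defer completion is now made per bio
	 * in dio_bio_end_aio(), based on i_mapping->nrpages at
	 * completion time. That runs in bio completion (interrupt)
	 * context, where we cannot create the workqueue, so make sure
	 * it exists before any AIO write is submitted.
	 */
	else if (!dio->inode->i_sb->s_dio_done_wq)
		retval = sb_init_dio_done_wq(dio->inode->i_sb);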