On Mon, Jul 17, 2017 at 05:12:28PM +0200, Jan Kara wrote:
> On Fri 14-07-17 17:40:23, Lukas Czerner wrote:
> > Currently when mixing buffered reads and asynchronous direct writes it
> > is possible to end up with the situation where we have stale data in the
> > page cache while the new data is already written to disk. This is
> > permanent until the affected pages are flushed away. Despite the fact
> > that mixing buffered and direct IO is ill-advised, it does pose a threat
> > to data integrity, is unexpected and should be fixed.
> >
> > Fix this by deferring completion of asynchronous direct writes to a
> > process context in the case that there are mapped pages to be found in
> > the inode. Later, before the completion in dio_complete(), invalidate
> > the pages in question. This ensures that after the completion the pages
> > in the written area are either unmapped, or populated with up-to-date
> > data. Also do the same for the iomap case, which uses
> > iomap_dio_complete() instead.
> >
> > This has a side effect of deferring the completion to a process context
> > for every AIO DIO that happens on an inode that has pages mapped.
> > However, since the consensus is that this is an ill-advised practice,
> > the performance implications should not be a problem.
> >
> > This was based on a proposal from Jeff Moyer, thanks!
> >
> > Signed-off-by: Lukas Czerner <lczerner@xxxxxxxxxx>
> > Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
>
> OK, this looks like it could work. Some comments below.
>
> >  fs/direct-io.c | 31 ++++++++++++++++++++++++++-----
> >  fs/iomap.c     |  7 +++++++
> >  2 files changed, 33 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 08cf278..2db9ada 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -258,6 +258,11 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
> >  	if (ret == 0)
> >  		ret = transferred;
> >
> > +	if ((dio->op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages))
>
> Superfluous braces here... Also you should not call
> invalidate_inode_pages2_range() in case of error I suppose.

Sure, I'll fix the braces. About the error case, is it not possible that
some data has already been written to the disk despite the error? In case
we do want to skip the invalidation on error, I've put a sketch of what I
have in mind at the bottom of this mail.

Thanks!
-Lukas

> > +		invalidate_inode_pages2_range(dio->inode->i_mapping,
> > +				offset >> PAGE_SHIFT,
> > +				(offset + ret - 1) >> PAGE_SHIFT);
> > +
> >  	if (dio->end_io) {
> >  		int err;
> >
> > @@ -304,6 +309,7 @@ static void dio_bio_end_aio(struct bio *bio)
> >  	struct dio *dio = bio->bi_private;
> >  	unsigned long remaining;
> >  	unsigned long flags;
> > +	bool defer_completion = false;
> >
> >  	/* cleanup the bio */
> >  	dio_bio_complete(dio, bio);
> > @@ -315,7 +321,19 @@ static void dio_bio_end_aio(struct bio *bio)
> >  	spin_unlock_irqrestore(&dio->bio_lock, flags);
> >
> >  	if (remaining == 0) {
> > -		if (dio->result && dio->defer_completion) {
> > +		/*
> > +		 * Defer completion when defer_completion is set or
> > +		 * when the inode has pages mapped and this is an AIO write.
> > +		 * We need to invalidate those pages because there is a
> > +		 * chance they contain stale data in the case buffered IO
> > +		 * went in between AIO submission and completion into the
> > +		 * same region.
> > +		 */
> > +		if (dio->result)
> > +			defer_completion = dio->defer_completion ||
> > +					(dio->op == REQ_OP_WRITE &&
> > +					 dio->inode->i_mapping->nrpages);
> > +		if (defer_completion) {
> >  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
> >  			queue_work(dio->inode->i_sb->s_dio_done_wq,
> >  				   &dio->complete_work);
> > @@ -1210,10 +1228,13 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> >  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
> >  	 * so that we can call ->fsync.
> >  	 */
> > -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> > -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> > -		retval = dio_set_defer_completion(dio);
> > +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> > +		retval = 0;
> > +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> > +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> > +			retval = dio_set_defer_completion(dio);
> > +		else if (!dio->inode->i_sb->s_dio_done_wq)
> > +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
>
> Please add a comment explaining why sb_init_dio_done_wq() is needed here.

ok, thanks. I've put a draft of the comment at the bottom of this mail.

> >  		if (retval) {
> >  			/*
> >  			 * We grab i_mutex only for reads so we don't have
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 1732228..3baeed2 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -713,8 +713,15 @@ struct iomap_dio {
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  {
> >  	struct kiocb *iocb = dio->iocb;
> > +	loff_t offset = iocb->ki_pos;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> >  	ssize_t ret;
> >
> > +	if ((dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages)
>
> Again I don't think you want to invalidate pages in case DIO failed with
> an error...
>
> > +		invalidate_inode_pages2_range(inode->i_mapping,
> > +				offset >> PAGE_SHIFT,
> > +				(offset + dio->size - 1) >> PAGE_SHIFT);
> > +
> >  	if (dio->end_io) {
> >  		ret = dio->end_io(iocb,
> >  				dio->error ? dio->error : dio->size,
>
> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
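
P.S. For the error case above: if we do decide the invalidation should be
skipped on error, here is a rough, untested sketch of the guard I have in
mind for dio_complete() (assuming ret at this point holds either the
negative error code or the number of bytes transferred, as in the patch):

	/*
	 * Try to invalidate cache pages for the range we are writing.
	 * Skip the invalidation when the DIO failed (ret < 0) or was
	 * empty (ret == 0); on a short write ret still tells us how
	 * many bytes actually made it to disk, so the range below only
	 * covers data that really reached the disk.
	 */
	if (ret > 0 && dio->op == REQ_OP_WRITE &&
	    dio->inode->i_mapping->nrpages)
		invalidate_inode_pages2_range(dio->inode->i_mapping,
				offset >> PAGE_SHIFT,
				(offset + ret - 1) >> PAGE_SHIFT);

The equivalent check in iomap_dio_complete() would presumably be
!dio->error && dio->size > 0 in addition to the IOMAP_DIO_WRITE test.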
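
And a first draft of the comment for the sb_init_dio_done_wq() call,
wording suggestions welcome:

	/*
	 * The decision whether to defer completion is now made per bio
	 * in dio_bio_end_aio(), based on i_mapping->nrpages at
	 * completion time. That runs in bio completion (interrupt)
	 * context, where we cannot create the workqueue, so make sure
	 * it exists before any AIO write is submitted.
	 */
	else if (!dio->inode->i_sb->s_dio_done_wq)
		retval = sb_init_dio_done_wq(dio->inode->i_sb);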