Re: [PATCH v4] ceph: invalidate pages when doing direct/sync writes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
> On 4/7/22 11:15 PM, Luís Henriques wrote:
> > When doing a direct/sync write, we need to invalidate the page cache in
> > the range being written to.  If we don't do this, the cache will include
> > invalid data as we just did a write that avoided the page cache.
> > 
> > Signed-off-by: Luís Henriques <lhenriques@xxxxxxx>
> > ---
> >   fs/ceph/file.c | 19 ++++++++++++++-----
> >   1 file changed, 14 insertions(+), 5 deletions(-)
> > 
> > Changes since v3:
> > - Dropped initial call to invalidate_inode_pages2_range()
> > - Added extra comment to document invalidation
> > 
> > Changes since v2:
> > - Invalidation needs to be done after a write
> > 
> > Changes since v1:
> > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > 
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 5072570c2203..97f764b2fbdd 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> >   		return ret;
> >   
> >   	ceph_fscache_invalidate(inode, false);
> > -	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > -					    pos >> PAGE_SHIFT,
> > -					    (pos + count - 1) >> PAGE_SHIFT);
> > -	if (ret < 0)
> > -		dout("invalidate_inode_pages2_range returned %d\n", ret);
> >   
> >   	while ((len = iov_iter_count(from)) > 0) {
> >   		size_t left;
> > @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> >   			break;
> >   		}
> >   		ceph_clear_error_write(ci);
> > +
> > +		/*
> > +		 * we need to invalidate the page cache here, otherwise the
> > +		 * cache will include invalid data in direct/sync writes.
> > +		 */
> > +		ret = invalidate_inode_pages2_range(
> 
> IMO we'd better use truncate_inode_pages_range() after write. The above 
> means it's possibly will write the dirty pagecache back, which will 
> overwrite and corrupt the disk data just wrote.
> 

I disagree. We call filemap_write_and_wait_range at the start of this,
so any data that was dirty when we called write() will be written back
before the sync write.

If we truncate the range, then we'll potentially lose writes that came
in after write was issued but before truncate_inode_pages_range. I think
we'd rather let what we just wrote be clobbered in this situation than
lose a write altogether.

All of this is somewhat academic though. If you're mixing buffered and
direct writes like this without some sort of locking, then you're just
asking for trouble. The aim here is "sane behavior to the best of our
ability", but we can't expect it to always be sane when people do insane
things. ;)

> Though it seems impossible that these pagecaches will be marked dirty, 
> but this call is misleading ?
> 

Not impossible at all. You can open a file O_DIRECT and then mmap the fd
for PROT_WRITE (or just open the file a second time and do it).

We definitely recommend against mixing buffered and direct I/O, but
nothing really prevents someone from doing it. If the user is properly
using file locking, then there's really no reason it shouldn't work.

> 
> > +				inode->i_mapping,
> > +				pos >> PAGE_SHIFT,
> > +				(pos + len - 1) >> PAGE_SHIFT);
> > +		if (ret < 0) {
> > +			dout("invalidate_inode_pages2_range returned %d\n",
> > +			     ret);
> > +			ret = 0;
> > +		}
> >   		pos += len;
> >   		written += len;
> >   		dout("sync_write written %d\n", written);
> > 
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux