On Fri, 2022-04-08 at 03:24 +0800, Xiubo Li wrote: > On 4/8/22 3:16 AM, Jeff Layton wrote: > > On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote: > > > On 4/7/22 11:15 PM, Luís Henriques wrote: > > > > When doing a direct/sync write, we need to invalidate the page cache in > > > > the range being written to. If we don't do this, the cache will include > > > > invalid data as we just did a write that avoided the page cache. > > > > > > > > Signed-off-by: Luís Henriques <lhenriques@xxxxxxx> > > > > --- > > > > fs/ceph/file.c | 19 ++++++++++++++----- > > > > 1 file changed, 14 insertions(+), 5 deletions(-) > > > > > > > > Changes since v3: > > > > - Dropped initial call to invalidate_inode_pages2_range() > > > > - Added extra comment to document invalidation > > > > > > > > Changes since v2: > > > > - Invalidation needs to be done after a write > > > > > > > > Changes since v1: > > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range > > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO > > > > > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > > > > index 5072570c2203..97f764b2fbdd 100644 > > > > --- a/fs/ceph/file.c > > > > +++ b/fs/ceph/file.c > > > > @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos, > > > > return ret; > > > > > > > > ceph_fscache_invalidate(inode, false); > > > > - ret = invalidate_inode_pages2_range(inode->i_mapping, > > > > - pos >> PAGE_SHIFT, > > > > - (pos + count - 1) >> PAGE_SHIFT); > > > > - if (ret < 0) > > > > - dout("invalidate_inode_pages2_range returned %d\n", ret); > > > > > > > > while ((len = iov_iter_count(from)) > 0) { > > > > size_t left; > > > > @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos, > > > > break; > > > > } > > > > ceph_clear_error_write(ci); > > > > + > > > > + /* > > > > + * we need to invalidate the page cache here, otherwise the > > > > + * cache will include invalid data in direct/sync writes. > > > > + */ > > > > + ret = invalidate_inode_pages2_range( > > > IMO we'd better use truncate_inode_pages_range() after write. The above > > > means it's possibly will write the dirty pagecache back, which will > > > overwrite and corrupt the disk data just wrote. > > > > > I disagree. We call filemap_write_and_wait_range at the start of this, > > so any data that was dirty when we called write() will be written back > > before the sync write. > > > > If we truncate the range, then we'll potentially lose writes that came > > in after write was issued but before truncate_inode_pages_range. I think > > we'd rather let what we just wrote be clobbered in this situation than > > lose a write altogether. > > > > All of this is somewhat academic though. If you're mixing buffered and > > direct writes like this without some sort of locking, then you're just > > asking for trouble. The aim here is "sane behavior to the best of our > > ability", but we can't expect it to always be sane when people do insane > > things. ;) > > Just in the case Luis hit. Before writing the new data the mapping > happen when reading the src in copy_from_usr(). So once the writing done > the pagecache is caching the stale contents. > Not just in that case. You could have 2 unrelated processes, one doing DIO writes and one doing mmap writes. You're likely to end up with a mess unless you're very careful with what you're doing, but there should be some expectation that it will work if you serialize things correctly and/or have them writing to their own areas of the file, etc. In any case, we'll never get perfect cache coherency, and I figure that until the write returns, what's in the pagecache ought to be considered valid. > > > Though it seems impossible that these pagecaches will be marked dirty, > > > but this call is misleading ? > > > > > Not impossible at all. You can open a file O_DIRECT and then mmap the fd > > for PROT_WRITE (or just open the file a second time and do it). > > > > We definitely recommend against mixing buffered and direct I/O, but > > nothing really prevents someone from doing it. If the user is properly > > using file locking, then there's really no reason it shouldn't work. > > > > > > + inode->i_mapping, > > > > + pos >> PAGE_SHIFT, > > > > + (pos + len - 1) >> PAGE_SHIFT); > > > > + if (ret < 0) { > > > > + dout("invalidate_inode_pages2_range returned %d\n", > > > > + ret); > > > > + ret = 0; > > > > + } > > > > pos += len; > > > > written += len; > > > > dout("sync_write written %d\n", written); > > > > > -- Jeff Layton <jlayton@xxxxxxxxxx>