On 4/7/22 11:15 PM, Luís Henriques wrote:
When doing a direct/sync write, we need to invalidate the page cache in
the range being written to. If we don't do this, the cache will include
invalid data as we just did a write that avoided the page cache.
Signed-off-by: Luís Henriques <lhenriques@xxxxxxx>
---
fs/ceph/file.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
Changes since v3:
- Dropped initial call to invalidate_inode_pages2_range()
- Added extra comment to document invalidation
Changes since v2:
- Invalidation needs to be done after a write
Changes since v1:
- Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
- Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5072570c2203..97f764b2fbdd 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
return ret;
ceph_fscache_invalidate(inode, false);
- ret = invalidate_inode_pages2_range(inode->i_mapping,
- pos >> PAGE_SHIFT,
- (pos + count - 1) >> PAGE_SHIFT);
- if (ret < 0)
- dout("invalidate_inode_pages2_range returned %d\n", ret);
while ((len = iov_iter_count(from)) > 0) {
size_t left;
@@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
break;
}
ceph_clear_error_write(ci);
+
+ /*
+ * we need to invalidate the page cache here, otherwise the
+ * cache will include invalid data in direct/sync writes.
+ */
+ ret = invalidate_inode_pages2_range(
+ inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ (pos + len - 1) >> PAGE_SHIFT);
+ if (ret < 0) {
+ dout("invalidate_inode_pages2_range returned %d\n",
+ ret);
+ ret = 0;
IMO this is not safe. If we just ignore the error, the page cache will
still hold invalid data.
I think what 'ceph_direct_read_write()' does is more correct. It first
uses 'invalidate_inode_pages2_range()', without blocking, to make sure
all the dirty pages are written back from the page cache. Later, if
'invalidate_inode_pages2_range()' returned EBUSY because some pages
could not be unmapped, it does a blocking invalidation with
'truncate_inode_pages_range()'.
This guarantees that the page cache has no invalid data after the
write finishes. I think the truncate helper is used there because it's
safe, since there shouldn't be any buffered writes happening for DIO?
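To make the suggestion concrete, here is a rough userspace sketch of the
fallback I have in mind. It is not the actual code in
'ceph_direct_read_write()': 'struct fake_mapping', 'fake_invalidate_range()',
'fake_truncate_range()' and 'drop_cache_after_write()' are hypothetical
stand-ins that only model the -EBUSY behaviour of
'invalidate_inode_pages2_range()' and the unconditional drop done by
'truncate_inode_pages_range()'.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page cache range; not a kernel structure. */
struct fake_mapping {
	bool has_busy_page;	/* simulates a page that cannot be dropped */
	size_t cached_pages;	/* pages currently cached in the range */
};

/*
 * Stand-in for invalidate_inode_pages2_range(): refuses with -EBUSY
 * when a page in the range cannot be invalidated.
 */
static int fake_invalidate_range(struct fake_mapping *m)
{
	if (m->has_busy_page)
		return -EBUSY;
	m->cached_pages = 0;
	return 0;
}

/*
 * Stand-in for truncate_inode_pages_range(): always removes the pages
 * and, like the real helper, returns nothing.
 */
static void fake_truncate_range(struct fake_mapping *m)
{
	m->has_busy_page = false;
	m->cached_pages = 0;
}

/*
 * Try the non-blocking invalidation first; on -EBUSY fall back to the
 * truncate helper instead of silently ignoring the error.
 */
static int drop_cache_after_write(struct fake_mapping *m, size_t len)
{
	int ret;

	assert(len > 0);

	ret = fake_invalidate_range(m);
	if (ret == -EBUSY) {
		fake_truncate_range(m);
		ret = 0;
	}
	return ret;
}
```

The point is only that the error path ends with the range empty, rather
than with 'ret = 0' and stale pages left behind.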
But from my understanding 'ceph_direct_read_write()' is still buggy.
What if a page fault happens just after 'truncate_inode_pages_range()'?
Can this happen? Should we leave it to user space to guarantee this by
using a file lock?
Thoughts?
-- Xiubo
+ }
pos += len;
written += len;
dout("sync_write written %d\n", written);