Xiubo Li <xiubli@xxxxxxxxxx> writes:

> On 4/6/22 9:41 PM, Jeff Layton wrote:
>> On Wed, 2022-04-06 at 21:10 +0800, Xiubo Li wrote:
>>> On 4/6/22 7:48 PM, Jeff Layton wrote:
>>>> On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
>>>>> Xiubo Li <xiubli@xxxxxxxxxx> writes:
>>>>>
>>>>>> On 4/6/22 6:57 PM, Luís Henriques wrote:
>>>>>>> Xiubo Li <xiubli@xxxxxxxxxx> writes:
>>>>>>>
>>>>>>>> On 4/1/22 9:32 PM, Luís Henriques wrote:
>>>>>>>>> When doing DIO on an encrypted inode, we need to invalidate the
>>>>>>>>> page cache in the range being written to, otherwise the cache
>>>>>>>>> will include invalid data.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Luís Henriques <lhenriques@xxxxxxx>
>>>>>>>>> ---
>>>>>>>>>  fs/ceph/file.c | 11 ++++++++++-
>>>>>>>>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> Changes since v1:
>>>>>>>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range()
>>>>>>>>> - Call fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>>>>>>>
>>>>>>>>> Note: I'm not really sure this last change is required; it doesn't
>>>>>>>>> really affect the generic/647 result, but it seems to be the most
>>>>>>>>> correct.
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>>>>>> index 5072570c2203..b2743c342305 100644
>>>>>>>>> --- a/fs/ceph/file.c
>>>>>>>>> +++ b/fs/ceph/file.c
>>>>>>>>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>>>>>>>  	if (ret < 0)
>>>>>>>>>  		return ret;
>>>>>>>>> -	ceph_fscache_invalidate(inode, false);
>>>>>>>>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>>>>>>>>  	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>>>>>>>  					    pos >> PAGE_SHIFT,
>>>>>>>>>  					    (pos + count - 1) >> PAGE_SHIFT);
>>>>>>>> The above has already invalidated the pages, why doesn't it work?
>>>>>>> I suspect the reason is that later on we loop through the number of
>>>>>>> pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
>>>>>> I checked copy_page_from_iter(): it will kmap the pages but will
>>>>>> kunmap them again later, and it shouldn't update i_mapping, if I
>>>>>> didn't miss something important.
>>>>>>
>>>>>> ceph_fscrypt_encrypt_pages() will encrypt/decrypt the contents in
>>>>>> place; IMO if it needs to map the page it should also unmap it,
>>>>>> just like copy_page_from_iter() does.
>>>>>>
>>>>>> I thought it could possibly be the RMW case, which may update
>>>>>> i_mapping when reading contents, but I checked the code and didn't
>>>>>> find any place doing this. So I am wondering where the page cache
>>>>>> pages come from. If those pages really come from reading the
>>>>>> contents, then shouldn't we discard them instead of flushing them
>>>>>> back?
>>>>>>
>>>>>> BTW, what's the problem without this fix? An xfstest failure?
>>>>> Yes, generic/647 fails if you run it with test_dummy_encryption. And
>>>>> I've also checked that the RMW code was never executed in this test.
>>>>>
>>>>> But yeah, I have assumed (perhaps wrongly) that the kmap/kunmap
>>>>> could change the inode->i_mapping.
>>>>>
>>>> No, kmap/kunmap are all about high memory and 32-bit architectures.
>>>> Those functions are usually no-ops on 64-bit arches.
>>> Yeah, right.
>>>
>>> So they do nothing here.
>>>
>>>>> In my debugging this seemed to be the case for the O_DIRECT path.
>>>>> That's why I added this extra call here.
>>>>>
>>>> I agree with Xiubo that we really shouldn't need to invalidate
>>>> multiple times.
>>>>
>>>> I guess in this test, we have a DIO write racing with an mmap read.
>>>> Probably what's happening is either that we can't invalidate the
>>>> page because it needs to be cleaned, or the mmap read is racing in
>>>> just after the invalidate occurs but before writeback.
>>> This sounds like a possible case.
>>>
>>>> In any case, it might be interesting to see whether you're getting
>>>> -EBUSY back from the new invalidate_inode_pages2 calls with your
>>>> patch.
>>>>
>>> If it's really this case, maybe this should be retried somewhere?
>>>
>> Possibly, or we may need to implement ->launder_folio.
>>
>> Either way, we need to understand what's happening first and then we
>> can figure out a solution for it.
>
> Yeah, makes sense.
>

OK, so here's what I got so far:

When we run this test *without* test_dummy_encryption,
ceph_direct_read_write() will be called and
invalidate_inode_pages2_range() will do pretty much nothing because the
mapping will be empty (mapping_empty(inode->i_mapping) will return 1).
If we use encryption, ceph_sync_write() will be called instead and the
mapping, obviously, will be empty as well.

The difference between encrypted and non-encrypted (and the reason the
test passes without encryption) is that ceph_direct_read_write()
(non-encrypted) will call truncate_inode_pages_range() at a stage where
the mapping is no longer empty (iter_get_bvecs_alloc() will take care of
that).

In the encryption path (ceph_sync_write()) the mapping will be filled by
copy_page_from_iter(), which will fault and do the read. Because we
don't have the truncate_inode_pages_range() call, the cache will contain
invalid data after the write. And that's why the extra
invalidate_inode_pages2_range() (or truncate_...) fixes this.

Cheers,
--
Luís