I got a debug log from the fuse client (15.2.4), and it looks strange. I expected that if the client doesn't have Fc caps, it would not put data into the page cache; however, it seems the client does.

The client log shows that in step 4 the Reader doesn't have Fc caps, which is expected, so the Reader reads from the OSD. However, it still seems to add the data to the page cache. I don't understand why we see an Fc grant and revoke from the getattr, but it seems the revoke hasn't triggered the cache invalidation.

The kernel driver works perfectly fine.

The sequence is:

1. Writer opens file /test.

2. Writer writes 0~128 with "A". The data stays in the page cache, as the Writer has Fb caps.

2020-09-28T20:02:05.404-0700 7f2a45c7e700 10 client.197996755 get_quota_root 0x100053bc886.head -> 0x1.head
2020-09-28T20:02:05.404-0700 7f2a45c7e700 7 client.197996755 wrote to 128, extending file size
2020-09-28T20:02:05.404-0700 7f2a45c7e700 10 mark_caps_dirty 0x100053bc886.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=1,4096=1,8192=1} open={3=1} mode=100000 size=128/4194304 nlink=1 btime=2020-09-28T20:02:00.762693-0700 mtime=2020-09-28T20:02:05.406381-0700 ctime=2020-09-28T20:02:05.406381-0700 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw objectset[0x100053bc886 ts 0/0 objects 1 dirty_or_tx 128] parents=0x1.head["test_file"] 0x7f2a2aab93f0) Fw -> Fw
2020-09-28T20:02:05.404-0700 7f2a45c7e700 3 client.197996755 ll_write 0x7f2a10087130 0~128 = 128

3. Reader opens file /test. The Reader got pAsLsXsFrw on inode 0x100053bc886. On the Writer side, a lot of cap revokes and grants happened: the dirty data was flushed to the OSD, Fc and Fb were revoked, and the inode cache was invalidated. The Writer ended up with pAsLsXsFrw.

Reader:
2020-09-28T20:02:09.054-0700 7f725197e700 1 -- 10.161.98.55:0/768076306 <== mds.0 v2:10.199.116.118:6800/917715634 2079429 ==== client_caps(grant ino 0x100053bc886 8821836128 seq 7 caps=pAsLsXsFrw dirty=- wanted=pAsxXsxFsxcrwb follows 0 size 128/4194304 ts 1/18446744073709551615 mtime 2020-09-28T20:02:05.406381-0700) v11 ==== 252+0+0 (crc 0 0 0) 0x7f723c0b1e50 con 0x7f7238025830

Writer:
2020-09-28T20:02:09.048-0700 7f2a46688700 5 client.197996755 handle_cap_grant on in 0x100053bc886 mds.0 seq 9 caps now pAsLsXsFrw was pAsLsXsxFsxcrwb
2020-09-28T20:02:09.048-0700 7f2a46688700 1 -- 10.161.62.135:0/3664873675 --> [v2:10.212.0.88:6904/1387975,v1:10.212.0.88:6905/1387975] -- osd_op(unknown.0.0:7042 21.9a3 21:c597c290:::100053bc886.00000000:head [write 0~128 in=128b] snapc 1=[] ondisk+write+known_if_redirected e3854718) v8 -- 0x7f2a2897b950 con 0x7f2a340abc20
2020-09-28T20:02:09.052-0700 7f2a46e90700 1 -- 10.161.62.135:0/3664873675 <== osd.2592 v2:10.212.0.88:6904/1387975 1 ==== osd_op_reply(7042 100053bc886.00000000 [write 0~128] v3854718'1936063 uv1936063 ondisk = 0) v8 ==== 164+0+0 (crc 0 0 0) 0x7f2a380030b0 con 0x7f2a340abc20
2020-09-28T20:02:09.052-0700 7f2a4688a700 10 client.197996755 _invalidate_inode_cache 0x100053bc886.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=0,4096=0,8192=0} open={3=1} mode=100000 size=128/4194304 nlink=1 btime=0.000000 mtime=2020-09-28T20:02:05.406381-0700 ctime=2020-09-28T20:02:05.406381-0700 caps=pAsLsXsFrw(0=pAsLsXsFrw) objectset[0x100053bc886 ts 0/0 objects 1 dirty_or_tx 0] parents=0x1.head["test_file"] 0x7f2a2aab93f0)
2020-09-28T20:02:09.052-0700 7f2a4688a700 10 client.197996755 cap mds.0 issued pAsLsXsFrw implemented pAsLsXsxFsxcrwb revoking XxFsxcb
2020-09-28T20:02:09.052-0700 7f2a4688a700 10 client.197996755 completed revocation of XxFsxcb

4. Reader reads 0~128 and gets "A". It reads from the OSD, followed by a getattr as we reached EOF. The Reader got Fc caps from the getattr (why?), which were later revoked.

2020-09-28T20:02:17.478-0700 7f7250c71700 10 client.198003750 send_request client_request(unknown.0:3168408 getattr Fs #0x100053bc886 2020-09-28T20:02:17.481623-0700 caller_uid=0, caller_gid=0{0,}) v4 to mds.0
2020-09-28T20:02:17.478-0700 7f7250c71700 1 -- 10.161.98.55:0/768076306 --> [v2:10.199.116.118:6800/917715634,v1:10.199.116.118:6801/917715634] -- client_request(unknown.0:3168408 getattr Fs #0x100053bc886 2020-09-28T20:02:17.481623-0700 caller_uid=0, caller_gid=0{0,}) v4 -- 0x7f720c057d00 con 0x7f7238025830
2020-09-28T20:02:17.486-0700 7f725197e700 12 client.198003750 add_update_inode had 0x100053bc886.head(faked_ino=0 ref=4 ll_ref=1 cap_refs={2048=0} open={3=1} mode=100000 size=128/4194304 nlink=1 btime=0.000000 mtime=2020-09-28T20:02:05.406381-0700 ctime=2020-09-28T20:02:05.406381-0700 caps=pAsLsXsFr(0=pAsLsXsFr) objectset[0x100053bc886 ts 0/0 objects 0 dirty_or_tx 0] parents=0x1.head["test_file"] 0x7f72428eb8e0) caps pAsLsXsFscr
2020-09-28T20:02:17.486-0700 7f725197e700 5 client.198003750 handle_cap_grant on in 0x100053bc886 mds.0 seq 10 caps now pAsLsXsFr was pAsLsXsFscr

5. Writer writes 0~128 with "B". Nothing special; it writes directly to the OSD.

6. Reader reads 0~128 and still gets "A". The client reads from the cache; ceph-fuse does not even seem to be called.

-Xiaoxi

Jeff Layton <jlayton@xxxxxxxxxx> wrote on Tue, Sep 29, 2020 at 2:26 AM:
>
> On Sat, 2020-09-26 at 08:57 +0800, Xiaoxi Chen wrote:
> > Hi Jeff,
> >
> > Yes, Step 5 is where the issue is. Client A (Fuse) should, but does
> > not, do a synchronous read from the OSD, since the old data from Step 3
> > is still in its pagecache.
> > This might be an issue with fuse
> > (https://libfuse.github.io/doxygen/notify__inval__inode_8c.html), and the
> > kernel driver doesn't have this issue, but it would be great if you
> > could share how the kernel driver interacts with the pagecache, especially
> > without Fc.
> >
> > -Xiaoxi
> >
>
> Yeah, it seems like when you lose Fc caps, then you need to invalidate
> the pagecache. FUSE has an upcall for that, but it looks like it's done
> asynchronously. I suppose a read could race in before that happens.
>
> The right thing to do is probably to not let the FUSE client code return
> Fc caps back to the MDS until the pagecache is invalidated.
>
> In the kernel, without Fc, read() syscalls (and similar) don't go
> through the pagecache at all. ceph_read_iter/write_iter will dispatch
> I/O to the OSDs directly and the results are not cached.
>
> None of this behaves very well with mmap, btw. We sort of _have_ to go
> through the pagecache for mmap. For that, you probably ought to make
> sure you're using some sort of locking if you want to do this sort of
> I/O pattern across clients.
>
>
> > Jeff Layton <jlayton@xxxxxxxxxx> wrote on Fri, Sep 25, 2020 at 8:02 PM:
> > > I'm less familiar with the fuse client than the kernel one, but this
> > > sounds wrong.
> > >
> > > In step 5, Client A should just do a synchronous read from the OSD since
> > > it no longer has Fc caps. Why is it seeing old data? Has Client B just
> > > not yet issued the write to the OSD? If so, was Client B issued Fb
> > > caps?
> > >
> > > -- Jeff
> > >
> > > On Thu, 2020-09-24 at 15:34 +0800, Xiaoxi Chen wrote:
> > > > Could you explain why the client can add page cache later? Please
> > > > correct me where this is wrong.
> > > >
> > > > 1. Client A has page cache of file X.
> > > > 2. Client B opens X for write; it will take the write lock and the MDS
> > > > will revoke Fc from Client A, which will result in Client A dropping its
> > > > cache.
> > > > 3. Client A tries to read X. Client A can go ahead and read from the
> > > > OSD, which gets the old data. (Will Client A issue a getattr to the MDS?
> > > > Will the getattr be blocked? I see some discussion pointing to
> > > > https://github.com/ukernel/ceph/commit/7db1563416b5559310dbbc834795b83a4ccdaab4)
> > > > 4. Client B writes data.
> > > > 5. Client A still gets old data.
> > > >
> > > > -Xiaoxi
> > > >
> > > > Yan, Zheng <ukernel@xxxxxxxxx> wrote on Thu, Sep 24, 2020 at 11:50 AM:
> > > > > On Thu, Sep 24, 2020 at 11:07 AM Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
> > > > > > Hi Zheng,
> > > > > > We are seeing inconsistency among clients: once one client updates a file (by scp), some of the nodes see the new contents but some of the nodes don't. The inconsistency can last from 30 minutes to a few hours and fixes itself. I think it is because some of the nodes are not dropping the page cache properly.
> > > > > > Looking into the code, I see that when Fc is revoked, the fuse client drops the object cache, queues a task to the finisher thread to do fuse_lowlevel_notify_inval_inode, then acks the cap revoke. So it seems there is a window between the cap-revoke ack and the completion of fuse_lowlevel_notify_inval_inode; in this window the page cache is still valid and the user can read stale data. It is strange, though, that the window can be that large (there were no PG issues during the window).
> > > > > > Could you please confirm whether this is the real problem and why it is implemented this way?
> > > > > >
> > > > >
> > > > > Yes, it's a real problem. fuse_lowlevel_notify_inval_inode() does not
> > > > > prevent the client from adding page cache later. If there are multiple
> > > > > fuse clients reading/modifying the same file, you'd better set the
> > > > > fuse_disable_pagecache option to true.
> > > > >
> > > > > > -xiaoxi
> > > > _______________________________________________
> > > > Dev mailing list -- dev@xxxxxxx
> > > > To unsubscribe send an email to dev-leave@xxxxxxx
> > >
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> >
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
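
The window discussed in this thread (the Fc revoke is acked before the queued fuse_lowlevel_notify_inval_inode call has run) and the ordering Jeff suggests (do not return Fc to the MDS until the pagecache is invalidated) can be illustrated with a small model. This is only a rough sketch in Python, not ceph-fuse code: the class, method, and field names are all hypothetical, the finisher thread is modeled as a plain queue that is drained by hand, and the writer is assumed to write "B" as soon as the revoke has been acked. It does not cover Zheng's separate point that the client can add page cache again later.

```python
from collections import deque


class ToyFuseClient:
    """Toy model of one FUSE client and one inode. All names are hypothetical."""

    def __init__(self, ack_after_invalidate):
        self.ack_after_invalidate = ack_after_invalidate
        self.page_cache = "A"    # old data still held in the kernel page cache
        self.finisher = deque()  # stands in for the finisher thread's work queue
        self.acked = False       # True once the Fc revoke has been acked to the MDS

    def invalidate_page_cache(self):
        # Stands in for the fuse_lowlevel_notify_inval_inode() upcall.
        self.page_cache = None

    def ack_revoke(self):
        self.acked = True

    def handle_fc_revoke(self):
        if self.ack_after_invalidate:
            # Suggested ordering: ack only after the invalidation has run.
            def invalidate_then_ack():
                self.invalidate_page_cache()
                self.ack_revoke()
            self.finisher.append(invalidate_then_ack)
        else:
            # Ordering described in the thread: queue the invalidation, ack now.
            self.finisher.append(self.invalidate_page_cache)
            self.ack_revoke()

    def run_finisher(self):
        # Drain the queued tasks, as the finisher thread eventually would.
        while self.finisher:
            self.finisher.popleft()()

    def read(self, osd_data):
        # A populated page cache is served without going to the OSD.
        return self.page_cache if self.page_cache is not None else osd_data


def simulate(ack_after_invalidate):
    client = ToyFuseClient(ack_after_invalidate)
    osd_data = "A"
    client.handle_fc_revoke()
    if client.acked:
        osd_data = "B"  # assumption: the writer proceeds once the ack is seen
    early_read = client.read(osd_data)  # read that races in before the finisher runs
    stale_read = early_read != osd_data
    client.run_finisher()
    if client.acked:
        osd_data = "B"
    late_read = client.read(osd_data)   # read after the invalidation has completed
    return {"early_read": early_read, "stale_read": stale_read, "late_read": late_read}


if __name__ == "__main__":
    print("ack before invalidate:", simulate(False))
    print("ack after invalidate: ", simulate(True))
```

Under these assumptions, the early read is stale only when the ack is sent before the invalidation has run; in both orderings, a read issued after the finisher has drained sees the new data.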