Re: Fc cap revoke in Ceph-fuse


I got a debug log from ceph-fuse (15.2.4), and it looks strange. I was
expecting that if the client doesn't have Fc caps, it will not put data
into the pagecache; however, it seems the client does.

The client log shows that in Step 4 the Reader doesn't have Fc caps, which
is expected, so the Reader reads from the OSD. However, the data still seems
to be added to the page cache. I don't understand why we see an Fc grant
and revoke from the getattr, but it seems the revoke hasn't triggered the
cache invalidation.
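
For reading the cap strings in the logs that follow, a small helper can make
grants and revokes explicit. This is only a sketch: parse_caps and diff_caps
are hypothetical names, not part of Ceph, and the parsing is purely syntactic
(an uppercase group letter followed by its lowercase bits, plus a leading "p").

```python
import re


def parse_caps(cap_string):
    """Split a cap string such as 'pAsLsXsFscr' into {'p', 'As', ..., 'Fc', 'Fr'}."""
    caps = set()
    if cap_string in ("", "-"):
        return caps
    # A leading lowercase 'p' stands alone; it has no group letter.
    if cap_string.startswith("p"):
        caps.add("p")
    # Every uppercase letter opens a group; the lowercase letters after it
    # are the individual bits of that group (e.g. 'Fscr' -> Fs, Fc, Fr).
    for group, bits in re.findall(r"([A-Z])([a-z]*)", cap_string):
        caps.update(group + bit for bit in bits)
    return caps


def diff_caps(old, new):
    """Return (granted, revoked) between two cap strings."""
    old_caps, new_caps = parse_caps(old), parse_caps(new)
    return new_caps - old_caps, old_caps - new_caps


if __name__ == "__main__":
    # Strings copied from the Reader's handle_cap_grant line in Step 4.
    granted, revoked = diff_caps("pAsLsXsFscr", "pAsLsXsFr")
    print("granted:", sorted(granted), "revoked:", sorted(revoked))
```

Applied to the Writer's handle_cap_grant line in Step 3, the revoked set
matches the "revoking XxFsxcb" reported by the client; applied to the Reader's
line in Step 4, it shows Fs and Fc being dropped.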

The kernel driver works perfectly fine.

The sequence is:
1. Writer opens file /test
2. Writer writes 0~128 with "A"
The data stays in the pagecache, as the Writer has Fb caps

2020-09-28T20:02:05.404-0700 7f2a45c7e700 10 client.197996755
get_quota_root 0x100053bc886.head -> 0x1.head
2020-09-28T20:02:05.404-0700 7f2a45c7e700  7 client.197996755 wrote to
128, extending file size
2020-09-28T20:02:05.404-0700 7f2a45c7e700 10 mark_caps_dirty
0x100053bc886.head(faked_ino=0 ref=6 ll_ref=1
cap_refs={4=0,1024=1,4096=1,8192=1} open={3=1} mode=100000
size=128/4194304 nlink=1 btime=2020-09-28T20:02:00.762693-0700
mtime=2020-09-28T20:02:05.406381-0700
ctime=2020-09-28T20:02:05.406381-0700
caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw
objectset[0x100053bc886 ts 0/0 objects 1 dirty_or_tx 128]
parents=0x1.head["test_file"] 0x7f2a2aab93f0) Fw -> Fw
2020-09-28T20:02:05.404-0700 7f2a45c7e700  3 client.197996755 ll_write
0x7f2a10087130 0~128 = 128

3. Reader opens file /test. The Reader got pAsLsXsFrw on inode
0x100053bc886. On the Writer side, a lot of cap revokes and grants
happened: the dirty data was flushed to the OSD, Fc and Fb were revoked,
and the inode cache was invalidated. The Writer ends up with pAsLsXsFrw.

Reader:
2020-09-28T20:02:09.054-0700 7f725197e700  1 --
10.161.98.55:0/768076306 <== mds.0 v2:10.199.116.118:6800/917715634
2079429 ==== client_caps(grant ino 0x100053bc886 8821836128 seq 7
caps=pAsLsXsFrw dirty=- wanted=pAsxXsxFsxcrwb follows 0 size
128/4194304 ts 1/18446744073709551615 mtime
2020-09-28T20:02:05.406381-0700) v11 ==== 252+0+0 (crc 0 0 0)
0x7f723c0b1e50 con 0x7f7238025830

Writer:
2020-09-28T20:02:09.048-0700 7f2a46688700  5 client.197996755
handle_cap_grant on in 0x100053bc886 mds.0 seq 9 caps now pAsLsXsFrw
was pAsLsXsxFsxcrwb

2020-09-28T20:02:09.048-0700 7f2a46688700  1 --
10.161.62.135:0/3664873675 -->
[v2:10.212.0.88:6904/1387975,v1:10.212.0.88:6905/1387975] --
osd_op(unknown.0.0:7042 21.9a3 21:c597c290:::100053bc886.00000000:head
[write 0~128 in=128b] snapc 1=[] ondisk+write+known_if_redirected
e3854718) v8 -- 0x7f2a2897b950 con 0x7f2a340abc20

2020-09-28T20:02:09.052-0700 7f2a46e90700  1 --
10.161.62.135:0/3664873675 <== osd.2592 v2:10.212.0.88:6904/1387975 1
==== osd_op_reply(7042 100053bc886.00000000 [write 0~128]
v3854718'1936063 uv1936063 ondisk = 0) v8 ==== 164+0+0 (crc 0 0 0)
0x7f2a380030b0 con 0x7f2a340abc20

2020-09-28T20:02:09.052-0700 7f2a4688a700 10 client.197996755
_invalidate_inode_cache 0x100053bc886.head(faked_ino=0 ref=6 ll_ref=1
cap_refs={4=0,1024=0,4096=0,8192=0} open={3=1} mode=100000
size=128/4194304 nlink=1 btime=0.000000
mtime=2020-09-28T20:02:05.406381-0700
ctime=2020-09-28T20:02:05.406381-0700 caps=pAsLsXsFrw(0=pAsLsXsFrw)
objectset[0x100053bc886 ts 0/0 objects 1 dirty_or_tx 0]
parents=0x1.head["test_file"] 0x7f2a2aab93f0)

2020-09-28T20:02:09.052-0700 7f2a4688a700 10 client.197996755  cap
mds.0 issued pAsLsXsFrw implemented pAsLsXsxFsxcrwb revoking XxFsxcb

2020-09-28T20:02:09.052-0700 7f2a4688a700 10 client.197996755
completed revocation of XxFsxcb

4. Reader reads 0~128 and gets "A".
It reads from the OSD, followed by a getattr as we reached EOF. The Reader
got Fc caps from the getattr (why?), which were later revoked.

2020-09-28T20:02:17.478-0700 7f7250c71700 10 client.198003750
send_request client_request(unknown.0:3168408 getattr Fs
#0x100053bc886 2020-09-28T20:02:17.481623-0700 caller_uid=0,
caller_gid=0{0,}) v4 to mds.0

2020-09-28T20:02:17.478-0700 7f7250c71700  1 --
10.161.98.55:0/768076306 -->
[v2:10.199.116.118:6800/917715634,v1:10.199.116.118:6801/917715634] --
client_request(unknown.0:3168408 getattr Fs #0x100053bc886
2020-09-28T20:02:17.481623-0700 caller_uid=0, caller_gid=0{0,}) v4 --
0x7f720c057d00 con 0x7f7238025830

2020-09-28T20:02:17.486-0700 7f725197e700 12 client.198003750
add_update_inode had 0x100053bc886.head(faked_ino=0 ref=4 ll_ref=1
cap_refs={2048=0} open={3=1} mode=100000 size=128/4194304 nlink=1
btime=0.000000 mtime=2020-09-28T20:02:05.406381-0700
ctime=2020-09-28T20:02:05.406381-0700 caps=pAsLsXsFr(0=pAsLsXsFr)
objectset[0x100053bc886 ts 0/0 objects 0 dirty_or_tx 0]
parents=0x1.head["test_file"] 0x7f72428eb8e0) caps pAsLsXsFscr

2020-09-28T20:02:17.486-0700 7f725197e700  5 client.198003750
handle_cap_grant on in 0x100053bc886 mds.0 seq 10 caps now pAsLsXsFr
was pAsLsXsFscr

5. Writer writes 0~128 with "B"
Nothing special; it writes directly to the OSD.

6. Reader reads 0~128 and still gets "A"
The client reads from the cache; ceph-fuse does not even seem to be called.
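
To make the sequence above concrete, here is a toy model of steps 3 to 6.
It is not Ceph or FUSE code: the Reader class, the cache_without_fc flag and
the dict standing in for the OSD are all made up for illustration. The flag
selects between what I observe with ceph-fuse (the read result lands in the
page cache even without Fc) and what I expected (no caching without Fc).

```python
class Reader:
    """Toy reader client; 'pagecache' stands in for the kernel page cache."""

    def __init__(self, osd, cache_without_fc):
        self.osd = osd                      # dict: file name -> bytes
        self.has_fc = False                 # the Reader holds no Fc in this scenario
        self.cache_without_fc = cache_without_fc
        self.pagecache = None
        self.daemon_calls = 0               # how often the userspace client is asked

    def read(self, name):
        # A page cache hit is served without calling the userspace client.
        if self.pagecache is not None:
            return self.pagecache
        self.daemon_calls += 1
        data = self.osd[name]               # synchronous read from the OSD
        if self.has_fc or self.cache_without_fc:
            self.pagecache = data           # read result is kept in the cache
        return data


def run(cache_without_fc):
    osd = {"test": b"A" * 128}              # steps 1-3: "A" has been flushed to the OSD
    reader = Reader(osd, cache_without_fc)
    first = reader.read("test")             # step 4
    osd["test"] = b"B" * 128                # step 5: Writer writes directly to the OSD
    second = reader.read("test")            # step 6
    return first[:1], second[:1], reader.daemon_calls


if __name__ == "__main__":
    print("observed with ceph-fuse:", run(cache_without_fc=True))
    print("expected without Fc:    ", run(cache_without_fc=False))
```

In the first run the second read never reaches the userspace client, which is
the same symptom as in step 6; in the second run both reads go to the OSD and
the Reader sees "B".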



-Xiaoxi
Jeff Layton <jlayton@xxxxxxxxxx> wrote on Tue, Sep 29, 2020 at 2:26 AM:
>
> On Sat, 2020-09-26 at 08:57 +0800, Xiaoxi Chen wrote:
> > Hi Jeff,
> >
> >        Yes, Step 5 is where the issue is.  Client A (FUSE) should, but
> > does not, do a synchronous read from the OSD, since the old data from
> > Step 3 is still in its pagecache.  This might be an issue in FUSE
> > (https://libfuse.github.io/doxygen/notify__inval__inode_8c.html), and
> > the kernel driver doesn't have this issue, but it would be great if you
> > could share how the kernel driver interacts with the pagecache,
> > especially without Fc.
> >
> > -Xiaoxi
> >
>
> Yeah, it seems like when you lose Fc caps, then you need to invalidate
> the pagecache. FUSE has an upcall for that, but it looks like it's done
> asynchronously. I suppose a read could race in before that happens.
>
> The right thing to do is probably to not let the FUSE client code return
> Fc caps back to the MDS until the pagecache is invalidated.
>
> In the kernel, without Fc, read() syscalls (and similar) don't go
> through the pagecache at all. ceph_read_iter/write_iter will dispatch
> I/O to the OSDs directly and the results are not cached.
>
> None of this behaves very well with mmap, btw. We sort of _have_ to go
> through the pagecache for mmap. For that, you probably ought to make
> sure you're using some sort of locking if you want to do this sort of
> I/O pattern across clients.
>
>
> > Jeff Layton <jlayton@xxxxxxxxxx> wrote on Fri, Sep 25, 2020 at 8:02 PM:
> > > I'm less familiar with the fuse client than the kernel one, but this
> > > sounds wrong.
> > >
> > > In step 5, Client A should just do a synchronous read from the OSD since
> > > it no longer has Fc caps. Why is it seeing old data? Has Client B just
> > > not yet issued the write to the OSD? If so, was Client B issued Fb
> > > caps?
> > >
> > > -- Jeff
> > >
> > > On Thu, 2020-09-24 at 15:34 +0800, Xiaoxi Chen wrote:
> > > > Could you explain why the client can add to the page cache later?
> > > > Please correct me where I am wrong.
> > > >
> > > >      1. Client A has file X in its page cache.
> > > >      2. Client B opens X for write; it takes the write lock and the
> > > > MDS revokes Fc from Client A, which results in Client A dropping its
> > > > cache.
> > > >      3. Client A tries to read X. Client A can go ahead and read from
> > > > the OSD, which returns the old data. (Will Client A issue a getattr to
> > > > the MDS? Will the getattr be blocked? I see some discussion pointing to
> > > > https://github.com/ukernel/ceph/commit/7db1563416b5559310dbbc834795b83a4ccdaab4)
> > > >      4. Client B writes data.
> > > >      5. Client A still gets the old data.
> > > >
> > > > -Xiaoxi
> > > >
> > > > Yan, Zheng <ukernel@xxxxxxxxx> wrote on Thu, Sep 24, 2020 at 11:50 AM:
> > > > > On Thu, Sep 24, 2020 at 11:07 AM Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
> > > > > > Hi Zheng,
> > > > > >       We are seeing inconsistency among clients: once one client updates a file (by scp), some of the nodes see the new contents but some don't.   The inconsistency can last from 30 minutes to a few hours and fixes itself.  I think it is because some of the nodes are not dropping the page cache properly.
> > > > > >      Looking into the code, I see that when the Fc cap is revoked, the fuse client drops the object cache, queues a task to the finisher thread to call fuse_lowlevel_notify_inval_inode, then acks the cap revoke.  So there seems to be a window between the cap-revoke ack and the completion of fuse_lowlevel_notify_inval_inode; in this window the page cache is still valid and the user can read stale data.    Though it is strange that the window can be that large (no PG issues during the window).
> > > > > >      Could you please confirm whether this is the real problem and why it is implemented this way?
> > > > > >
> > > > >
> > > > > Yes, it's a real problem.  fuse_lowlevel_notify_inval_inode() does
> > > > > not prevent the client from adding to the page cache later. If
> > > > > multiple fuse clients read/modify the same file, you'd better set
> > > > > the fuse_disable_pagecache option to true.
> > > > >
> > > > > > -xiaoxi
> > > > _______________________________________________
> > > > Dev mailing list -- dev@xxxxxxx
> > > > To unsubscribe send an email to dev-leave@xxxxxxx
> > >
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> > >
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>



