On Tue, Mar 31, 2020 at 10:56 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Tue, 2020-03-31 at 07:00 -0700, Gregory Farnum wrote:
> > On Tue, Mar 31, 2020 at 6:49 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > On Tue, Mar 31, 2020 at 6:52 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > Jan noticed some long stalls when flushing caps using sync() after
> > > > doing small file creates. For instance, running this:
> > > >
> > > > $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
> > > >
> > > > Could take more than 90s in some cases. The sync() will flush out caps,
> > > > but doesn't tell the MDS that it's waiting synchronously on the
> > > > replies.
> > > >
> > > > When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> > > > CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> > > > into that fact and it can then expedite the reply.
> > > >
> > > > URL: https://tracker.ceph.com/issues/44744
> > > > Reported-and-Tested-by: Jan Fajerski <jfajerski@xxxxxxxx>
> > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > ---
> > > >  fs/ceph/caps.c | 7 +++++--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index 61808793e0c0..6403178f2376 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> > > >
> > > >                 mds = cap->mds;  /* remember mds, so we don't repeat */
> > > >
> > > > -               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> > > > -                          retain, flushing, flush_tid, oldest_flush_tid);
> > > > +               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> > > > +                          (flags & CHECK_CAPS_FLUSH) ?
> > > > +                          CEPH_CLIENT_CAPS_SYNC : 0,
> > > > +                          cap_used, want, retain, flushing, flush_tid,
> > > > +                          oldest_flush_tid);
> > > >                 spin_unlock(&ci->i_ceph_lock);
> > > >
> > > this is too expensive for syncfs case. mds needs to flush journal for
> > > each dirty inode. we'd better to track dirty inodes by session, and
> > > only set the flag when flushing the last inode in session dirty list.
>
> I think this will be more difficult than that...
> >
> > Yeah, see the userspace Client::_sync_fs() where we have an internal
> > flags argument which is set on the last cap in the dirty set and tells
> > the actual cap message flushing code to set FLAG_SYNC on the
> > MClientCaps message. I presume the kernel is operating on a similar
> > principle here?
>
> Not today, but we need it to.
>
> The caps are not tracked on a per-session basis (as Zheng points out),
> and the locking and ordering of these requests is not as straightforward
> as it is in the (trivial) libcephfs case. Fixing this will be a lot more
> invasive than I had originally hoped.
>
> It's also not 100% clear to me how we'll gauge which cap will be
> "last". As we move the last cap on the session's list from dirty to
> flushing state, we can mark it so that it sets the SYNC flag when it
> goes out, but what happens if we have a process that is actively
> dirtying other inodes during this? You might never see the per-session
> list go empty.
>

It's not necessary to be strict 'last'. just the one before exiting the loop

> It may also go empty for reasons that have nothing to do with issuing a
> sync(), (or fsync() or...) so we don't want to universally set this flag
> in that case.
>
> I'm not sure what the right solution is yet.
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
>
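
To make the idea in Zheng's reply concrete (set the sync flag only on the
flush issued just before the walk of the session's dirty list exits, rather
than trying to identify a strictly "last" inode), here is a small standalone
C sketch. It is purely illustrative: the structures and helpers in it
(struct session, struct dirty_inode, send_cap_flush, FLAG_SYNC) are invented
for this example and are not the kernel's or libcephfs's real types.

/*
 * Standalone model of "flag only the flush sent before exiting the loop".
 * Build with: cc -o flushdemo flushdemo.c
 * Everything here is made up for illustration.
 */
#include <stdio.h>

#define FLAG_SYNC 0x1	/* ask the server to expedite this flush */

struct dirty_inode {
	unsigned long ino;
	struct dirty_inode *next;
};

struct session {
	struct dirty_inode *dirty_list;	/* per-session list of dirty inodes */
};

/* Pretend to send a cap flush for one inode; just print what we'd do. */
static void send_cap_flush(unsigned long ino, unsigned int flags)
{
	printf("flush ino %lu%s\n", ino,
	       (flags & FLAG_SYNC) ? " [SYNC]" : "");
}

/*
 * Walk the session's dirty list as it looked when the sync was requested.
 * Inodes dirtied after the walk starts are not this sync's problem, which
 * is why "strictly last" doesn't matter, only "last before we exit".
 */
static void flush_session_dirty_caps(struct session *s)
{
	struct dirty_inode *di = s->dirty_list;

	while (di) {
		struct dirty_inode *next = di->next;

		/* Only the final flush issued by this walk carries the hint. */
		send_cap_flush(di->ino, next ? 0 : FLAG_SYNC);
		di = next;
	}
}

int main(void)
{
	struct dirty_inode c = { .ino = 3, .next = NULL };
	struct dirty_inode b = { .ino = 2, .next = &c };
	struct dirty_inode a = { .ino = 1, .next = &b };
	struct session s = { .dirty_list = &a };

	flush_session_dirty_caps(&s);
	return 0;
}

Compiling and running this prints one line per flushed inode, and only the
flush sent on the final loop iteration is marked [SYNC]. Mapping that onto
the kernel would still require the per-session dirty tracking and locking
work discussed above; this sketch only shows the flag-placement idea.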