Re: [PATCH] ceph: request expedited service when flushing caps

On Wed, 2020-04-01 at 19:04 +0800, Yan, Zheng wrote:
> On Tue, Mar 31, 2020 at 10:56 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Tue, 2020-03-31 at 07:00 -0700, Gregory Farnum wrote:
> > > On Tue, Mar 31, 2020 at 6:49 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > > On Tue, Mar 31, 2020 at 6:52 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > Jan noticed some long stalls when flushing caps using sync() after
> > > > > doing small file creates. For instance, running this:
> > > > > 
> > > > >     $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
> > > > > 
> > > > > Could take more than 90s in some cases. The sync() will flush out caps,
> > > > > but doesn't tell the MDS that it's waiting synchronously on the
> > > > > replies.
> > > > > 
> > > > > When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> > > > > CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> > > > > into that fact and it can then expedite the reply.
> > > > > 
> > > > > URL: https://tracker.ceph.com/issues/44744
> > > > > Reported-and-Tested-by: Jan Fajerski <jfajerski@xxxxxxxx>
> > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/ceph/caps.c | 7 +++++--
> > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > index 61808793e0c0..6403178f2376 100644
> > > > > --- a/fs/ceph/caps.c
> > > > > +++ b/fs/ceph/caps.c
> > > > > @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> > > > > 
> > > > >                 mds = cap->mds;  /* remember mds, so we don't repeat */
> > > > > 
> > > > > -               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> > > > > -                          retain, flushing, flush_tid, oldest_flush_tid);
> > > > > +               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> > > > > +                          (flags & CHECK_CAPS_FLUSH) ?
> > > > > +                           CEPH_CLIENT_CAPS_SYNC : 0,
> > > > > +                          cap_used, want, retain, flushing, flush_tid,
> > > > > +                          oldest_flush_tid);
> > > > >                 spin_unlock(&ci->i_ceph_lock);
> > > > > 
> > > > 
> > > > This is too expensive for the syncfs case: the MDS needs to flush its
> > > > journal for each dirty inode. We'd be better off tracking dirty inodes
> > > > by session, and only setting the flag when flushing the last inode on
> > > > the session's dirty list.
> > 
> > I think this will be more difficult than that...
> > 
> > > Yeah, see the userspace Client::_sync_fs() where we have an internal
> > > flags argument which is set on the last cap in the dirty set and tells
> > > the actual cap message flushing code to set FLAG_SYNC on the
> > > MClientCaps message. I presume the kernel is operating on a similar
> > > principle here?
> > 
> > Not today, but we need it to.
> > 
> > The caps are not tracked on a per-session basis (as Zheng points out),
> > and the locking and ordering of these requests is not as straightforward
> > as it is in the (trivial) libcephfs case. Fixing this will be a lot more
> > invasive than I had originally hoped.
> > 
> > It's also not 100% clear to me how we'll gauge which cap will be
> > "last".  As we move the last cap on the session's list from dirty to
> > flushing state, we can mark it so that it sets the SYNC flag when it
> > goes out, but what happens if we have a process that is actively
> > dirtying other inodes during this? You might never see the per-session
> > list go empty.
> > 
> 
> It doesn't need to be strictly the 'last' one, just the one flushed right before exiting the loop.
> 

What loop?

I added a little debugging code and ran Jan's reproducer, and it turns
out that we generally move the inode to the flushing state in
ceph_put_wrbuffer_cap_refs() while doing writeback under the sync()
syscall. By the time we get to any sort of looping in the ceph code, the
cap flushes have already been issued.
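
Something along the lines of the sketch below is enough to see this
(illustrative only, not the actual debugging patch; the
ceph_debug_flush_origin name is made up, while dout(), ceph_vinop() and
dump_stack() are existing helpers). Called from the spot where the caps
move from dirty to flushing, it shows which syscall path issued the
flush:

/*
 * Illustrative debugging sketch: dump the call chain when an inode's
 * caps transition from dirty to flushing, to see whether sync() or the
 * writeback path triggered the flush.
 */
static void ceph_debug_flush_origin(struct inode *inode)
{
	dout("%s %p %llx.%llx moving from dirty to flushing\n",
	     __func__, inode, ceph_vinop(inode));
	dump_stack();
}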

My tentative idea was to just check whether this was the last dirty cap
associated with the session and set the flag on it if so, but we don't
really have visibility into that info, because we don't determine the
session until we move the inode from dirty->flushing.
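
To make the idea concrete: with a (currently nonexistent, purely
hypothetical) per-session s_cap_dirty list, the check could look
something like the sketch below. The __last_dirty_inode_on_session
helper and the s_cap_dirty list are assumptions for illustration; the
other fields and helpers exist today.

/*
 * Hypothetical sketch: assumes dirty caps are tracked on a per-session
 * s_cap_dirty list (today they live on a single per-mdsc list, and we
 * don't learn the session until the dirty->flushing transition).
 */
static bool __last_dirty_inode_on_session(struct ceph_mds_session *session,
					  struct ceph_inode_info *ci)
{
	lockdep_assert_held(&session->s_cap_lock);

	/* is ci the only inode left on this session's dirty list? */
	return !list_empty(&ci->i_dirty_item) &&
	       list_is_singular(&session->s_cap_dirty);
}

The caller building the cap message would then set CEPH_CLIENT_CAPS_SYNC
only when that returns true, instead of on every flush as in the patch
above.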

So at this point, I'm still looking at options for fixing this. I really
don't want to just hack this in, as the technical debt in this code is
already substantial and that'll just make it worse.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>



