On Tue, 2019-04-30 at 14:44 -0400, Scott Mayhew wrote:
> On Thu, 18 Apr 2019, Trond Myklebust wrote:
>
> > On Thu, 2019-04-18 at 16:43 -0400, Scott Mayhew wrote:
> > > On Thu, 18 Apr 2019, Trond Myklebust wrote:
> > >
> > > > Hi Scott,
> > > >
> > > > On Thu, 2019-04-18 at 09:37 -0400, Scott Mayhew wrote:
> > > > > When the client does an open(CLAIM_FH) and the server already
> > > > > has open state for that open owner and file, what's supposed
> > > > > to happen?  Currently the server returns the existing stateid
> > > > > with the seqid bumped, but it looks like the client is
> > > > > expecting a new stateid (I'm seeing the state manager spending
> > > > > a lot of time waiting in nfs_set_open_stateid_locked() due to
> > > > > NFS_STATE_CHANGE_WAIT being set in the state flags by
> > > > > nfs_need_update_open_stateid()).
> > > > >
> > > > > Looking at rfc5661 section 18.16.3, I see:
> > > > >
> > > > > | CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN request |
> > > > > |                      | and there is no previous state associated  |
> > > > > |                      | with the file for the client.  With        |
> > > > > |                      | CLAIM_NULL, the file is identified by the  |
> > > > > |                      | current filehandle and the specified       |
> > > > > |                      | component name.  With CLAIM_FH (new to     |
> > > > > |                      | NFSv4.1), the file is identified by just   |
> > > > > |                      | the current filehandle.                    |
> > > > >
> > > > > So it seems like maybe the server should be tossing the old
> > > > > state and returning a new stateid?
> > > >
> > > > No. As far as the protocol is concerned, the only difference
> > > > between CLAIM_NULL and CLAIM_FH is in how the client identifies
> > > > the file (in the first case, through an implicit lookup, and in
> > > > the second case through a file handle). The client should be
> > > > free to intermix the two types of OPEN, and it should expect the
> > > > resulting stateids to depend only on whether or not the
> > > > open_owner matches. If the open_owner matches an existing
> > > > stateid, then that stateid is bumped and returned.
> > > >
> > > > I'm not aware of any expectation in the client that this should
> > > > not be the case, so if you are seeing different behaviour, then
> > > > something else must be at work here. Is the client perhaps
> > > > mounting the same filesystem in two different places in such a
> > > > way that the super block is not being shared?
> > >
> > > No, it's just a single 4.1 mount w/ the default mount options.
> > >
> > > For a bit of background, I've been trying to track down a problem
> > > in RHEL where the SEQ4_STATUS_RECALLABLE_STATE_REVOKED flag is
> > > getting permanently set because the nfs4_client->cl_revoked list
> > > on the server is non-empty... yet there's no longer open state on
> > > the client.
> > >
> > > I can reproduce it pretty easily in RHEL using 2 VMs, each with
> > > 2-4 CPUs and 4-8G of memory. The server has 64 nfsd threads and a
> > > 15 second lease time.
> > >
> > > On the client I'm running the following to add a 10ms delay to
> > > CB_RECALL replies:
> > >
> > > # stap -gve 'global count = 0; probe module("nfsv4").function("nfs4_callback_recall") { printf("%s: %d\n", ppfunc(), ++count); mdelay(10); }'
> > >
> > > then in another window I open a bunch of files:
> > >
> > > # for i in `seq -w 1 5000`; do sleep 2m </mnt/t/dir1/file.$i & done
> > >
> > > (Note: I already created the files ahead of time.)
> > >
> > > As soon as the bash prompt returns on the client, I run the
> > > following on the server:
> > >
> > > # for i in `seq -w 1 5000`; do date >/export/dir1/file.$i & done
> > >
> > > At that point, any further SEQUENCE ops will have the recallable
> > > state revoked flag set on the client until the fs is unmounted.
> > >
> > > If I run the same steps on Fedora clients with recent kernels, I
> > > don't have the problem with the recallable state revoked flag,
> > > but I'm getting some other strangeness. Everything starts out
> > > fine with nfs_reap_expired_delegations() doing TEST_STATEID and
> > > FREE_STATEID, but once the state manager starts calling
> > > nfs41_open_expired(), things sort of grind to a halt and I see 1
> > > OPEN and 1 or 2 TEST_STATEID ops every 5 seconds in wireshark.
> > > It stays that way until the files are closed on the client, when
> > > I see a slew of DELEGRETURNs and FREE_STATEIDs... but I'm only
> > > seeing 3 or 4 CLOSE ops. If I poke around in crash on the server,
> > > I see a ton of open stateids:
> > >
> > > crash> epython fs/nfsd/print-client-state-info.py
> > > nfsd_net = 0xffff93e473511000
> > >   nfs4_client = 0xffff93e3f7954980
> > >     nfs4_stateowner = 0xffff93e4058cc360 num_stateids = 4997  <---- only 3 CLOSE ops were received
> > >     num_openowners = 1
> > >     num_layouts = 0
> > >     num_delegations = 0
> > >     num_sessions = 1
> > >     num_copies = 0
> > >     num_revoked = 0
> > >     cl_cb_waitq_qlen = 0
> > >
> > > Those stateids stick around until the fs is unmounted (and the
> > > DESTROY_STATEID ops return NFS4ERR_CLIENTID_BUSY while doing so).
> > >
> > > Both VMs are running 5.0.6-200.fc29.x86_64, but the server also
> > > has the "nfsd: Don't release the callback slot unless it was
> > > actually held" patch you sent a few weeks ago as well as the
> > > "nfsd: CB_RECALL can race with FREE_STATEID" patch I sent today.
> >
> > Are the calls to nfs41_open_expired() succeeding? It sounds like
> > they might not be.
>
> They're succeeding, they're just taking 5 seconds each.
>
> To make matters worse, due to the aggressive lease time on the
> server, the client was doing a SEQUENCE op periodically while the
> OPENs were occurring. The SEQUENCE replies also had the
> RECALLABLE_STATE_REVOKED flag set, and since they weren't coming
> from the state manager the nfs4_slot->privileged field was unset, so
> we'd wind up calling nfs41_handle_recallable_state_revoked() again,
> undoing what little progress was made. Bumping the lease time to 20
> seconds made that go away, but the overall problem is still there.
>
> Is it really necessary for nfs41_handle_recallable_state_revoked()
> to mark all open state as needing recovery? After all, the only
> recallable state that can be revoked is layouts and delegations.
> If I change nfs41_handle_recallable_state_revoked() so that instead
> of calling nfs41_handle_some_state_revoked() we call
> nfs_mark_test_expired_all_delegations() and
> nfs4_schedule_state_manager(), then the problem seems to go away.

That would be OK with me. That code probably pre-dates the existence
of nfs_mark_test_expired_all_delegations().

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
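
For illustration, a minimal sketch of the change Scott describes, as it
might look in the client's fs/nfs/nfs4state.c. This is not a tested
patch: the dprintk line and the use of clp->cl_hostname are assumed for
context rather than copied from a particular tree; only the two calls
named in the thread are taken from the discussion.

	static void nfs41_handle_recallable_state_revoked(struct nfs_client *clp)
	{
		/*
		 * Recallable state is only delegations and layouts, so rather
		 * than calling nfs41_handle_some_state_revoked() (which marks
		 * all open state as needing recovery), just mark the
		 * delegations for testing and kick the state manager.
		 */
		nfs_mark_test_expired_all_delegations(clp);
		nfs4_schedule_state_manager(clp);

		dprintk("%s: Recallable state revoked on server %s!\n", __func__,
				clp->cl_hostname);
	}

With a change along these lines, the state manager would respond to
SEQ4_STATUS_RECALLABLE_STATE_REVOKED by testing and freeing delegation
stateids (as in the TEST_STATEID/FREE_STATEID path Scott observed from
nfs_reap_expired_delegations()) instead of re-opening every file.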