> On Sep 4, 2020, at 6:01 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
> On Mon, Aug 24, 2020 at 11:42:18AM -0400, Chuck Lever wrote:
>> 
>> 
>>> On Aug 24, 2020, at 10:22 AM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>> 
>>> On Mon, Aug 24, 2020 at 09:39:31AM -0400, Chuck Lever wrote:
>>>> 
>>>> 
>>>>> On Aug 19, 2020, at 5:29 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Tue, Aug 18, 2020 at 05:26:26PM -0400, Chuck Lever wrote:
>>>>>> 
>>>>>>> On Aug 17, 2020, at 6:20 PM, Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Sun, Aug 16, 2020 at 04:46:00PM -0400, Chuck Lever wrote:
>>>>>>> 
>>>>>>>> In order of application:
>>>>>>>> 
>>>>>>>> 5920afa3c85f ("nfsd: hook nfsd_commit up to the nfsd_file cache")
>>>>>>>> 961.68user 5252.40system 20:12.30elapsed 512%CPU, 2541 DELAY errors
>>>>>>>> These results are similar to v5.3.
>>>>>>>> 
>>>>>>>> fd4f83fd7dfb ("nfsd: convert nfs4_file->fi_fds array to use nfsd_files")
>>>>>>>> Does not build
>>>>>>>> 
>>>>>>>> eb82dd393744 ("nfsd: convert fi_deleg_file and ls_file fields to nfsd_file")
>>>>>>>> 966.92user 5425.47system 33:52.79elapsed 314%CPU, 1330 DELAY errors
>>>>>>>> 
>>>>>>>> Can you take a look and see if there's anything obvious?
>>>>>>> 
>>>>>>> Unfortunately nothing about the file cache code is very obvious to me.
>>>>>>> I'm looking at it....
>>>>>>> 
>>>>>>> It adds some new nfserr_jukebox returns in nfsd_file_acquire. Those
>>>>>>> mostly look like kmalloc failures, the one I'm not sure about is the
>>>>>>> NFSD_FILE_HASHED check.
>>>>>>> 
>>>>>>> Or maybe it's the lease break there.
>>>>>> 
>>>>>> nfsd_file_acquire() always calls fh_verify() before it invokes nfsd_open().
>>>>>> Replacing nfs4_get_vfs_file's nfsd_open() call with nfsd_file_acquire() adds
>>>>>> almost 10 million fh_verify() calls to my test run.
>>>>> 
>>>>> Checking out the code as of fd4f83fd7dfb....
>>>>> 
>>>>> nfsd_file_acquire() calls nfsd_open_verified().
>>>>> 
>>>>> And nfsd_open() is basically just fh_verify()+nfsd_open_verified().
>>>>> 
>>>>> So it doesn't look like the replacement of nfsd_open() by
>>>>> nfsd_file_acquire() should have changed the number of fh_verify() calls.
>>>> 
>>>> I see a lot more vfs_setlease() failures after fd4f83fd7dfb.
>>>> check_conflicting_open() fails because "inode is open for write":
>>>> 
>>>> 1780         if (arg == F_RDLCK)
>>>> 1781                 return inode_is_open_for_write(inode) ? -EAGAIN : 0;
>>>> 
>>>> The behavior on the wire is that the server simply doesn't hand out
>>>> many delegations.
>>>> 
>>>> NFSv4 OPEN uses nfsd_file_acquire() now, but I don't see CLOSE
>>>> releasing the cached file descriptor. Wouldn't that cached
>>>> descriptor conflict with subsequent OPENs?
>>> 
>>> Could be, yes.
>>> 
>>> That also reminds me of this patch, did I already send it to you?
>> 
>> I don't have this one. I can try it.
> 
> No difference, I take it?

It helps a little, but it doesn't appear to address the regression.
It does seem like a reasonable change to take anyway. You could take
both of the earlier patches in this thread for v5.10, IMO.

> There could also be something wrong with nfsd4_check_conflicting_opens()
> that's preventing delegations when it shouldn't.

I've added some instrumentation there. It looks like @writes is still
positive because of a previous cached open. The NFS request stream goes
something like this:

    OPEN(CREATE) access=WRITE name=yada
    WRITE
    CLOSE
    ... here is where the cached WR_ONLY open remains
    OPEN(NOCREATE) access=READ name=yada
    ... conflict detected, no soup for you.
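To make the failure mode concrete, here is a rough sketch of the
accounting involved. This is only an illustration, not the actual
nfsd4_check_conflicting_opens() code, and the helper name is made up:
the point is that the inode's writer count is compared against the
write opens attached to the *current* nfs4_file, and the WR_ONLY
struct file that the filecache kept alive past the CLOSE is no longer
attached to any nfs4_file, so nothing subtracts it.

/*
 * Illustration only -- not the actual nfsd4_check_conflicting_opens()
 * code.  i_writecount still reflects the WR_ONLY struct file that the
 * filecache kept open after the client's CLOSE, but that file is no
 * longer attached to the new nfs4_file, so nothing subtracts it and
 * the read delegation is refused.
 */
static bool writes_still_conflict(struct nfs4_file *fp, struct inode *inode)
{
	int writes = atomic_read(&inode->i_writecount);

	/* subtract the write opens nfsd holds for this nfs4_file */
	if (fp->fi_fds[O_WRONLY])
		writes--;
	if (fp->fi_fds[O_RDWR])
		writes--;

	/* the orphaned cached open from the earlier OPEN is invisible here */
	return writes > 0;
}

If CLOSE released the cached descriptor, the count would drop back to
zero here and the read delegation could be offered.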
I checked, and before the filecache support was added, the server does
offer a read delegation in this case.

> There might also be some way fh_verify() could be smarter. There's a
> big comment there explaining why we repeat the permission checks each
> time, but maybe we could keep a flag somewhere that tracks whether we
> really need to call nfsd_setuser again.

I'll have a look at that more closely once we are clear of this
performance regression. The architecture of how fh_verify() is used has
been around forever, and it seems like a separate issue. (A rough
sketch of that idea is below.)

I've added some tracepoints that report fh_verify() calls and each
change to the current FH. Here's a single OPEN call on the server.
Note the number of fh_verify() calls and how long they take (timestamps
are in seconds since boot). This is from a single-thread test on the
server, so there's no other activity going on.

6398.079629: nfsd_compound: xid=0x59f1891e opcnt=5
6398.079632: nfsd_fh_current: xid=0x59f1891e fh_hash=0x0a393779
6398.079685: nfsd_fh_verify: xid=0x59f1891e fh_hash=0x0a393779 type=NONE access=BYPASS_GSS status=0
6398.079687: nfsd_compound_status: xid=0x59f1891e op=1/5 OP_PUTFH status=0
6398.079689: nfsd_open_args: xid=0x59f1891e seqid=12 type=NOCREATE claim=NULL share=READ name=Makefile
6398.079753: nfsd_fh_verify: xid=0x59f1891e fh_hash=0x0a393779 type=DIR access=EXEC status=0
6398.079789: nfsd_fh_verify: xid=0x59f1891e fh_hash=0x2c049d00 type=REG access=READ|READ_IF_EXEC status=0
6398.079830: nfsd_fh_verify: xid=0x59f1891e fh_hash=0x2c049d00 type=REG access=READ|OWNER_OVERRIDE status=0
6398.079833: nfsd_file_acquire: xid=0x59f1891e hash=0xe28 inode=0xffff88873f4777b0 may_flags=READ ref=2 nf_flags=HASHED|REFERENCED nf_may=READ nf_file=0xffff88872cc53e00 status=0
6398.079868: generic_add_lease: dev=0x0:0x23 ino=0xe42af wcount=0 rcount=1 icount=2 fl_owner=0xffff88871914cfd8 fl_flags=FL_DELEG fl_type=F_RDLCK
6398.079876: locks_get_lock_context: dev=0x0:0x23 ino=0xe42af type=F_RDLCK ctx=0xffff8886d8a43b58
6398.079881: nfsd_deleg_open: client 5f529c2f:b9aedadc stateid 002471d9:00000001
6398.079883: nfsd_deleg_none: client 5f529c2f:b9aedadc stateid 002471d8:00000001
6398.079901: nfsd_fh_current: xid=0x59f1891e fh_hash=0x2c049d00
6398.079904: nfsd_compound_status: xid=0x59f1891e op=2/5 OP_OPEN status=0
6398.079906: nfsd_compound_status: xid=0x59f1891e op=3/5 OP_GETFH status=0
6398.079941: nfsd_fh_verify: xid=0x59f1891e fh_hash=0x2c049d00 type=NONE access= status=0
6398.079947: nfsd_compound_status: xid=0x59f1891e op=4/5 OP_ACCESS status=0
6398.079948: nfsd_get_fattr4: xid=0x59f1891e bm[0]=TYPE|CHANGE|SIZE|FSID|FILEID bm[1]=MODE|NUMLINKS|OWNER|OWNER_GROUP|RAWDEV|SPACE_USED|TIME_ACCESS|TIME_METADATA|TIME_MODIFY|MOUNTED_ON_FILEID bm[2]=
6398.079980: nfsd_fh_verify: xid=0x59f1891e fh_hash=0x2c049d00 type=NONE access= status=0
6398.079985: nfsd_compound_status: xid=0x59f1891e op=5/5 OP_GETATTR status=0

> Based on your and Frank's experiences I'm also sympathetic to the idea
> that maybe the filehandle cache just gets in the way in the v4 case.

I think it does get in the way for NFSv4. Clearly, if there are NFSv3
users, the filecache is a good way to block handing out delegations.
For NFSv4, once there are no clients (or local users) using a file, the
server should be permitted to offer delegations on it. I don't see any
benefit to holding onto file descriptors that clients say explicitly
they are done using.

My worry is that after the 5 kernel releases since this code went in,
it won't be a simple set of reverts to back out NFSv4 filecache usage.
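Coming back to the fh_verify() suggestion above: a flag that avoids the
redundant nfsd_setuser() work might look roughly like the sketch below.
Purely illustrative: rq_setuser_exp is a made-up field, and a real
version would also need to notice when the credential changes and when
the export table is reloaded.

/*
 * Sketch only: rq_setuser_exp does not exist in struct svc_rqst.  The
 * idea is to remember which export we last called nfsd_setuser() for
 * and skip the call when it would repeat the same work.  The cached
 * value would have to be cleared at the start of each RPC.
 */
static int nfsd_setuser_once(struct svc_rqst *rqstp, struct svc_export *exp)
{
	if (rqstp->rq_setuser_exp == exp)
		return 0;	/* same export, same cred: nothing to do */

	rqstp->rq_setuser_exp = exp;
	return nfsd_setuser(rqstp, exp);
}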
> --b.
> 
>>> Author: J. Bruce Fields <bfields@xxxxxxxxxx>
>>> Date:   Fri Jul 17 18:54:54 2020 -0400
>>> 
>>>     nfsd: Cache R, RW, and W opens separately
>>> 
>>>     The nfsd open code has always kept separate read-only, read-write, and
>>>     write-only opens as necessary to ensure that when a client closes or
>>>     downgrades, we don't retain more access than necessary.
>>> 
>>>     Honestly, I'm not sure if that's completely necessary, but I'd rather
>>>     stick to that behavior.
>>> 
>>>     Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>
>>> 
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index 82198d747c4c..4b6f70e0d987 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -891,7 +891,7 @@ nfsd_file_find_locked(struct inode *inode, unsigned int may_flags,
>>> 
>>>  	hlist_for_each_entry_rcu(nf, &nfsd_file_hashtbl[hashval].nfb_head,
>>>  				 nf_node, lockdep_is_held(&nfsd_file_hashtbl[hashval].nfb_lock)) {
>>> -		if ((need & nf->nf_may) != need)
>>> +		if (nf->nf_may != need)
>>>  			continue;
>>>  		if (nf->nf_inode != inode)
>>>  			continue;
>> 
>> --
>> Chuck Lever

--
Chuck Lever