Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server

dai.ngo@xxxxxxxxxx · Tue, 30 Nov 2021 19:52:10 -0800

On 11/30/21 5:37 AM, Trond Myklebust wrote:
On Mon, 2021-11-29 at 23:22 -0800, dai.ngo@xxxxxxxxxx wrote:
On 11/29/21 8:57 PM, Trond Myklebust wrote:
On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote:
On Nov 29, 2021, at 11:08 PM, Trond Myklebust
<trondmy@xxxxxxxxxxxxxxx> wrote:

On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@xxxxxxxxxx>
wrote:

On 11/29/21 1:10 PM, Chuck Lever III wrote:

On Nov 29, 2021, at 2:36 PM, Dai Ngo
<dai.ngo@xxxxxxxxxx>
wrote:
On 11/29/21 11:03 AM, Chuck Lever III wrote:
Hello Dai!

On Nov 29, 2021, at 1:32 PM, Dai Ngo
<dai.ngo@xxxxxxxxxx>
wrote:

On 11/29/21 9:30 AM, J. Bruce Fields wrote:
On Mon, Nov 29, 2021 at 09:13:16AM -0800,
dai.ngo@xxxxxxxxxx wrote:
Hi Bruce,

On 11/21/21 7:04 PM, dai.ngo@xxxxxxxxxx wrote:
On 11/17/21 4:34 PM, J. Bruce Fields wrote:
On Wed, Nov 17, 2021 at 01:46:02PM -0800,
dai.ngo@xxxxxxxxxx wrote:
On 11/17/21 9:59 AM,
dai.ngo@xxxxxxxxxx wrote:
On 11/17/21 6:14 AM, J. Bruce Fields
wrote:
On Tue, Nov 16, 2021 at 03:06:32PM -
0800,
dai.ngo@xxxxxxxxxx wrote:
Just a reminder that this patch is
still
waiting for your review.
Yeah, I was procrastinating and
hoping
yo'ud
figure out the pynfs
failure for me....
Last time I ran 4.0 OPEN18 test by
itself
and
it passed. I will run
all OPEN tests together with 5.15-rc7
to
see if
the problem you've
seen still there.
I ran all tests in nfsv4.1 and nfsv4.0
with
courteous and non-courteous
5.15-rc7 server.

Nfs4.1 results are the same for both
courteous
and
non-courteous server:
Of those: 0 Skipped, 0 Failed, 0
Warned,
169
Passed
Results of nfs4.0 with non-courteous
server:
Of those: 8 Skipped, 1 Failed, 0
Warned,
577
Passed
test failed: LOCK24

Results of nfs4.0 with courteous server:
Of those: 8 Skipped, 3 Failed, 0
Warned,
575
Passed
tests failed: LOCK24, OPEN18, OPEN30

OPEN18 and OPEN30 test pass if each is
run by
itself.
Could well be a bug in the tests, I don't
know.
The reason OPEN18 failed was because the test
timed
out waiting for
the reply of an OPEN call. The RPC connection
used
for the test was
configured with 15 secs timeout. Note that
OPEN18
only fails when
the tests were run with 'all' option, this
test
passes if it's run
by itself.

With courteous server, by the time OPEN18
runs,
there
are about 1026
courtesy 4.0 clients on the server and all of
these
clients have opened
the same file X with WRITE access. These
clients
were
created by the
previous tests. After each test completed,
since
4.0
does not have
session, the client states are not cleaned up
immediately on the
server and are allowed to become courtesy
clients.

When OPEN18 runs (about 20 minutes after the
1st
test
started), it
sends OPEN of file X with
OPEN4_SHARE_DENY_WRITE
which causes the
server to check for conflicts with courtesy
clients.
The loop that
checks 1026 courtesy clients for share/access
conflict took less
than 1 sec. But it took about 55 secs, on my
VM,
for
the server
to expire all 1026 courtesy clients.

I modified pynfs to configure the 4.0 RPC
connection
with 60 seconds
timeout and OPEN18 now consistently passed.
The
4.0
test results are
now the same for courteous and non-courteous
server:

8 Skipped, 1 Failed, 0 Warned, 577 Passed

Note that 4.1 tests do not suffer this
timeout
problem because the
4.1 clients and sessions are destroyed after
each
test completes.
Do you want me to send the patch to increase
the
timeout for pynfs?
or is there any other things you think we
should
do?
I don't know.

55 seconds to clean up 1026 clients is about 50ms
per
client, which is
pretty slow.  I wonder why.  I guess it's
probably
updating the stable
storage information.  Is /var/lib/nfs/ on your
server
backed by a hard
drive or an SSD or something else?
My server is a virtualbox VM that has 1 CPU, 4GB
RAM
and
64GB of hard
disk. I think a production system that supports
this
many
clients should
have faster CPUs, faster storage.

I wonder if that's an argument for limiting the
number of
courtesy
clients.
I think we might want to treat 4.0 clients a bit
different
from 4.1
clients. With 4.0, every client will become a
courtesy
client after
the client is done with the export and unmounts it.
It should be safe for a server to purge a client's
lease
immediately
if there is no open or lock state associated with it.
In this case, each client has opened files so there are
open
states
associated with them.

When an NFSv4.0 client unmounts, all files should be
closed
at that
point,
I'm not sure pynfs does proper clean up after each
subtest,
I
will
check. There must be state associated with the client
in
order
for
it to become courtesy client.
Makes sense. Then a synthetic client like pynfs can DoS a
courteous
server.

so the server can wait for the lease to expire and
purge
it
normally. Or am I missing something?
When 4.0 client lease expires and there are still
states
associated
with the client then the server allows this client to
become
courtesy
client.
I think the same thing happens if an NFSv4.1 client
neglects
to
send
DESTROY_SESSION / DESTROY_CLIENTID. Either such a client
is
broken
or malicious, but the server faces the same issue of
protecting
itself from a DoS attack.

IMO you should consider limiting the number of courteous
clients
the server can hold onto. Let's say that number is 1000.
When
the
server wants to turn a 1001st client into a courteous
client,
it
can simply expire and purge the oldest courteous client
on
its
list. Otherwise, over time, the 24-hour expiry will
reduce
the
set of courteous clients back to zero.

What do you think?
Limiting the number of courteous clients to handle the
cases of
broken/malicious 4.1 clients seems reasonable as the last
resort.

I think if a malicious 4.1 clients could mount the server's
export,
opens a file (to create state) and repeats the same with a
different
client id then it seems like some basic security was
already
broken;
allowing unauthorized clients to mount server's exports.
You can do this today with AUTH_SYS. I consider it a genuine
attack
surface.

I think if we have to enforce a limit, then it's only for
handling
of seriously buggy 4.1 clients which should not be the
norm.
The
issue with this is how to pick an optimal number that is
suitable
for the running server which can be a very slow or a very
fast
server.

Note that even if we impose an limit, that does not
completely
solve
the problem with pynfs 4.0 test since its RPC timeout is
configured
with 15 secs which just enough to expire 277 clients based
on
53ms
for each client, unless we limit it ~270 clients which I
think
it's
too low.

This is what I plan to do:

1. do not support 4.0 courteous clients, for sure.
Not supporting 4.0 isn’t an option, IMHO. It is a fully
supported
protocol at this time, and the same exposure exists for 4.1,
it’s
just a little harder to exploit.

If you submit the courteous server patch without support for
4.0,
I
think it needs to include a plan for how 4.0 will be added
later.

Why is there a problem here? The requirements are the same for
4.0
and
4.1 (or 4.2). If the lease under which the courtesy lock was
established has expired, then that courtesy lock must be
released
if
some other client requests a lock that conflicts with the
cached
lock
(unless the client breaks the courtesy framework by renewing
that
original lease before the conflict occurs). Otherwise, it is
completely
up to the server when it decides to actually release the lock.

For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells
the
server when the client is actually done with the lease, making
it
easy
to determine when it is safe to release all the courtesy locks.
However
if the client does not send DESTROY_CLIENTID, then we're in the
same
situation with 4.x (x>0) as we would be with bog standard
NFSv4.0.
The
lease has expired, and so the courtesy locks are liable to
being
dropped.
I agree the situation is the same for all minor versions.

At Hammerspace we have implemented courtesy locks, and our
strategy
is
that when a conflict occurs, we drop the entire set of courtesy
locks
so that we don't have to deal with the "some locks were
revoked"
scenario. The reason is that when we originally implemented
courtesy
locks, the Linux NFSv4 client support for lock revocation was a
lot
less sophisticated than today. My suggestion is that you might
therefore consider starting along this path, and then refining
the
support to make revocation more nuanced once you are confident
that
the
coarser strategy is working as expected.
Dai’s implementation does all that, and takes the coarser
approach at
the moment. There are plans to explore the more nuanced behavior
(by
revoking only the conflicting lock instead of dropping the whole
lease) after this initial work is merged.

The issue is there are certain pathological client behaviors
(whether
malicious or accidental) that can run the server out of
resources,
since it is holding onto lease state for a much longer time. We
are
simply trying to design a lease garbage collection scheme to meet
that challenge.

I think limiting the number of courteous clients is a simple way
to
do this, but we could also shorten the courtesy lifetime as more
clients enter that state, to ensure that they don’t overrun the
server’s memory. Another approach might be to add a shrinker that
purges the oldest courteous clients when the server comes under
memory pressure.

We already have a scanner that tries to release all client state
after
1 lease period. Just extend that to do it after 10 lease periods.
If a
network partition hasn't recovered after 10 minutes, you probably
have
bigger problems.
Currently the courteous server allows 24hr for the network partition
to
heal before releasing all client state. That seems to be excessive
but
it was suggested for longer network partition conditions when
switch/routers
being repaired/upgraded.

You can limit the number of clients as well, but that leads into a
rats
nest of other issues that have nothing to do with courtesy locks
and
everything to do with the fact that any client can hold a lot of
state.
The issue we currently have with courteous server and pynfs 4.0 tests
is the number of courteous 4.0 clients the server has to expire when
a
share reservation conflict occurs when servicing the OPEN. Each
client
owns only few state in this case so we think the server spent most
time
for deleting client's record in /var/lib/nfs. This is why we plan to
limit the number of courteous clients for now. As a side effect, it
might
also help to reduce resource consumption too.
Then kick off a thread or work item to do that asynchronously in the
background, and return NFS4ERR_DELAY to the clients that were trying to
grab locks in the meantime.

Thanks Trond, I think this is a reasonable approach. The behavior would
be similar to a delegation recall during the OPEN.

My plan is:

1. If the number of conflict clients is less than 100 (some numbers that
cover realistic usage) then release all their state synchronously in
the OPEN call, and returns NFS4_OK to the NFS client. Most of conflicts
should be handled by this case.

2. If the number of conflict clients is more than 100 then release the
state of the 1st 100 clients as in (1) and trigger the laundromat thread
to release state of the rest of the conflict clients, and return
NFS4ERR_DELAY to the NFS client. This should be a rare condition.

-Dai

The above process is hardly just confined to NFSv4.0 clients. If there
is a network partition, then the exact same record deleting needs to be
applied to all NFSv4.1 and NFSv4.2 clients that hold locks and are
unable to renew their leases, so you might as well make it work for
everyone.