On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote:
> On Nov 29, 2021, at 11:08 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
> > > On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
> > > > On 11/29/21 1:10 PM, Chuck Lever III wrote:
> > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
> > > > > > On 11/29/21 11:03 AM, Chuck Lever III wrote:
> > > > > > > Hello Dai!
> > > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
> > > > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote:
> > > > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@xxxxxxxxxx wrote:
> > > > > > > > > > Hi Bruce,
> > > > > > > > > > On 11/21/21 7:04 PM, dai.ngo@xxxxxxxxxx wrote:
> > > > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> > > > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@xxxxxxxxxx wrote:
> > > > > > > > > > > > > On 11/17/21 9:59 AM, dai.ngo@xxxxxxxxxx wrote:
> > > > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> > > > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@xxxxxxxxxx wrote:
> > > > > > > > > > > > > > > > Just a reminder that this patch is still waiting for your review.
> > > > > > > > > > > > > > > Yeah, I was procrastinating and hoping you'd figure out the pynfs failure for me....
> > > > > > > > > > > > > > Last time I ran the 4.0 OPEN18 test by itself and it passed. I will run all OPEN tests together with 5.15-rc7 to see if the problem you've seen is still there.
> > > > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous 5.15-rc7 servers.
> > > > > > > > > > > > > NFSv4.1 results are the same for both courteous and non-courteous server:
> > > > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
> > > > > > > > > > > > > Results of nfs4.0 with non-courteous server:
> > > > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> > > > > > > > > > > > > test failed: LOCK24
> > > > > > > > > > > > > Results of nfs4.0 with courteous server:
> > > > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
> > > > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30
> > > > > > > > > > > > > OPEN18 and OPEN30 pass if each is run by itself.
> > > > > > > > > > > > Could well be a bug in the tests, I don't know.
> > > > > > > > > > > The reason OPEN18 failed is that the test timed out waiting for the reply to an OPEN call. The RPC connection used for the test was configured with a 15-second timeout. Note that OPEN18 only fails when the tests are run with the 'all' option; the test passes if it's run by itself.
> > > > > > > > > > > With the courteous server, by the time OPEN18 runs there are about 1026 courtesy 4.0 clients on the server, and all of these clients have opened the same file X with WRITE access. These clients were created by the previous tests. After each test completes, since 4.0 does not have sessions, the client state is not cleaned up immediately on the server and the clients are allowed to become courtesy clients.
> > > > > > > > > > > When OPEN18 runs (about 20 minutes after the 1st test started), it sends an OPEN of file X with OPEN4_SHARE_DENY_WRITE, which causes the server to check for conflicts with courtesy clients. The loop that checks the 1026 courtesy clients for share/access conflicts took less than 1 sec, but it took about 55 secs, on my VM, for the server to expire all 1026 courtesy clients.
> > > > > > > > > > > I modified pynfs to configure the 4.0 RPC connection with a 60-second timeout, and OPEN18 now consistently passes. The 4.0 test results are now the same for the courteous and non-courteous server:
> > > > > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> > > > > > > > > > > Note that the 4.1 tests do not suffer this timeout problem because the 4.1 clients and sessions are destroyed after each test completes.
> > > > > > > > > > Do you want me to send the patch to increase the timeout for pynfs, or are there other things you think we should do?
> > > > > > > > > I don't know.
> > > > > > > > > 55 seconds to clean up 1026 clients is about 50ms per client, which is pretty slow. I wonder why. I guess it's probably updating the stable storage information. Is /var/lib/nfs/ on your server backed by a hard drive or an SSD or something else?
> > > > > > > > My server is a VirtualBox VM that has 1 CPU, 4GB RAM and 64GB of hard disk. I think a production system that supports this many clients should have faster CPUs and faster storage.
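As a rough sanity check on the arithmetic above, here is a small stand-alone C sketch (illustrative only: the 1026-client count, the ~54 ms per-client expiry cost and the 15/60 second RPC timeouts are the figures quoted in this thread, not new measurements):

/* Back-of-the-envelope numbers for the OPEN18 timeout discussion. */
#include <stdio.h>

int main(void)
{
    const double clients = 1026.0;
    const double ms_per_client = 54.0;      /* ~50-55 ms observed above */
    const double rpc_timeout_s[] = { 15.0, 60.0 };
    int i;

    /* Total time to expire every courtesy client one at a time. */
    printf("expiring %.0f courtesy clients: ~%.1f s\n",
           clients, clients * ms_per_client / 1000.0);

    /* How many expirations fit inside each RPC timeout. */
    for (i = 0; i < 2; i++)
        printf("a %.0f s RPC timeout covers ~%.0f expirations\n",
               rpc_timeout_s[i],
               rpc_timeout_s[i] * 1000.0 / ms_per_client);
    return 0;
}

It is the same division Bruce does above (55 s / 1026 is roughly 50 ms per client), but it makes plain why a 15-second pynfs RPC timeout cannot outlast a full expiry pass while a 60-second one can.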
> > > > > > > > > I wonder if that's an argument for limiting the number of courtesy clients.
> > > > > > > > I think we might want to treat 4.0 clients a bit differently from 4.1 clients. With 4.0, every client will become a courtesy client after the client is done with the export and unmounts it.
> > > > > > > It should be safe for a server to purge a client's lease immediately if there is no open or lock state associated with it.
> > > > > > In this case, each client has opened files so there are open states associated with them.
> > > > > > > When an NFSv4.0 client unmounts, all files should be closed at that point,
> > > > > > I'm not sure pynfs does proper cleanup after each subtest; I will check. There must be state associated with the client in order for it to become a courtesy client.
> > > > > Makes sense. Then a synthetic client like pynfs can DoS a courteous server.
> > > > > > > so the server can wait for the lease to expire and purge it normally. Or am I missing something?
> > > > > > When a 4.0 client's lease expires and there are still states associated with the client, the server allows this client to become a courtesy client.
> > > > > I think the same thing happens if an NFSv4.1 client neglects to send DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken or malicious, but the server faces the same issue of protecting itself from a DoS attack.
> > > > > IMO you should consider limiting the number of courteous clients the server can hold onto. Let's say that number is 1000. When the server wants to turn a 1001st client into a courteous client, it can simply expire and purge the oldest courteous client on its list. Otherwise, over time, the 24-hour expiry will reduce the set of courteous clients back to zero.
> > > > > What do you think?
> > > > Limiting the number of courteous clients to handle the cases of broken/malicious 4.1 clients seems reasonable as a last resort.
> > > > I think if a malicious 4.1 client could mount the server's export, open a file (to create state) and repeat the same with a different client id, then it seems like some basic security was already broken: unauthorized clients are being allowed to mount the server's exports.
> > > You can do this today with AUTH_SYS. I consider it a genuine attack surface.
> > > > I think if we have to enforce a limit, then it's only for handling seriously buggy 4.1 clients, which should not be the norm. The issue with this is how to pick an optimal number that is suitable for the running server, which can be a very slow or a very fast server.
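To make the cap-and-purge-oldest policy described above concrete, here is a minimal user-space model (the names, the list layout and the 1000-client cap are illustrative assumptions, not the actual nfsd data structures), assuming courtesy clients are kept in the order they entered the courtesy state:

/* Capped courtesy-client list: admitting a client beyond the cap
 * expires the oldest courtesy client first. Illustrative model only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define COURTESY_CLIENT_LIMIT 1000

struct courtesy_client {
    unsigned long clientid;
    time_t courtesy_since;              /* when the lease expired */
    struct courtesy_client *next;
};

static struct courtesy_client *oldest, *newest;
static unsigned int nr_courtesy;

/* Stand-in for the real expiry path (release state, drop the record). */
static void expire_client(struct courtesy_client *clp)
{
    printf("expiring courtesy client %lu\n", clp->clientid);
    free(clp);
}

static void add_courtesy_client(unsigned long clientid)
{
    struct courtesy_client *clp = calloc(1, sizeof(*clp));

    if (!clp)
        return;
    clp->clientid = clientid;
    clp->courtesy_since = time(NULL);

    /* Over the cap: purge the oldest courtesy client to make room. */
    if (nr_courtesy >= COURTESY_CLIENT_LIMIT) {
        struct courtesy_client *victim = oldest;

        oldest = victim->next;
        if (!oldest)
            newest = NULL;
        nr_courtesy--;
        expire_client(victim);
    }

    if (newest)
        newest->next = clp;
    else
        oldest = clp;
    newest = clp;
    nr_courtesy++;
}

int main(void)
{
    unsigned long id;

    /* Model the pynfs 4.0 run above: 1026 clients go courteous. */
    for (id = 1; id <= 1026; id++)
        add_courtesy_client(id);
    printf("%u courtesy clients retained\n", nr_courtesy);
    return 0;
}

As Dai notes, the hard part is choosing the cap so that it suits both a very slow and a very fast server, since the real cost is how quickly that server can expire each purged client.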
> > > > Note that even if we impose a limit, that does not completely solve the problem with the pynfs 4.0 tests, since their RPC timeout is configured as 15 secs, which is just enough to expire about 277 clients at 53 ms per client, unless we limit it to ~270 clients, which I think is too low.
> > > > This is what I plan to do:
> > > > 1. do not support 4.0 courteous clients, for sure.
> > > Not supporting 4.0 isn’t an option, IMHO. It is a fully supported protocol at this time, and the same exposure exists for 4.1; it’s just a little harder to exploit.
> > > If you submit the courteous server patch without support for 4.0, I think it needs to include a plan for how 4.0 will be added later.
> > Why is there a problem here? The requirements are the same for 4.0 and 4.1 (or 4.2). If the lease under which the courtesy lock was established has expired, then that courtesy lock must be released if some other client requests a lock that conflicts with the cached lock (unless the client breaks the courtesy framework by renewing that original lease before the conflict occurs). Otherwise, it is completely up to the server when it decides to actually release the lock.
> > For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the server when the client is actually done with the lease, making it easy to determine when it is safe to release all the courtesy locks. However, if the client does not send DESTROY_CLIENTID, then we're in the same situation with 4.x (x > 0) as we would be with bog-standard NFSv4.0. The lease has expired, and so the courtesy locks are liable to being dropped.
> I agree the situation is the same for all minor versions.
> > At Hammerspace we have implemented courtesy locks, and our strategy is that when a conflict occurs, we drop the entire set of courtesy locks so that we don't have to deal with the "some locks were revoked" scenario. The reason is that when we originally implemented courtesy locks, the Linux NFSv4 client support for lock revocation was a lot less sophisticated than today. My suggestion is that you might therefore consider starting along this path, and then refining the support to make revocation more nuanced once you are confident that the coarser strategy is working as expected.
> Dai’s implementation does all that, and takes the coarser approach at the moment. There are plans to explore the more nuanced behavior (by revoking only the conflicting lock instead of dropping the whole lease) after this initial work is merged.
> The issue is there are certain pathological client behaviors (whether malicious or accidental) that can run the server out of resources, since it is holding onto lease state for a much longer time. We are simply trying to design a lease garbage collection scheme to meet that challenge.
> I think limiting the number of courteous clients is a simple way to do this, but we could also shorten the courtesy lifetime as more clients enter that state, to ensure that they don’t overrun the server’s memory. Another approach might be to add a shrinker that purges the oldest courteous clients when the server comes under memory pressure.
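One way to picture the "shorten the courtesy lifetime as more clients enter that state" idea above is a lifetime that scales down with the current courtesy-client count. The sketch below is purely illustrative: the 90-second lease, the 24-hour ceiling, the 1000-client soft limit and the halving rule are assumptions made up for the example, not anything taken from nfsd:

/* Illustrative scaling of the courtesy lifetime with load: the more
 * courtesy clients the server is carrying, the sooner each one becomes
 * eligible for the existing expiry scan. All constants are assumptions. */
#include <stdio.h>

#define LEASE_TIME_S        90UL          /* typical nfsd lease */
#define MAX_COURTESY_S      (24UL * 3600) /* 24-hour courtesy window */
#define COURTESY_SOFT_LIMIT 1000UL        /* start shrinking here */

static unsigned long courtesy_lifetime(unsigned long nr_courtesy)
{
    unsigned long lifetime = MAX_COURTESY_S;

    /* Halve the window for every multiple of the soft limit,
     * but never go below one lease period. */
    while (nr_courtesy >= COURTESY_SOFT_LIMIT && lifetime > LEASE_TIME_S) {
        lifetime /= 2;
        nr_courtesy -= COURTESY_SOFT_LIMIT;
    }
    return lifetime < LEASE_TIME_S ? LEASE_TIME_S : lifetime;
}

int main(void)
{
    unsigned long counts[] = { 10, 500, 1000, 2000, 5000, 20000 };
    int i;

    for (i = 0; i < 6; i++)
        printf("%5lu courtesy clients -> lifetime %lu s\n",
               counts[i], courtesy_lifetime(counts[i]));
    return 0;
}

A shrinker, the other option mentioned above, could reuse the same oldest-first ordering when the kernel signals memory pressure.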
We already have a scanner that tries to release all client state after 1 lease period. Just extend that to do it after 10 lease periods. If a network partition hasn't recovered after 10 minutes, you probably have bigger problems.

You can limit the number of clients as well, but that leads into a rat's nest of other issues that have nothing to do with courtesy locks and everything to do with the fact that any client can hold a lot of state.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx