> On Apr 22, 2024, at 7:34 PM, NeilBrown <neilb@xxxxxxx> wrote: > > On Mon, 22 Apr 2024, Chuck Lever wrote: >>> On Mon, Apr 22, 2024 at 12:09:19PM +1000, NeilBrown wrote: >>> The calculation of how many clients the nfs server can manage is only an >>> heuristic. Triggering the laundromat to clean up old clients when we >>> have more than the heuristic limit is valid, but refusing to create new >>> clients is not. Client creation should only fail if there really isn't >>> enough memory available. >>> >>> This is not known to have caused a problem is production use, but >>> testing of lots of clients reports an error and it is not clear that >>> this error is justified. >> >> It is justified, see 4271c2c08875 ("NFSD: limit the number of v4 >> clients to 1024 per 1GB of system memory"). In cases like these, >> the recourse is to add more memory to the test system. > > Does each client really need 1MB? > Obviously we don't want all memory to be used by client state.... > >> >> However, that commit claims that the client is told to retry; I >> don't expect client creation to fail outright. Can you describe the >> failure mode you see? > > The failure mode is repeated client retries that never succeed. I think > an outright failure would be preferable - it would be more clear than > memory must be added. > > The server has N active clients and M courtesy clients. > Triggering reclaim when N+M exceeds a limit and M>0 makes sense. > A hard failure (NFS4ERR_RESOURCE) when N exceeds a limit makes sense. > A soft failure (NFS4ERR_DELAY) while reclaim is running makes sense. > > I don't think a retry while N exceeds the limit makes sense. It’s not optimal, I agree. NFSD has to limit the total number of active and courtesy clients, because otherwise it would be subject to an easy (d)DoS attack, which Dai demonstrated to me before I accepted his patch. A malicious actor or broken clients can continue to create leases on the server until it stops responding. I think failing outright would accomplish the mitigation as well as delaying does, but delaying once or twice gives some slack that allows a mount attempt to succeed eventually even when the server temporarily exceeds the maximum client count. Also IMO there could be a rate-limited pr_warn on the server that fires to indicate when a low-memory situation has been reached. The problem with NFS4ERR_RESOURCE, however, is that NFSv4.1 and newer do not have that status code. All versions of NFS have DELAY/JUKEBOX. I recognize that you are tweaking only SETCLIENTID here, but I think behavior should be consistent for all minor versions of NFSv4. > Do we have a count of active vs courtesy clients? Dai can correct me if I’m wrong, but I believe NFSD maintains a count of both. But only the active leases really matter, becase courtesy clients can be dropped as memory becomes tight. Dropping an active lease would be somewhat more catastrophic. — Chuck Lever