On Fri, 2022-07-01 at 19:58 +0200, Mkrtchyan, Tigran wrote:
> Hi NFS folks,
>
> Recently we got a kind of DDoS from one of our users: 5k jobs were
> aggressively reading a handful of files. Of course we have an
> overload protection; however, such a large number of requests by a
> single user didn't give other users a chance to perform any IO. As
> we extensively use pNFS, such user behavior makes some DSes
> unavailable to other users.
>
> To address this issue, we are looking at some kind of
> per-user-principal rate limiter. All users will get some IO portion,
> and if there are no requests from other users, then a single user
> can get it all. Not an ideal solution, of course, but a good
> starting point.
>
> So, the question is how to tell the aggressive user to back off?
> Delaying the response will block all other requests from the same
> host for other users. Returning NFS4ERR_DELAY will have the same
> effect (this is what we do now). NFSv4.1 session slots are
> client-wide; thus, any increase or decrease per client ID will
> either give more slots to the aggressive user or reduce them for
> all others as well.
>
> Are there any developments in the direction of per-client (cgroups
> or namespaces) timeout/error handling? Are there any
> NFS-client-friendly solutions, better than returning NFS4ERR_DELAY?

Here are a few suggestions:

1) Recall the layout from the offending client.

2) Define QoS policies for the connections using the kernel Traffic
   Control mechanisms (see the tc sketch below).

3) Use mirroring/replication to allow read access to the same files
   through multiple data servers.

4) Use NFS re-exporting in order to reduce the load on the data
   servers.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
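
[For illustration: a minimal user-space sketch of the
per-user-principal token-bucket limiter Tigran describes above.
Nothing here is from knfsd or any real server; the names (struct
principal_bucket, limiter_check) and the numbers are invented for the
example. Each principal earns IO credits at a fixed rate; an op that
finds the bucket empty is answered with NFS4ERR_DELAY, which, as noted
in the thread, still stalls the whole client's session slots, so this
is only a starting point.]

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define NFS4ERR_DELAY 10008     /* error value from RFC 7530/8881 */

    struct principal_bucket {
            char   owner[64];       /* user principal, e.g. krb5 name  */
            double tokens;          /* currently available IO credits  */
            double rate;            /* refill rate, in ops per second  */
            double burst;           /* bucket depth (max tokens)       */
            struct timespec last_refill;
    };

    static double ts_delta(const struct timespec *a,
                           const struct timespec *b)
    {
            return (b->tv_sec - a->tv_sec)
                 + (b->tv_nsec - a->tv_nsec) / 1e9;
    }

    /* Return 0 if the op may proceed, or NFS4ERR_DELAY if this
     * principal has exhausted its share and should back off. */
    static int limiter_check(struct principal_bucket *pb, double cost)
    {
            struct timespec now;

            clock_gettime(CLOCK_MONOTONIC, &now);
            /* refill proportionally to elapsed time, capped at burst */
            pb->tokens += pb->rate * ts_delta(&pb->last_refill, &now);
            if (pb->tokens > pb->burst)
                    pb->tokens = pb->burst;
            pb->last_refill = now;

            if (pb->tokens < cost)
                    return NFS4ERR_DELAY;  /* caller replies with DELAY */
            pb->tokens -= cost;
            return 0;
    }

    int main(void)
    {
            struct principal_bucket pb = { .rate = 100.0, .burst = 200.0 };

            strcpy(pb.owner, "aggressive-user@EXAMPLE.ORG");
            clock_gettime(CLOCK_MONOTONIC, &pb.last_refill);
            pb.tokens = pb.burst;

            /* Simulate 300 back-to-back READs of cost 1 each: roughly
             * the first 200 drain the burst, the rest are delayed. */
            for (int i = 0; i < 300; i++)
                    if (limiter_check(&pb, 1.0) == NFS4ERR_DELAY)
                            printf("op %d -> NFS4ERR_DELAY\n", i);
            return 0;
    }

Because the bucket is keyed on the user principal rather than the
client ID, an idle user's credits keep accruing up to the burst limit,
which is what gives a single user the whole share when nobody else is
asking, per the behavior the original post wants.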
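
[Likewise for suggestion 2, a rough HTB example of what such a QoS
policy could look like. The device name, rates, and client address are
placeholders, not from the thread; the idea is to guarantee most of
the bandwidth to well-behaved traffic and cap the offending client's
share.]

    # root HTB qdisc; unclassified traffic goes to class 1:10
    tc qdisc add dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:1 htb rate 10gbit
    # well-behaved traffic: may use the full link
    tc class add dev eth0 parent 1:1 classid 1:10 htb rate 8gbit ceil 10gbit
    # offender class: guaranteed 1gbit, may borrow up to 2gbit
    tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1gbit ceil 2gbit
    # steer egress (READ replies) to the aggressive client into 1:20
    tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.0.2.15/32 flowid 1:20

Note that 192.0.2.15 is a documentation address standing in for the
offending client. TC classifies packets, not users, so this is a
per-host policy; a true per-principal policy would need something
richer, e.g. cgroup-based classification.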