Hi NFS folks,

Recently we got a kind of DDoS from one of our users: 5k jobs were aggressively reading a handful of files. Of course we have overload protection; however, such a large number of requests from a single user didn't give other users a chance to perform any I/O. As we make extensive use of pNFS, this user behavior makes some DSes unavailable to other users.

To address this issue, we are looking at some kind of per-user-principal rate limiter. Every user gets some portion of the I/O capacity, and if there are no requests from other users, a single user can take it all. Not an ideal solution, of course, but a good starting point.

So, the question is: how do we tell the aggressive user to back off? Delaying the response will block all other requests from the same host, including those of other users. Returning NFS4ERR_DELAY has the same effect (this is what we do now). NFSv4.1 session slots are client-wide, so any increase or decrease per client ID will either give more slots to the aggressive user or reduce them for all other users as well.

Are there any developments in the direction of per-client (cgroups or namespaces) timeout/error handling? Is there an NFS-client-friendly solution, better than returning NFS4ERR_DELAY?

Thanks in advance,
   Tigran.
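For what it's worth, the fair-share idea described above (every active principal gets an equal slice of the capacity, and a lone user may take it all) could be sketched roughly as below. This is only an illustrative Python sketch under my own assumptions; `FairShareLimiter`, `try_acquire`, the windowed accounting, and all the numbers are hypothetical, not part of any existing NFS server implementation:

```python
import time
from collections import defaultdict

class FairShareLimiter:
    """Illustrative work-conserving per-principal limiter (hypothetical sketch).

    All principals seen within the current accounting window share
    `total_capacity` equally; if only one principal is active, it may
    consume the whole capacity.
    """

    def __init__(self, total_capacity, window=1.0):
        self.total_capacity = total_capacity
        self.window = window                  # seconds per accounting window
        self.used = defaultdict(int)          # ops consumed per principal
        self.window_start = time.monotonic()

    def try_acquire(self, principal):
        """Return True if the request may proceed, False if it should be
        deferred (e.g. answered with NFS4ERR_DELAY or queued)."""
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # New window: forget who was active and start counting afresh.
            self.used.clear()
            self.window_start = now
        self.used[principal] += 0             # mark this principal as active
        share = self.total_capacity / len(self.used)
        # Note: shares shrink as new principals appear mid-window, so the
        # window total can briefly overshoot total_capacity; that bias
        # favours a late, quiet user over the early aggressive one.
        if self.used[principal] < share:
            self.used[principal] += 1
            return True
        return False
```

With a capacity of 10 ops per window, a single user can take all 10; once a second principal shows up, it is still guaranteed its half-share even if the first user has already burned through everything.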