Greetings,

I recently started testing a build of 2.6.35 to hopefully relieve some issues we have on our login boxes. Specifically, I was after this commit:

http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commit;h=126e216a8730532dfb685205309275f87e3d133e

The issue we've run into is that a user loses their credentials but has a process looping on a read/write of their Kerberized NFSv4 home directory without checking the return value. Not only does this spam the logs, but it also prevents rpc.gssd from handling anyone else's logins, effectively taking down the service for anyone not already connected.

I was hoping this commit would protect rpc.gssd from any potential flooding of requests, but it all depends on how the user loses their credentials. If their credentials have expired or their cache has become corrupt, rpc.gssd returns EKEYEXPIRED and the kernel rate limits the requests to rpc.gssd via negative caching. If the user's credential cache gets destroyed, however, rpc.gssd returns EACCES, and the user process can cause the kernel to hammer rpc.gssd.

The kicker here is that pam_krb5 destroys credentials on logout by default, so if someone's using screen or long-running background processes in their home directory, it's a ticking time bomb waiting to take out rpc.gssd. That's assuming a benign user, as well. A malicious user could easily kdestroy, wait for their credentials to expire from the cache in the kernel, and start tying up rpc.gssd with failed requests.

With this in mind, I initially patched the kernel to negative cache entries with EACCES errors, in addition to EKEYEXPIRED errors. But the more I thought about it, the more it seemed appropriate to subject all possible errors to negative caching. The underlying question is: is there any possible error from rpc.gssd where it would be appropriate to allow a process to cause another request to rpc.gssd immediately? If there isn't, negative caching all errors seems reasonable.
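To make the difference between the two cases concrete, here is a rough userspace sketch of the rate limiting described above. It is not kernel code: the names sim_cred, cred_is_negative, count_upcalls and the RETRY_DELAY value are made up for illustration, and it only models a process retrying once per second against a daemon that always fails with the same error. EKEYEXPIRED in errno.h assumes a Linux build.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical retry window in seconds; illustrative only, not the kernel's value. */
#define RETRY_DELAY 5

enum cache_policy {
	CACHE_EKEYEXPIRED_ONLY,	/* only EKEYEXPIRED creates a negative entry */
	CACHE_ALL_ERRORS	/* every failed upcall creates a negative entry */
};

struct sim_cred {
	bool negative;		/* stands in for the RPCAUTH_CRED_NEGATIVE flag */
	long upcall_timestamp;	/* stands in for gc_upcall_timestamp, in seconds */
};

/* True while a negative entry is still inside its retry window. */
static bool cred_is_negative(const struct sim_cred *c, long now)
{
	return c->negative && now - c->upcall_timestamp < RETRY_DELAY;
}

/* Record the daemon's answer; an uncached error leaves the flag untouched. */
static void handle_downcall_result(struct sim_cred *c, int err, long now,
				   enum cache_policy policy)
{
	if (err == 0)
		c->negative = false;
	else if (policy == CACHE_ALL_ERRORS || err == -EKEYEXPIRED)
		c->negative = true;
	c->upcall_timestamp = now;
}

/*
 * Count the upcalls generated by a process that retries once per second
 * for `seconds` seconds while the daemon always answers `daemon_err`.
 */
int count_upcalls(int daemon_err, long seconds, enum cache_policy policy)
{
	struct sim_cred cred = { false, 0 };
	int upcalls = 0;

	assert(daemon_err < 0);	/* this sketch only models failing upcalls */
	assert(seconds >= 0);

	for (long now = 0; now < seconds; now++) {
		if (cred_is_negative(&cred, now))
			continue;	/* fail fast without bothering the daemon */
		upcalls++;
		handle_downcall_result(&cred, daemon_err, now, policy);
	}
	return upcalls;
}
```

Under these assumptions, count_upcalls(-EACCES, 20, CACHE_EKEYEXPIRED_ONLY) reaches the daemon on every one of the 20 attempts, while count_upcalls(-EACCES, 20, CACHE_ALL_ERRORS) only gets through 4 times, the same as the EKEYEXPIRED case under either policy.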
Here's a simple patch implementing the behavior of negative caching of every failed request, as a proof of concept, I guess. With it applied, I have yet to produce a scenario where rpc.gssd becomes unresponsive.

Let me know what you think. I'd love to see a fix for this behavior enter the kernel at some point, as it's been rather disruptive on our login boxes lately.

diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 3835ce3..38bdf90 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -362,7 +362,7 @@ gss_handle_downcall_result(struct gss_cred *gss_cred, struct gss_upcall_msg *gss
 		clear_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
 		gss_cred_set_ctx(&gss_cred->gc_base, gss_msg->ctx);
 		break;
-	case -EKEYEXPIRED:
+	default:
 		set_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
 	}
 	gss_cred->gc_upcall_timestamp = jiffies;