Hi Andy-

Thanks for taking the time to discuss this with me. I've copied linux-nfs to make this e-mail also an upstream bug report.

As we saw in the network capture, recovery of GSS contexts after a server reboot fails in certain cases with NFSv4.0 and NFSv4.1 mount points.

The reproducer is a simple program that generates one NFS WRITE periodically, run while the server repeatedly reboots (or one cluster head fails over to the other and back). The goal of the reproducer is to identify problems with state recovery without a lot of other I/O going on to clutter up the network capture.

In the failing case, sec=krb5 is specified on the mount point, and the reproducer is run as root. We've found this combination fails with both NFSv4.0 and NFSv4.1.

At mount time, the client establishes a GSS context for lease management operations, which is bound to the client's NFS service principal and uses GSS service "integrity." Call this GSS context 1.

When the reproducer starts, a second GSS context is established for NFS operations associated with that user. Since the reproducer is running as root, this context is also bound to the client's NFS service principal, but it uses GSS service "none" (reflecting the explicit request for "sec=krb5"). Call this GSS context 2.

After the server reboots, the client re-establishes a TCP connection with the server and performs a RENEW operation using context 1. Thanks to the server reboot, contexts 1 and 2 are now stale. The server thus rejects the RPC with RPCSEC_GSS_CTXPROBLEM.

The client performs a GSS_INIT_SEC_CONTEXT via an NFSv4 NULL operation. Call this GSS context 3. Interestingly, the client does not resend the RENEW operation at this point (if it did, we wouldn't see this problem at all).

The client then attempts to resume the reproducer workload. It sends an NFSv4 WRITE operation, using the first available GSS context in UID 0's credential cache, which is context 3, already bound to the client's NFS service principal. But GSS service "none" is used for this operation, since it is on behalf of the mount where sec=krb5 was specified. The RPC is accepted, but the server reports NFS4ERR_STALE_STATEID, since it has recently rebooted.

The client responds by attempting state recovery. The first operation it tries is another RENEW. Since this is a lease management operation, the client looks in UID 0's credential cache again and finds the recently established context 3. It tries the RENEW operation using GSS context 3, this time with GSS service "integrity." The server rejects the RENEW RPC with AUTH_FAILED, and the client reports that "check lease failed" and terminates state recovery.

The client re-drives the WRITE operation with the stale stateid, with predictable results. It again tries to recover state by sending a RENEW, still using the same GSS context 3 with service "integrity," and gets the same result. A (perhaps slow-motion) STALE_STATEID loop ensues, and the client mount point is deadlocked.

Your analysis was that because the reproducer is run as root, both the reproducer's I/O operations and the lease management operations attempt to use the same GSS context in UID 0's credential cache, but each uses a different GSS service. The key issue seems to be why the client is correctly able to establish two separate GSS contexts for UID 0 when the mount is first established, but after a server reboot attempts to use the same GSS context with two different GSS services.
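For reference, a minimal reproducer along these lines looks something like the sketch below. The path and interval are placeholders, not the exact values from our test runs; the point is just one synchronous WRITE every few seconds with nothing else on the wire.

	/*
	 * Reproducer sketch: append a small record to a file on the
	 * sec=krb5 mount once per interval, run as root while the
	 * server reboots repeatedly.  Path and interval are
	 * placeholders.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>

	int main(void)
	{
		const char *path = "/mnt/nfs/reproducer.dat";
		const char buf[] = "ping\n";

		for (;;) {
			int fd = open(path, O_WRONLY | O_CREAT |
					    O_APPEND | O_SYNC, 0644);
			if (fd < 0) {
				perror("open");
			} else {
				/*
				 * O_SYNC makes the write synchronous, so
				 * the client emits an NFS WRITE right away
				 * instead of caching the data.
				 */
				if (write(fd, buf, strlen(buf)) < 0)
					perror("write");
				close(fd);
			}
			sleep(5);
		}
		return 0;
	}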
One solution is to introduce a quick check, before a context is used, to see whether the GSS service bound to it matches the GSS service the caller intends to use. I'm not sure how that can be done without exposing a window where another caller requests the use of a GSS context and grabs the fresh one before it can be used by our first caller and bound to its desired GSS service. Other solutions might be to somehow isolate the credential cache used for lease management operations, or to split credential caches by GSS service.
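To make that concrete, here is a rough sketch of the kind of check I have in mind. The names below are made up for illustration, not the actual sunrpc symbols, and the locking is only hinted at:

	/* Illustrative only -- not real net/sunrpc code. */
	#include <sys/types.h>

	enum gss_svc { GSS_SVC_NONE, GSS_SVC_INTEGRITY, GSS_SVC_PRIVACY };

	struct gss_ctx_entry {
		uid_t		uid;	/* owner, e.g. 0 for root */
		enum gss_svc	svc;	/* service bound at establishment */
		/* ... context handle, expiry, etc. ... */
	};

	/*
	 * Evaluated under the credential cache lock as part of the
	 * lookup, so a context bound to the wrong service is treated
	 * as a non-match rather than checked after the fact.
	 */
	static int gss_ctx_usable(const struct gss_ctx_entry *ctx,
				  uid_t uid, enum gss_svc wanted)
	{
		return ctx->uid == uid && ctx->svc == wanted;
	}

If the service were part of the lookup key itself (effectively splitting the cache by GSS service), a fresh context bound to the wrong service would never be visible to a caller wanting a different one, which might close the window described above.

--
Chuck Lever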