On Sep 10, 2012, at 2:57 PM, Jeff Layton wrote: > On Thu, 6 Sep 2012 15:54:07 -0400 > andros@xxxxxxxxxx wrote: > >> From: Andy Adamson <andros@xxxxxxxxxx> >> >> This patch set is a Request for Comments: I think it does a fair job of >> handling the issue - but it is complicated, and other brains might find a >> better way to get the job done. >> >> We must avoid buffering a WRITE that is using a credential key (e.g. a GSS >> context key) that is about to expire or has expired. We currently will >> paint ourselves into a corner by returning success to the applciation >> for such a buffered WRITE, only to discover that we do not have permission when >> we attempt to flush the WRITE (and potentially associated COMMIT) to disk. >> This results in data corruption. >> >> First, a couple of "setup" patches: >> 1) Patch SUNRPC handle EKEYEXPIRED in call_refreshresult returns EACCES to the >> application on an expired or non-existent gss context when the users (Kerberos) >> credentials have also expired or are non-existent. Current behavior is to >> retry upcalls to GSSD forever. Please see patch comment for detail. >> >> 2) Patch SUNRPC set gss gc_expiry to full lifetime works in conjunction with >> the gssd patch "0001-GSSD-Pass-GSS_context-lifetime-to-the-kernel". The >> gssd patch passes the actual remaining TGT lifetime in the downcall, and >> this kernel patch sets the gss context gc_expiry to this lifetime. >> >> Then the two patches that avoid using an expired credential key: >> 3) Patch SUNRPC new rpc_credops to test credential expiry is the heart of this >> work. It provides the RPC layer helper functions to allow NFS to manage >> data in the face of expired credentials >> >> 4) Patch NFS avoid expired credential keys for buffered writes calls the >> RPC helper functions. >> >> Pages for buffered WRITEs are allocated in nfs_write_begin where we have an >> nfs_open_context and associated rpc_cred. This is a generic rpc_cred, NOT >> the gss_cred used in the actual WRITE RPC. Each WRITE RPC call takes the generic >> rpc_cred (or uses the 'current_cred') uid and uses it to lookup the associated >> gss_cred and gss_context in the call_refresh RPC state. So, there is a >> one-to-one association between the nfs_open_context generic_cred and a >> gss_cred with a matching uid and a valid non expired gss context. >> >> We need to check the nfs_open_context generic cred 'underlying' gss_cred >> gss_context gc_expiry in nfs_write_begin to determine if there is enough time >> left in the gss_context lifetime to complete the buffered WRITE. >> >> I started by adding a "key_timeout" rpc_authops routine only set by the generic >> auth to do this work, called by rpcauth_key_timeout_notify, called from >> nfs_write_begin. It does the lookup of the gss_cred (see the patch for >> fast-tracking of non-gss underlying creds) and then tests the gss_context >> gc_expiry against timeouts by calling a new crkey_timeout rpc_credops set >> only for the gss_cred. >> >> I came up with the idea of two credential key timeout watermarks: >> 1) A high watermark, RPC_KEY_EXPIRE_TIMEO set to 90 seconds. >> 2) A low watermark, RPC_KEY_EXPIRE_FAIL set to 30 seconds. >> NOTE: these timeouts are guesses that work in a VM environment. We may want to >> make them adjustable via module parameters. >> >> If key_timeout is called on a credential with an underlying credential key that >> will expire within high watermark seconds, we set the RPC_CRED_KEY_EXPIRE_SOON >> flag in the generic_cred acred so that the NFS layer can clean up prior to >> key expiration It does this by calling a new crkey_to_expire rpc_credop set >> only for generic creds that tests for the RPC_CRED_KEY_EXPIRE_SOON flag. >> >> If the RPC_CRED_KEY_EXPIRE_SOON flag is set in the nfs_open_context generic >> cred (the acred portion), then nfs_write_end will call nfs_wb_all >> on EVERY WRITE CALL it sees, and nfs_write_rpcsetup will send NFS_FILE_SYNC >> WRITEs. The idea of the high watermark is to give time to flush all buffered >> WRITEs and COMMITs, as well as to continue to WRITE, but only with >> NFS_FILE_SYNC, allowing the application to try to finish writing before the >> gss context expires. NOTE that this means EACH WRITE within the high watermark >> timeout is a singe PAGE of NFS_FILE_SYNC. I think this is fine, because we are >> in a failure mode - the most important thing is to NOT fail a buffered WRITE.. >> >> If key_timeout is called on a credential key that will expire within low >> watermark seconds, we return EACCES from nfs_write_begin to stop >> the buffered WRITE using the credential key. >> >> Checking a generic credential's underlying credential involves a cred lookup. >> To avoid this lookup in the normal case when the underlying credential has >> a key that is valid (before the high watermark), a notify flag is set in >> the generic credential the first time the key_timeout is called. The >> generic credential then stops checking the underlying credential key expiry, and >> the underlying credential (gss_cred) match routine then checks the key >> expiration upon each normal use and sets a flag in the associated generic >> credential only when the key expiration is within the high watermark. >> This in turn signals the generic credential key_timeout to perform the extra >> credetial lookup thereafter. >> >> TING: >> >> I have tested these patches with a TGT of 5 minutes, and a modified >> Connectathon special test: bigfile.c that only writes and never flushes. >> I also increased the file size to 524288000 bytes to give me time. >> I named the test buffered-write. >> >> I kinit, and run 6 instances of buffered write with the first and maybe second >> test completing prior to TGT expiration, and the third/fourth tests spanning >> the high water mark. The fifth test starts within the high water mark, and the >> sixth test starts within the low water mark. >> >> I have seen the expected behavior: test 1,2 succeed. Tests 3-6 fail with >> Permission Denied. Tests 3 and 4 start with normal buffered UNSTABLE writes >> then switch to single page NFS_FILE_SYNC writes with a COMMIT for the >> normal buffered writes. Test 5 has only single page NFS_FILE_SYNC writes, >> test 6 simply fails. >> >> None of the WRITEs fail. >> >> ISSUES >> 1) Once in a while I see "short write" instead of "Permission >> denied". I cannot reproduce this. Even then wireshark shows all writes succeeding. >> >> 2) Although I have tested re-establishing Kerberos credentials after the >> test run, I have not done any testing re-estabishing Kerberos credentials >> during the a test run that is within the high watermark. >> >> 3) General corner case and other testing >> >> -->Andy >> >> Andy Adamson (4): >> SUNRPC handle EKEYEXPIRED in call_refreshresult >> SUNRPC set gss gc_expiry to full lifetime >> SUNRPC new rpc_credops to test credential expiry >> NFS avoid expired credential keys for buffered writes >> >> fs/nfs/file.c | 10 ++++ >> fs/nfs/internal.h | 2 + >> fs/nfs/nfs3proc.c | 6 +- >> fs/nfs/nfs4filelayout.c | 1 - >> fs/nfs/nfs4proc.c | 18 -------- >> fs/nfs/nfs4state.c | 22 --------- >> fs/nfs/proc.c | 43 ------------------ >> fs/nfs/write.c | 29 ++++++++++++ >> include/linux/sunrpc/auth.h | 27 ++++++++++++ >> net/sunrpc/auth.c | 21 +++++++++ >> net/sunrpc/auth_generic.c | 93 ++++++++++++++++++++++++++++++++++++++++ >> net/sunrpc/auth_gss/auth_gss.c | 51 +++++++++++++++++++--- >> net/sunrpc/clnt.c | 1 + >> 13 files changed, 231 insertions(+), 93 deletions(-) >> > > In general, I like this set... > > My only concern is this behavior described in the last patch: > > "If the buffered WRITE is using a credential key that will expire > within low watermark seconds, fail the WRITE in nfs_write_begin > _before_ the WRITE is buffered and return -EACCES to the application." > > Shouldn't rejecting the write attempt be the purview of the server? The TGT lifetime is shared by the client and the server - it's the same, so I don't think we'd gain anything by doing that. > > It seems to me that we'd be best off just switching to synchronous > writes when we start approaching credential expiration, and letting the > server handle the case where the credential expires. You mean switching to direct IO? I believe that means taking a different path in the VFS layer which I found difficult to do from nfs_write_begin. But maybe I just didn't see a good way to do this. > > It seems to me that the client should just be in the business of > recognizing when the credential might be about to expire and attempt to > ensure that we don't end up with a bunch of buffered data in that case. Yes. -->Andy > > -- > Jeff Layton <jlayton@xxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html