Re: NFS v4 + kerberos: 4 minute window of slowness

Dan Mezhiborsky <daniel.mezhiborsky@xxxxxxxxxx> · Fri, 22 Oct 2021 14:22:04 -0400

On Wed, 2020-05-20 at 11:48 +0100, Robert Milkowski wrote:
> Polite ping...
> 
> > -----Original Message-----
> > From: Robert Milkowski <rmilkowski@xxxxxxxxx>
> > Sent: 14 May 2020 21:39
> > To: linux-nfs@xxxxxxxxxxxxxxx; trond.myklebust@xxxxxxxxxxxxxxx;
> > 'Chuck
> > Lever' <chuck.lever@xxxxxxxxxx>; 'Schumaker, Anna'
> > <Anna.Schumaker@xxxxxxxxxx>
> > Subject: RE: NFS v4 + kerberos: 4 minute window of slowness
> > 
> > +Trond, Chuck, Anna
> > 
> > 
> > Ping...
> > 
> > > -----Original Message-----
> > > From: Robert Milkowski <rmilkowski@xxxxxxxxx>
> > > Sent: 05 May 2020 18:35
> > > To: linux-nfs@xxxxxxxxxxxxxxx
> > > Subject: NFS v4 + kerberos: 4 minute window of slowness
> > > 
> > > Hi,
> > > 
> > > Currently the last 4 minutes before kernel gss context expires,
> > > all
> > > writes to NFSv4 are synchronous and all dirty pages associated
> > > with
> > > the file being written to are being destaged.
> > > This will continue for the 4 minutes until the context expires,
> > > at
> > > which point it gets refreshed and everything gets back to normal.
> > > 
> > > The rpc.gssd by default sets the timeout to match the Kerberos
> > > service
> > > ticket, but this can be lowered by using -t option.
> > > In fact many sites set it to lower value, like for example 30
> > > minutes.
> > > This means that every 30 minutes, the last 4 minutes results in
> > > severely slower writes (assuming these are buffered - no O_DSYNC,
> > etc.).
> > > In extreme case, when one sets the timeout to 5 minutes, during
> > > the 4
> > > minutes out of the minutes, there will be the slowness observed.
> > > 
> > > 
> > > I understand the idea behind this mechanism - I guess it tries to
> > > avoid situation when a gss context can't be refreshed (due to
> > > error or
> > > account disabled, etc.), and it expires suddenly nfs client
> > > wouldn't
> > > be able to destage all the buffered writes. The way it is
> > > currently
> > > implemented though is rather crude.
> > > In my opinion, instead of making everything slow for the whole 4
> > > minutes, it should first try to refresh the context immediately
> > > and if
> > > successful things go back to normal, if it can't refresh the
> > > context
> > > then it should continue with the previous one and revert to the
> > > current behaviour. I implemented a naïve quick fix which does
> > > exactly
> > > that (attached at the end of this email).
> > > 
> > > 
> > > How to re-produce.
> > > 
> > > 
> > > $ uname -r
> > > 5.7.0-rc4+
> > > 
> > > $ grep -- -t /etc/sysconfig/nfs
> > > RPCGSSDARGS="-t 300"
> > > 
> > > I'm setting it to 5 minutes so I can quickly see the behaviour
> > > without
> > > having to wait for too long.
> > > 
> > > 
> > > Now, let's generate a small write every 10s to a file on
> > > nfsv4,sec=krb5 filesystem and record how long each write takes.
> > > Since these are buffered writes it should be very quick most of
> > > the
> > > time.
> > > 
> > > $ while [ 1 ]; do strace -qq -tT -v -e trace=write /bin/echo aa
> > > >f1;
> > > rm f1; sleep 10; done
> > > 
> > > 15:22:41 write(1, "aa\n", 3)            = 3 <0.000108>
> > > 15:22:51 write(1, "aa\n", 3)            = 3 <0.000113>
> > > 15:23:01 write(1, "aa\n", 3)            = 3 <0.000111>
> > > 15:23:11 write(1, "aa\n", 3)            = 3 <0.000112>
> > > 15:23:21 write(1, "aa\n", 3)            = 3 <0.001510>     <<<<<<
> > > becomes
> > > slow
> > > 15:23:31 write(1, "aa\n", 3)            = 3 <0.001622>
> > > 15:23:41 write(1, "aa\n", 3)            = 3 <0.001553>
> > > 15:23:51 write(1, "aa\n", 3)            = 3 <0.001495>
> > > ...
> > > 15:27:01 write(1, "aa\n", 3)            = 3 <0.001528>
> > > 15:27:12 write(1, "aa\n", 3)            = 3 <0.001553>
> > > 15:27:22 write(1, "aa\n", 3)            = 3 <0.000104>     <<<<<<
> > > becomes
> > > fast again
> > > 15:27:32 write(1, "aa\n", 3)            = 3 <0.000125>
> > > 15:27:42 write(1, "aa\n", 3)            = 3 <0.000129>
> > > 15:27:52 write(1, "aa\n", 3)            = 3 <0.000113>
> > > 15:28:02 write(1, "aa\n", 3)            = 3 <0.000112>
> > > 15:28:12 write(1, "aa\n", 3)            = 3 <0.000112>
> > > 15:28:22 write(1, "aa\n", 3)            = 3 <0.001510>     <<<<<<
> > > slow
> > > ...
> > > 15:32:02 write(1, "aa\n", 3)            = 3 <0.001501>
> > > 15:32:12 write(1, "aa\n", 3)            = 3 <0.001440>
> > > 15:32:22 write(1, "aa\n", 3)            = 3 <0.000136>     <<<<<<
> > > fast
> > > 15:32:32 write(1, "aa\n", 3)            = 3 <0.000109>
> > > 15:32:42 write(1, "aa\n", 3)            = 3 <0.000110>
> > > 15:32:52 write(1, "aa\n", 3)            = 3 <0.000112>
> > > 15:33:02 write(1, "aa\n", 3)            = 3 <0.000103>
> > > 15:33:12 write(1, "aa\n", 3)            = 3 <0.000112>
> > > 15:33:22 write(1, "aa\n", 3)            = 3 <0.001405>     <<<<<<
> > > slow
> > > 15:33:32 write(1, "aa\n", 3)            = 3 <0.001393>
> > > 15:33:42 write(1, "aa\n", 3)            = 3 <0.001479>
> > > ...
> > > 
> > > 
> > > 
> > > So we have 4 minute long windows of slowness followed by 1 minute
> > > window when writes are fast.
> > > 
> > > 	15:23:21  -  15:27:22        slow
> > > 	15:27:22  -  15:28:22        fast
> > > 	15:28:22  -  15:32:22        slow
> > > 	15:32:22  -  15:33:22        fast
> > > 
> > > 
> > > 
> > > After some tracing with systemtap and looking at the source code,
> > > I
> > > found where the issue is coming from.
> > > The nfs_file_write() function ends up calling
> > > nfs_ctx_key_to_expire()
> > > on each write, which in turn calls gss_key_timeout() which has
> > > hard-coded value of 240s
> > > (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo).
> > > 
> > > 
> > > nfs_file_write()
> > > ...
> > >         result = nfs_key_timeout_notify(file, inode);
> > >         if (result)
> > >                 return result;
> > > ...
> > >         if (nfs_need_check_write(file, inode)) {
> > >                 int err = nfs_wb_all(inode); ...
> > > 
> > > 
> > > /*
> > >  * Avoid buffered writes when a open context credential's key
> > > would
> > >  * expire soon.
> > >  *
> > >  * Returns -EACCES if the key will expire within
> > > RPC_KEY_EXPIRE_FAIL.
> > >  *
> > >  * Return 0 and set a credential flag which triggers the inode to
> > > flush
> > >  * and performs  NFS_FILE_SYNC writes if the key will expired
> > > within
> > >  * RPC_KEY_EXPIRE_TIMEO.
> > >  */
> > > int
> > > nfs_key_timeout_notify(struct file *filp, struct inode *inode) {
> > >         struct nfs_open_context *ctx =
> > > nfs_file_open_context(filp);
> > > 
> > >         if (nfs_ctx_key_to_expire(ctx, inode) &&
> > >             !ctx->ll_cred)
> > >                 /* Already expired! */
> > >                 return -EACCES;
> > >         return 0;
> > > }
> > > 
> > > 
> > > nfs_need_check_write()
> > > ...
> > >         if (nfs_ctx_key_to_expire(ctx, inode))
> > >                 return 1;
> > >         return 0;
> > > 
> > > 
> > > 
> > > nfs_write_end()
> > > ...
> > >         if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
> > >                 status = nfs_wb_all(mapping->host); ...
> > > 
> > > 
> > > 
> > > /*
> > >  * Test if the open context credential key is marked to expire
> > > soon.
> > >  */
> > > bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct
> > > inode
> > > *inode)
> > > {
> > >         struct rpc_auth *auth = NFS_SERVER(inode)->client-
> > > >cl_auth;
> > >         struct rpc_cred *cred = ctx->ll_cred;
> > >         struct auth_cred acred = {
> > >                 .cred = ctx->cred,
> > >         };
> > > 
> > >         if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) {
> > >                 put_rpccred(cred);
> > >                 ctx->ll_cred = NULL;
> > >                 cred = NULL;
> > >         }
> > >         if (!cred)
> > >                 cred = auth->au_ops->lookup_cred(auth, &acred,
> > > 0);
> > >         if (!cred || IS_ERR(cred))
> > >                 return true;
> > >         ctx->ll_cred = cred;
> > >         return !!(cred->cr_ops->crkey_timeout &&
> > >                   cred->cr_ops->crkey_timeout(cred));
> > > }
> > > 
> > > 
> > > 
> > > net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout          =
> > > gss_key_timeout,
> > > 
> > > 
> > > /*
> > >  * Returns -EACCES if GSS context is NULL or will expire within
> > > the
> > >  * timeout (miliseconds)
> > >  */
> > > static int
> > > gss_key_timeout(struct rpc_cred *rc)
> > > {
> > >         struct gss_cred *gss_cred = container_of(rc, struct
> > > gss_cred,
> > > gc_base);
> > >         struct gss_cl_ctx *ctx;
> > >         unsigned long timeout = jiffies + (gss_key_expire_timeo *
> > > HZ);
> > >         int ret = 0;
> > > 
> > >         rcu_read_lock();
> > >         ctx = rcu_dereference(gss_cred->gc_ctx);
> > >         if (!ctx || time_after(timeout, ctx->gc_expiry))
> > >                 ret = -EACCES;
> > >         rcu_read_unlock();
> > > 
> > >         return ret;
> > > }
> > > 
> > > 
> > > 
> > > 
> > > #define GSS_KEY_EXPIRE_TIMEO 240
> > > static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO;
> > > 
> > > 
> > > 
> > > 
> > > 
> > > A naïve attempt at a fix:
> > > 
> > > 
> > > diff --git a/net/sunrpc/auth_gss/auth_gss.c
> > > b/net/sunrpc/auth_gss/auth_gss.c index 25fbd8d9de74..864661bdfdf3
> > > 100644
> > > --- a/net/sunrpc/auth_gss/auth_gss.c
> > > +++ b/net/sunrpc/auth_gss/auth_gss.c
> > > @@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc)
> > > 
> > >         rcu_read_lock();
> > >         ctx = rcu_dereference(gss_cred->gc_ctx);
> > > +        if (ctx && time_after(timeout + (60 * HZ), ctx-
> > > >gc_expiry))
> > > +               clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags);
> > >         if (!ctx || time_after(timeout, ctx->gc_expiry))
> > >                 ret = -EACCES;
> > >         rcu_read_unlock();
> > > 
> > > 
> > > 
> > > 
> > > With the above patch, if there is a write within 300s before a
> > > context
> > > is to expire (use RPCGSSDARGS="-t 400" or any value larger than
> > > 300 to
> > > test), it will now try to refresh the context and if successful
> > > then
> > > writes will be fast again (assuming -t option is >300s and/or krb
> > > ticket is valid for more than 300s).
> > > 
> > > What I haven't tested nor analysed code is what would happen if
> > > it now
> > > fails to refresh the context, but after a quick glance at
> > > gss_refresh() it does seem it would continue using the previous
> > cred...
> > > Is this the correct approach to fix the issue, or can you suggest
> > > some
> > > other approach?
> > > 
> > > 
> > > 
> > > --
> > > Robert Milkowski
> 
> 

Hello all, is this behavior something that can be looked into again? In
some applications, it's preferable to have a short timeout for frequent
auth refreshes.

Thank you!
Dan Mezhiborsky