Re: Soft lockups on kerberised NFSv4.0 clients


 



----- Original Message -----
> From: "Tuomas Räsänen" <tuomasjjrasanen@xxxxxxxxxx>
> 
> ----- Original Message -----
> > From: "Jeff Layton" <jlayton@xxxxxxxxxxxxxxx>
> > 
> > Ok, now that I look closer at your stack trace the problem appears to
> > be that the unlock code is waiting for the lock context's io_count to
> > drop to zero before allowing the unlock to proceed.
> > 
> > That likely means that there is some outstanding I/O that isn't
> > completing, but it's possible that the problem is the CB_RECALL is
> > being ignored. This will probably require some analysis of wire captures.
> 
> The lockup mechanism seems to be as follows: the process (which is always
> firefox) is killed, and it tries to unlock the file (which is always an
> mmapped sqlite3 WAL index) while it still has some pending I/O going on.
> The return value of nfs_wait_bit_killable() (-ERESTARTSYS from
> fatal_signal_pending(current)) is ignored and the process just keeps
> looping because io_count seems to be stuck at 1 (I still don't know
> why..). This raised a few questions:
> 
> Why is the return value of nfs_wait_bit_killable() not handled? Should it
> be handled, and if so, how?
> 
> Why isn't the whole iocounter wait simply implemented with wait_on_bit()?
> 
> I changed do_unlk() to use wait_on_bit() instead of nfs_iocounter_wait(),
> and the soft lockups seem to have disappeared:
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 284ca90..eb41b32 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -781,7 +781,11 @@ do_unlk(struct file *filp, int cmd, struct file_lock *fl, int is_local)
>  
>         l_ctx = nfs_get_lock_context(nfs_file_open_context(filp));
>         if (!IS_ERR(l_ctx)) {
> -               status = nfs_iocounter_wait(&l_ctx->io_count);
> +               struct nfs_io_counter *io_count = &l_ctx->io_count;
> +               status = wait_on_bit(&io_count->flags,
> +                                    NFS_IO_INPROGRESS,
> +                                    nfs_wait_bit_killable,
> +                                    TASK_KILLABLE);
>                 nfs_put_lock_context(l_ctx);
>                 if (status < 0)
>                         return status;
> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> index 2ffebf2..6b9089c 100644
> --- a/fs/nfs/pagelist.c
> +++ b/fs/nfs/pagelist.c
> @@ -87,6 +87,7 @@ nfs_page_free(struct nfs_page *p)
>  static void
>  nfs_iocounter_inc(struct nfs_io_counter *c)
>  {
> +       set_bit(NFS_IO_INPROGRESS, &c->flags);
>         atomic_inc(&c->io_count);
>  }
>  
> Any thoughts? I really want to understand the issue at hand and to help
> fix it properly.

The same kind of patch was proposed by David Jeffery in http://www.spinics.net/lists/linux-nfs/msg45806.html, and the discussion in that thread answered a lot of my questions.

The proposed patch was not accepted, but another patch from David fixes the soft lockup symptom (as tested with jam.c) as well: http://www.spinics.net/lists/linux-nfs/msg45807.html

Case closed.

-- 
Tuomas
