Re: [PATCH] nfsd: fix race condition in nfsd_file_acquire

Hayden Wong <haydenw.kernel@xxxxxxxxx> · Tue, 22 Aug 2023 17:31:39 +0800

On Mon, Aug 21, 2023 at 8:27 AM NeilBrown <neilb@xxxxxxx> wrote:
>
> On Fri, 18 Aug 2023, Haodong Wong wrote:
> > Before Kernel 6.1, we observed the following OOPS in the stress test
> > caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING,
> > and smp_mb__after_atomic() should be a paire.
>
> If you saw this "before kernel 6.1" does that mean you don't see it
> after kernel 6.1?  So has it already been fixed?
>
I just found after the patched c4c649ab413ba "NFSD: Convert filecache
to rhltable" ,  this issue should not have .
It was fixed as below:
-nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
+nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
+               bool want_gc)
 {
        struct nfsd_file *nf;

        nf = kmem_cache_alloc(nfsd_file_slab, GFP_KERNEL);
-       if (nf) {
-               INIT_LIST_HEAD(&nf->nf_lru);
-               nf->nf_birthtime = ktime_get();
-               nf->nf_file = NULL;
-               nf->nf_cred = get_current_cred();
-               nf->nf_net = key->net;
-               nf->nf_flags = 0;
-               __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
-               __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
-               if (key->gc)
-                       __set_bit(NFSD_FILE_GC, &nf->nf_flags);
-               nf->nf_inode = key->inode;
-               refcount_set(&nf->nf_ref, 1);
-               nf->nf_may = key->need;
-               nf->nf_mark = NULL;
-       }
+       if (unlikely(!nf))
+               return NULL;
+
+       INIT_LIST_HEAD(&nf->nf_lru);
+       nf->nf_birthtime = ktime_get();
+       nf->nf_file = NULL;
+       nf->nf_cred = get_current_cred();
+       nf->nf_net = net;
+       nf->nf_flags = want_gc ?
+               BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING) |
BIT(NFSD_FILE_GC) :
+               BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING);
+       nf->nf_inode = inode;
+       refcount_set(&nf->nf_ref, 1);
+       nf->nf_may = need;
+       nf->nf_mark = NULL;
        return nf;
 }

Before above patch, this OOPS happen as below

Task A
                                  Task B

nfs_read
                                nfs_read
      nfsd_file_acquire
                               nfsd_file_acquire
          __set_bit(NFSD_FILE_HASHED, &nf->nf_flags)
             nf = nfsd_file_find_locked();

wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING);
            since rcu_assign_pointer has barrier in
hlist_add_head_rcu, but before smp_store_release in which
            two __set_bit can cross update rcu hlist head pointer
(first), so Task B wait_on_bit didn't wait, just go below:

                                                 *pnf = nf;

                                        file = nf->nf_file

                                        file->f_op->splice_read  *OOPS
here in nfs_read!

I also found similar issue at here https://lkml.org/lkml/2023/1/4/880

Since this  patch can fix this issue, I discard my commit
Very glad to receive your reply.
Thanks a lot!

Regards,
Haodong Wong

> What kernel are you targeting with your patch?  It doesn't apply to
> mainline, but looks like it might to 6.0.  The oops message is from
> 5.10.104.  Maybe that is where you want a fix?
>
The target is before kernel 6.1

> I assume you want this fix to go to a -stable kernel?  It would be good
> to identify which upstream patch fixed the problem, then either backport
> that, or explain why something simpler is needed.
>

> >
> > Task A:                         Task B:
> >
> > nfsd_file_acquire:
> >
> >                           new = nfsd_file_alloc()
> >                           open_file:
> >                           refcount_inc(&nf->nf_ref);
>
> The key step in Task A is
>         hlist_add_head_rcu(&nf->nf_node,
>                  &nfsd_file_hashtbl[hashval].nfb_head);
>
> Until that completes, nfsd_file_find_locked() cannot find the new file.
> hlist_add_head_rcu() uses "rcu_assign_pointer()" which should include
> all the barriers needed.
>
> >                                  nf = nfsd_file_find_locked();
> >                                  wait_for_construction:
> >
> >                                  since nf_flags is zero it will not wait
> >
> >                                  wait_on_bit(&nf->nf_flags,
> >                                                     NFSD_FILE_PENDING);
> >
> >                                 if (status == nfs_ok) {
> >                                      *pnf = nf;      //OOPS happen!
>
> The oops message suggests that after nfsd_read() calls
> nfsd_file_acquire() there is no error, but nf is NULL.
> So the  nf->nf_file access causes the oops.  nf_file is at offset
> 0x28, which is the virtual address mentioned in the oops.
>
> So do you think 'nf' is NULL at this point where you write "OOPS
> happen!" ??

Sorry, something missing, it oops happened in nfs_read after
nfsd_file_acquire *pnf = nf, because it copy nf_file is still NULL as
above

> I cannot see how that might happen even if wait_on_bit() didn't
> actually wait.
>
> Maybe if you could explain if a bit more detail what you think is
> happening.  What exactly is NULL which causes the OOPS, and how exactly
> does it end up being NULL.
> I cannot see what might be the cause, but the oops makes it clear that
> it did happen.
>
> Also is this a pure 5.10.104 kernel, or are there other patches on it?
>
It is applied with PREEMPT_RT Patch

> Thanks,
> NeilBrown

>
>
>
> >
> > Unable to handle kernel NULL pointer at virtual address 0000000000000028
> > Mem abort info:
> >   ESR = 0x96000004
> >   EC = 0x25: DABT (current EL), IL = 32 bits
> >   SET = 0, FnV = 0
> >   EA = 0, S1PTW = 0
> > Data abort info:
> >   ISV = 0, ISS = 0x00000004
> >   CM = 0, WnR = 0
> > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000
> > [0000000000000028] pgd=0000000000000000, p4d=0000000000000000
> > Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP
> > CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1
> > pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--)
> > pc : nfsd_read+0x78/0x280 [nfsd]
> > lr : nfsd_read+0x68/0x280 [nfsd]
> > sp : ffff80001c0b3c70
> > x29: ffff80001c0b3c70 x28: 0000000000000000
> > x27: 0000000000000002 x26: ffff0000c8a3ca70
> > x25: ffff0000c8a45180 x24: 0000000000002000
> > x23: ffff0000c8a45178 x22: ffff0000c8a45008
> > x21: ffff0000c31aac40 x20: ffff0000c8a3c000
> > x19: 0000000000000000 x18: 0000000000000001
> > x17: 0000000000000007 x16: 00000000b35db681
> > x15: 0000000000000156 x14: ffff0000c3f91300
> > x13: 0000000000000000 x12: 0000000000000000
> > x11: 0000000000000000 x10: 0000000000000000
> > x9 : 0000000000000000 x8 : ffff000118014a80
> > x7 : 0000000000000002 x6 : ffff0002559142dc
> > x5 : ffff0000c31aac40 x4 : 0000000000000004
> > x3 : 0000000000000001 x2 : 0000000000000000
> > x1 : 0000000000000001 x0 : ffff000255914280
> > Call trace:
> >  nfsd_read+0x78/0x280 [nfsd]
> >  nfsd3_proc_read+0x98/0xc0 [nfsd]
> >  nfsd_dispatch+0xc8/0x160 [nfsd]
> >  svc_process_common+0x440/0x790
> >  svc_process+0xb0/0xd0
> >  nfsd+0xfc/0x160 [nfsd]
> >  kthread+0x17c/0x1a0
> >  ret_from_fork+0x10/0x18
> >
> > Signed-off-by: Haodong Wong <haydenw.kernel@xxxxxxxxx>
> > ---
> >  fs/nfsd/filecache.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index e30e1ddc1ace..ba980369e6b4 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -974,8 +974,12 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >       nfsd_file_slab_free(&new->nf_rcu);
> >
> >  wait_for_construction:
> > +     /* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */
> > +     smp_rmb();
> >       wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE);
> >
> > +     /* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */
> > +     smp_mb__after_atomic();
> >       /* Did construction of this file fail? */
> >       if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> >               if (!retry) {
> > @@ -1018,8 +1022,11 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >       nf = new;
> >       /* Take reference for the hashtable */
> >       refcount_inc(&nf->nf_ref);
> > -     __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> >       __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> > +     /* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */
> > +     smp_wmb();
> > +     __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> > +
> >       list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> >       hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
> >       ++nfsd_file_hashtbl[hashval].nfb_count;
> > --
> > 2.25.1
> >
> >
>