On Mon, Aug 21, 2023 at 8:27 AM NeilBrown <neilb@xxxxxxx> wrote: > > On Fri, 18 Aug 2023, Haodong Wong wrote: > > Before Kernel 6.1, we observed the following OOPS in the stress test > > caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING, > > and smp_mb__after_atomic() should be a paire. > > If you saw this "before kernel 6.1" does that mean you don't see it > after kernel 6.1? So has it already been fixed? > I just found after the patched c4c649ab413ba "NFSD: Convert filecache to rhltable" , this issue should not have . It was fixed as below: -nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may) +nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need, + bool want_gc) { struct nfsd_file *nf; nf = kmem_cache_alloc(nfsd_file_slab, GFP_KERNEL); - if (nf) { - INIT_LIST_HEAD(&nf->nf_lru); - nf->nf_birthtime = ktime_get(); - nf->nf_file = NULL; - nf->nf_cred = get_current_cred(); - nf->nf_net = key->net; - nf->nf_flags = 0; - __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); - __set_bit(NFSD_FILE_PENDING, &nf->nf_flags); - if (key->gc) - __set_bit(NFSD_FILE_GC, &nf->nf_flags); - nf->nf_inode = key->inode; - refcount_set(&nf->nf_ref, 1); - nf->nf_may = key->need; - nf->nf_mark = NULL; - } + if (unlikely(!nf)) + return NULL; + + INIT_LIST_HEAD(&nf->nf_lru); + nf->nf_birthtime = ktime_get(); + nf->nf_file = NULL; + nf->nf_cred = get_current_cred(); + nf->nf_net = net; + nf->nf_flags = want_gc ? + BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING) | BIT(NFSD_FILE_GC) : + BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING); + nf->nf_inode = inode; + refcount_set(&nf->nf_ref, 1); + nf->nf_may = need; + nf->nf_mark = NULL; return nf; } Before above patch, this OOPS happen as below Task A Task B nfs_read nfs_read nfsd_file_acquire nfsd_file_acquire __set_bit(NFSD_FILE_HASHED, &nf->nf_flags) nf = nfsd_file_find_locked(); wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING); since rcu_assign_pointer has barrier in hlist_add_head_rcu, but before smp_store_release in which two __set_bit can cross update rcu hlist head pointer (first), so Task B wait_on_bit didn't wait, just go below: *pnf = nf; file = nf->nf_file file->f_op->splice_read *OOPS here in nfs_read! I also found similar issue at here https://lkml.org/lkml/2023/1/4/880 Since this patch can fix this issue, I discard my commit Very glad to receive your reply. Thanks a lot! Regards, Haodong Wong > What kernel are you targeting with your patch? It doesn't apply to > mainline, but looks like it might to 6.0. The oops message is from > 5.10.104. Maybe that is where you want a fix? > The target is before kernel 6.1 > I assume you want this fix to go to a -stable kernel? It would be good > to identify which upstream patch fixed the problem, then either backport > that, or explain why something simpler is needed. > > > > > Task A: Task B: > > > > nfsd_file_acquire: > > > > new = nfsd_file_alloc() > > open_file: > > refcount_inc(&nf->nf_ref); > > The key step in Task A is > hlist_add_head_rcu(&nf->nf_node, > &nfsd_file_hashtbl[hashval].nfb_head); > > Until that completes, nfsd_file_find_locked() cannot find the new file. > hlist_add_head_rcu() uses "rcu_assign_pointer()" which should include > all the barriers needed. > > > nf = nfsd_file_find_locked(); > > wait_for_construction: > > > > since nf_flags is zero it will not wait > > > > wait_on_bit(&nf->nf_flags, > > NFSD_FILE_PENDING); > > > > if (status == nfs_ok) { > > *pnf = nf; //OOPS happen! > > The oops message suggests that after nfsd_read() calls > nfsd_file_acquire() there is no error, but nf is NULL. > So the nf->nf_file access causes the oops. nf_file is at offset > 0x28, which is the virtual address mentioned in the oops. > > So do you think 'nf' is NULL at this point where you write "OOPS > happen!" ?? Sorry, something missing, it oops happened in nfs_read after nfsd_file_acquire *pnf = nf, because it copy nf_file is still NULL as above > I cannot see how that might happen even if wait_on_bit() didn't > actually wait. > > Maybe if you could explain if a bit more detail what you think is > happening. What exactly is NULL which causes the OOPS, and how exactly > does it end up being NULL. > I cannot see what might be the cause, but the oops makes it clear that > it did happen. > > Also is this a pure 5.10.104 kernel, or are there other patches on it? > It is applied with PREEMPT_RT Patch > Thanks, > NeilBrown > > > > > > > Unable to handle kernel NULL pointer at virtual address 0000000000000028 > > Mem abort info: > > ESR = 0x96000004 > > EC = 0x25: DABT (current EL), IL = 32 bits > > SET = 0, FnV = 0 > > EA = 0, S1PTW = 0 > > Data abort info: > > ISV = 0, ISS = 0x00000004 > > CM = 0, WnR = 0 > > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000 > > [0000000000000028] pgd=0000000000000000, p4d=0000000000000000 > > Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP > > CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1 > > pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--) > > pc : nfsd_read+0x78/0x280 [nfsd] > > lr : nfsd_read+0x68/0x280 [nfsd] > > sp : ffff80001c0b3c70 > > x29: ffff80001c0b3c70 x28: 0000000000000000 > > x27: 0000000000000002 x26: ffff0000c8a3ca70 > > x25: ffff0000c8a45180 x24: 0000000000002000 > > x23: ffff0000c8a45178 x22: ffff0000c8a45008 > > x21: ffff0000c31aac40 x20: ffff0000c8a3c000 > > x19: 0000000000000000 x18: 0000000000000001 > > x17: 0000000000000007 x16: 00000000b35db681 > > x15: 0000000000000156 x14: ffff0000c3f91300 > > x13: 0000000000000000 x12: 0000000000000000 > > x11: 0000000000000000 x10: 0000000000000000 > > x9 : 0000000000000000 x8 : ffff000118014a80 > > x7 : 0000000000000002 x6 : ffff0002559142dc > > x5 : ffff0000c31aac40 x4 : 0000000000000004 > > x3 : 0000000000000001 x2 : 0000000000000000 > > x1 : 0000000000000001 x0 : ffff000255914280 > > Call trace: > > nfsd_read+0x78/0x280 [nfsd] > > nfsd3_proc_read+0x98/0xc0 [nfsd] > > nfsd_dispatch+0xc8/0x160 [nfsd] > > svc_process_common+0x440/0x790 > > svc_process+0xb0/0xd0 > > nfsd+0xfc/0x160 [nfsd] > > kthread+0x17c/0x1a0 > > ret_from_fork+0x10/0x18 > > > > Signed-off-by: Haodong Wong <haydenw.kernel@xxxxxxxxx> > > --- > > fs/nfsd/filecache.c | 9 ++++++++- > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index e30e1ddc1ace..ba980369e6b4 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -974,8 +974,12 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, > > nfsd_file_slab_free(&new->nf_rcu); > > > > wait_for_construction: > > + /* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */ > > + smp_rmb(); > > wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE); > > > > + /* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */ > > + smp_mb__after_atomic(); > > /* Did construction of this file fail? */ > > if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) { > > if (!retry) { > > @@ -1018,8 +1022,11 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, > > nf = new; > > /* Take reference for the hashtable */ > > refcount_inc(&nf->nf_ref); > > - __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); > > __set_bit(NFSD_FILE_PENDING, &nf->nf_flags); > > + /* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */ > > + smp_wmb(); > > + __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); > > + > > list_lru_add(&nfsd_file_lru, &nf->nf_lru); > > hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head); > > ++nfsd_file_hashtbl[hashval].nfb_count; > > -- > > 2.25.1 > > > > >