Re: [PATCH v15 00/26] nfs/nfsd: add support for LOCALIO

"NeilBrown" <neilb@xxxxxxx> · Fri, 13 Sep 2024 09:31:24 +1000

On Thu, 12 Sep 2024, Chuck Lever III wrote:
> 
> 
> > On Sep 10, 2024, at 8:43 PM, NeilBrown <neilb@xxxxxxx> wrote:
> > 
> > On Sat, 07 Sep 2024, Anna Schumaker wrote:
> >> Hi Mike,
> >> 
> >> On 8/31/24 6:37 PM, Mike Snitzer wrote:
> >>> Hi,
> >>> 
> >>> Happy Labor Day weekend (US holiday on Monday)!  Seems apropos to send
> >>> what I hope the final LOCALIO patchset this weekend: its my birthday
> >>> this coming Tuesday, so _if_ LOCALIO were to get merged for 6.12
> >>> inclusion sometime next week: best b-day gift in a while! ;)
> >>> 
> >>> Anyway, I've been busy incorporating all the review feedback from v14
> >>> _and_ working closely with NeilBrown to address some lingering net-ns
> >>> refcounting and nfsd modules refcounting issues, and more (Chnagelog
> >>> below):
> >>> 
> >> 
> >> I've been running tests on localio this afternoon after finishing up going through v15 of the patches (I was most of the way through when you posted v16, so I haven't updated yet!). Cthon tests passed on all NFS versions, and xfstests passed on NFS v4.x. However, I saw this crash from xfstests with NFS v3:
> >> 
> >> [ 1502.440896] run fstests generic/633 at 2024-09-06 14:04:17
> >> [ 1502.694356] process 'vfstest' launched '/dev/fd/4/file1' with NULL argv: empty string added
> >> [ 1502.699514] Oops: general protection fault, probably for non-canonical address 0x6c616e69665f6140: 0000 [#1] PREEMPT SMP NOPTI
> >> [ 1502.700970] CPU: 3 UID: 0 PID: 513 Comm: nfsd Not tainted 6.11.0-rc6-g0c79a48cd64d-dirty+ #42323 70d41673e6cbf8e3437eb227e0a9c3c46ed3b289
> >> [ 1502.702506] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
> >> [ 1502.703593] RIP: 0010:nfsd_cache_lookup+0x2b3/0x840 [nfsd]
> >> [ 1502.704474] Code: 8d bb 30 02 00 00 bb 01 00 00 00 eb 12 49 8d 46 10 48 8b 08 ff c3 48 85 c9 0f 84 9c 00 00 00 49 89 ce 4c 8d 61 c8 41 8b 45 00 <3b> 41 c8 75 1f 41 8b 45 04 41 3b 46 cc 74 15 8b 15 2c c6 b8 f2 be
> >> [ 1502.706931] RSP: 0018:ffffc27ac0a2fd18 EFLAGS: 00010206
> >> [ 1502.707547] RAX: 00000000b95691f7 RBX: 0000000000000002 RCX: 6c616e69665f6178
> > 
> > This doesn't look like code anywhere near the changes that LOCALIO
> > makes.
> > 
> > I dug around and the faulting instruction is 
> >   cmp    -0x38(%rcx),%eax 
> > 
> > The -0x38 points to nfsd_cache_insert().  -0x38 is the index back
> > from the rbnode pointer to c_key.k_xid.  So the rbtree is corrupt.
> > %rcx is 6c616e69665f6178 which is "xa_final".  So that rbtree node has
> > been over-written or freed and re-used.
> > 
> > It looks like
> > 
> > Commit add1511c3816 ("NFSD: Streamline the rare "found" case")
> > 
> > moved a call to nfsd_reply_cache_free_locked() that was inside a region
> > locked with ->cache_lock out of that region.
> 
> My reading of the current code is that cache_lock is held
> during the nfsd_reply_cache_free_locked() call.
> 
> add1511c3816 simply moved the call site from before a "goto"
> to after the label it branches to. What am I missing?

Yes, I let myself get confused by the gotos.  As you say that patch
didn't move the call out of the locked region.  Sorry.

I can't see anything wrong with the locking or tree management in
nfscache.c, yet this Oops looks a lot like a corrupted rbtree.
It *could* be something external stomping on the memory but I think
that is unlikely.  I'd rather have a more direct explanation....  Not
today though it seems.

NeilBrown