Re: [PATCH 0/8] NFSD: clean up locking.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2022-07-15 at 12:36 +1000, NeilBrown wrote:
> On Thu, 14 Jul 2022, Chuck Lever III wrote:
> > 
> > > On Jul 13, 2022, at 12:32 AM, NeilBrown <neilb@xxxxxxx> wrote:
> > > 
> > > On Wed, 13 Jul 2022, Chuck Lever III wrote:
> > > > 
> > > > > On Jul 11, 2022, at 10:33 PM, NeilBrown <neilb@xxxxxxx> wrote:
> > > > > 
> > > > > On Thu, 07 Jul 2022, Chuck Lever III wrote:
> > > > > > 
> > > > > > > On Jul 6, 2022, at 12:18 AM, NeilBrown <neilb@xxxxxxx> wrote:
> > > > > > > 
> > > > > > > This series prepares NFSD to be able to adjust to work with a proposed
> > > > > > > patch which allows updates to directories to happen in parallel.
> > > > > > > This patch set changes the way directories are locked, so the present
> > > > > > > series cleans up some locking in nfsd.
> > > > > > > 
> > > > > > > Specifically we remove fh_lock() and fh_unlock().
> > > > > > > These functions are problematic for a few reasons.
> > > > > > > - they are deliberately idempotent - setting or clearing a flag
> > > > > > > so that a second call does nothing. This makes locking errors harder,
> > > > > > > but it results in code that looks wrong ... and maybe sometimes is a
> > > > > > > little bit wrong.
> > > > > > > Early patches clean up several places where this idempotent nature of
> > > > > > > the functions is depended on, and so makes the code clearer.
> > > > > > > 
> > > > > > > - they transparently call fh_fill_pre/post_attrs(), including at times
> > > > > > > when this is not necessary. Having the calls only when necessary is
> > > > > > > marginally more efficient, and arguably makes the code clearer.
> > > > > > > 
> > > > > > > nfsd_lookup() currently always locks the directory, though often no lock
> > > > > > > is needed. So a patch in this series reduces this locking.
> > > > > > > 
> > > > > > > There is an awkward case that could still be further improved.
> > > > > > > NFSv4 open needs to ensure the file is not renamed (or unlinked etc)
> > > > > > > between the moment when the open succeeds, and a later moment when a
> > > > > > > "lease" is attached to support a delegation. The handling of this lock
> > > > > > > is currently untidy, particularly when creating a file.
> > > > > > > It would probably be better to take a lease immediately after
> > > > > > > opening the file, and then discarding if after deciding not to provide a
> > > > > > > delegation.
> > > > > > > 
> > > > > > > I have run fstests and cthon tests on this, but I wouldn't be surprised
> > > > > > > if there is a corner case that I've missed.
> > > > > > 
> > > > > > Hi Neil, thanks for (re)posting.
> > > > > > 
> > > > > > Let me make a few general comments here before I send out specific
> > > > > > review nits.
> > > > > > 
> > > > > > I'm concerned mostly with how this series can be adequately tested.
> > > > > > The two particular areas I'm worried about:
> > > > > > 
> > > > > > - There are some changes to NFSv2 code, which is effectively
> > > > > > fallow. I think I can run v2 tests, once we decide what tests
> > > > > > should be run.
> > > > > 
> > > > > I hope we can still test v2... I know it is disabled by default..
> > > > > If we can't test it, we should consider removing it.
> > > > 
> > > > The work of deprecating and removing NFSv2 is already under way.
> > > > I think what I'm asking is if /you/ have tested the NFSv2 changes.
> > > 
> > > That's a question I can answer. I haven't. I will.
> > 
> > Just in case it's not clear by now, NFSv2 (and NFSv3)
> > stability is the reason I don't want to perturb the
> > NFSD VFS API code in any significant way unless it's
> > absolutely needed.
> 
> "absolutely" seems like a rather high bar.  It makes the goal of
> "stability" seem more like "ossification".
> Certainly we don't want to break things.  Equally certainly we will.
> I don't think "hold back useful changes out of fear" is the path to
> success.  Testing, review, and responding to bug reports is what we find
> works best - and what we generally do.
> 
> And I wouldn't class NFSv3 at all with NFSv2 (not even in parentheses).
> NFSv2 has very few (if any) users in the broad community, so bugs are
> likely to go unnoticed for extended periods.  NFSv3 is, I believe, still
> widely used, though not as widely as v4.  There are use-cases where I
> would recommend v3.
> 
> e.g.  a case could certainly be made to not extend my directory-locking
> changes to v2-specific code, but v3 should get them.
> 
> > 
> > 
> > > > > > Secondarily, the series adds more bells and whistles to the generic
> > > > > > NFSD VFS APIs on behalf of NFSv4-specific requirements. In particular:
> > > > > > 
> > > > > > - ("NFSD: change nfsd_create() to unlock directory before returning.")
> > > > > > makes some big changes to nfsd_create(). But that helper itself
> > > > > > is pretty small. Seems like cleaner code would result if NFSv4
> > > > > > had its own version of nfsd_create() to deal with the post-change
> > > > > > cases.
> > > > > 
> > > > > I would not like that approach. Duplicating code is rarely a good idea.
> > > > 
> > > > De-duplicating code /can/ be a good idea, but isn't always a good
> > > > idea. If the exceptional cases add a lot of logic, that can make the
> > > > de-duplicated code difficult to read and reason about, and it can
> > > > make it brittle, just as it does in this case. Modifications on
> > > > behalf of NFSv4 in this common piece of code is possibly hazardous
> > > > to NFSv3, and navigating around the exception logic makes it
> > > > difficult to understand and review.
> > > 
> > > Are we looking at the same code?
> > > The sum total of extra code needed for v4 is:
> > > - two extra parameters:
> > > 	 void (*post_create)(struct svc_fh *fh, void *data),
> > > 	 void *data)
> > > - two lines of code:
> > > 	if (!err && post_create)
> > > 		post_create(resfhp, data);
> > > 
> > > does that really make anything hard to follow?
> > 
> > The synopsis of nfsd_create() is already pretty cluttered.
> 
> Would it help to collect related details (name, type, attributes, acl,
> label) into a struct and pass a reference to that around?
> One awkwardness in my latest patch is that the acl and label are not set
> in nfsd_create_setattr().  If they were passed around together with the
> attributes, that would be a lot easier to achieve.
> 
> > 
> > You're adding that extra code in nfsd_symlink() too IIRC? And,
> > this change adds a virtual function call (which has significant
> > overhead these days) for reasons of convenience rather than
> > necessity.
> 
> "significant"?  In context of creation a file, I don't think one more
> virtual function call is all that significant?
> 
> I started writing code to avoid the function pointer, but the function
> args list exploded.  If you could be happy with e.g.  a struct
> nfs_create_args which contains attrs, acl, label, and was passed into
> nfsd_create_setattr(), then I think that would cleanly avoid the desire
> for a function pointer.
> 
> > 
> > 
> > > > IMO code duplication is not an appropriate design pattern in this
> > > > specific instance.
> > > 
> > > I'm guessing there is a missing negation in there.
> > 
> > Yes, thanks.
> > 
> > 
> > > > > Maybe, rather than passing a function and void* to nfsd_create(), we
> > > > > could pass an acl and a label and do the setting in vfs.c rather then
> > > > > nfs4proc.c. The difficult part of that approach is getting back the
> > > > > individual error statuses. That should be solvable though.
> > > > 
> > > > The bulk of the work in nfsd_create() is done by lookup_one_len()
> > > > and nfsd_create_locked(), both of which are public APIs. The rest
> > > > of nfsd_create() is code that appears in several other places
> > > > already.
> > > 
> > > "several" == 1. The only other call site for nfsd_create_locked() is in
> > > nfsd_proc_create()
> > 
> > But there are 15 call sites under fs/nfsd/ for lookup_one_len().
> > It's a pretty common trope.
> > 
> > 
> > > I would say that the "bulk" of the work is the interplay between
> > > locking, error checking, and these two functions you mention.
> > 
> > You're looking at the details of your particular change, and
> > I'm concerned about how much technical debt is going to
> > continue to accrue in an area shared with legacy protocol
> > implementation.
> > 
> > I'm still asking myself if I can live with the extra parameters
> > and virtual function call. Maybe. IMO, there are three ways
> > forward:
> > 
> > 1. I can merge your patch and move on.
> > 2. I can merge your patch as it is and follow up with clean-up.
> > 3. I can drop your patch and write it the way I prefer.
> > 
> > 
> > > > > > - ("NFSD: reduce locking in nfsd_lookup()") has a similar issue:
> > > > > > nfsd_lookup() is being changed such that its semantics are
> > > > > > substantially different for NFSv4 than for others. This is
> > > > > > possibly an indication that nfsd_lookup() should also be
> > > > > > duplicated into the NFSv4-specific code and the generic VFS
> > > > > > version should be left alone.
> > > > > 
> > > > > Again, I don't like duplication. In this case, I think the longer term
> > > > > solution is to remove the NFSv4 specific locking differences and solve
> > > > > the problem differently. i.e. don't hold the inode locked, but check
> > > > > for any possible rename after getting a lease. Once that is done,
> > > > > nfsd_lookup() can have saner semantics.
> > > > 
> > > > Then, perhaps we should bite that bullet and do that work now.
> > > 
> > > While this does have an appeal, it also looks like the start of a rabbit
> > > hole where I find have to fix up a growing collection of problems before
> > > my patch can land.
> > 
> > That's kind of the nature of the beast.
> > 
> > You are requesting changes that add technical debt with the
> > promise that "sometime in the future" that debt will be
> > erased. Meanwhile, nfsd_lookup() will be made harder to
> > maintain, and it will continue to contain a real bug.
> > 
> > I'm saying if there's a real bug here, addressing that should
> > be the priority.
> > 
> > 
> > > I think balance is needed - certainly asking for some preliminary
> > > cleanup is appropriate. Expecting long standing and subtle issues that
> > > are tangential to the main goal to be resolved first is possibly asking
> > > too much.
> > 
> > Fixing the bug seems to me (perhaps naively) to be directly
> > related to the parallelism of directory operations. Why do
> > you think it is tangential?
> > 
> 
> The bug was exposed by the analysis required for the changes I want, but
> I think that is as close as the connection goes.
> To really see if it is tangential, we would need to come up with a
> solution and see how easily it is ported across my patches.
> 
> As I said in a reply to Jeff, I don't think locking of the parent
> directory should be part of the solution.  After getting a lease, we
> repeat the lookup and see if dentry has changed.  After my patches, that
> would mean calling lookup_one_len_unlocked().  Before my patches, it
> would mean calling fh_unlock() earlier to ensure the directory is no
> longer locked but still calling lookup_one_len_unlocked()
> 
> 
> > Asking for bugs to be addressed before new features go in
> > seems sensible to me, and is a common practice to enable the
> > resulting fixes to be backported. If you don't want to address
> > the bug you mentioned, then someone else will need to, and
> > that's fine. I think the priority should be bug fixes first.
> 
> As a general principle I would agree.
> In this case the bug is minor, long standing, and tangential.
> And the series imposes enough review burden as it is.
> 
> But if Jeff can produce a fix, either to be applied before or after,
> then that would be an excellent solution.
> 
> Thanks,
> NeilBrown
> 
> 


FWIW, I hit this while running fstests against the server with some
draft changes. This crash is not in an area I touched, so I suspect
something in Neil's locking cleanup:

[  434.257242] ------------[ cut here ]------------
[  434.257278] kernel BUG at fs/nfsd/xdr4.h:752!
[  434.257755] invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[  434.257937] CPU: 2 PID: 1202 Comm: nfsd Kdump: loaded Tainted: G           OE     5.19.0-rc5+ #316
[  434.258179] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.0-3.el9 04/01/2014
[  434.258371] RIP: 0010:nfsd4_open+0x1940/0x1a30 [nfsd]
[  434.258615] Code: 40 e8 54 04 e2 e6 48 8b 7c 24 40 41 89 c4 e8 57 7e d4 e6 41 f7 d4 4c 8b 44 24 60 66 44 21 63 44 48 8b 54 24 40 e9 46 fc ff ff <0f> 0b 48 8d bd 88 00 00 00 e8 72 80 d4 e6 4c 8b ad 88 00 00 00 49
[  434.259053] RSP: 0018:ffff888134897c10 EFLAGS: 00010246
[  434.259211] RAX: 0000000000000000 RBX: ffff8881348791a0 RCX: ffffffffc07fbb44
[  434.259408] RDX: 1ffff1102691001a RSI: 0000000000000002 RDI: ffff8881348800d1
[  434.259606] RBP: ffff888134880030 R08: 0000000000000000 R09: 0000000000000001
[  434.259802] R10: fffffbfff542dfac R11: 0000000000000001 R12: 0000000000000000
[  434.260000] R13: ffff888133ab8000 R14: 0000000000000000 R15: ffff8881165db400
[  434.260195] FS:  0000000000000000(0000) GS:ffff888225500000(0000) knlGS:0000000000000000
[  434.260466] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  434.260697] CR2: 0000558890b5d178 CR3: 0000000107844000 CR4: 00000000003506e0
[  434.260949] Call Trace:
[  434.261106]  <TASK>
[  434.261260]  ? nfsd4_decode_notsupp+0x10/0x10 [nfsd]
[  434.261549]  ? nfsd4_interssc_connect+0x8e0/0x8e0 [nfsd]
[  434.261858]  ? preempt_count_sub+0x14/0xc0
[  434.262054]  ? percpu_counter_add_batch+0x60/0xd0
[  434.262261]  nfsd4_proc_compound+0x8c6/0xe90 [nfsd]
[  434.262553]  nfsd_dispatch+0x2a9/0x5c0 [nfsd]
[  434.262836]  svc_process_common+0x6ab/0xc30 [sunrpc]
[  434.263162]  ? svc_create_pooled+0x390/0x390 [sunrpc]
[  434.263484]  ? nfsd_svc+0x830/0x830 [nfsd]
[  434.263755]  ? svc_recv+0xabf/0x1310 [sunrpc]
[  434.264074]  svc_process+0x1a3/0x200 [sunrpc]
[  434.264382]  nfsd+0x1d7/0x390 [nfsd]
[  434.264640]  ? nfsd_shutdown_threads+0x200/0x200 [nfsd]
[  434.264926]  kthread+0x167/0x1a0
[  434.265101]  ? kthread_complete_and_exit+0x20/0x20
[  434.265307]  ret_from_fork+0x22/0x30
[  434.265494]  </TASK>
[  434.265652] Modules linked in: rpcsec_gss_krb5(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) rfkill(E) nf_tables(E) nfnetlink(E) iTCO_wdt(E) intel_rapl_msr(E) intel_pmc_bxt(E) iTCO_vendor_support(E) joydev(E) intel_rapl_common(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) lpc_ich(E) nfsd(OE) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) drm(E) fuse(E) sunrpc(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) virtio_blk(E) virtio_console(E) virtio_net(E) net_failover(E) failover(E) qemu_fw_cfg(E)
[  434.267660] ---[ end trace 0000000000000000 ]---
[  434.267870] RIP: 0010:nfsd4_open+0x1940/0x1a30 [nfsd]
[  434.268161] Code: 40 e8 54 04 e2 e6 48 8b 7c 24 40 41 89 c4 e8 57 7e d4 e6 41 f7 d4 4c 8b 44 24 60 66 44 21 63 44 48 8b 54 24 40 e9 46 fc ff ff <0f> 0b 48 8d bd 88 00 00 00 e8 72 80 d4 e6 4c 8b ad 88 00 00 00 49
[  434.268771] RSP: 0018:ffff888134897c10 EFLAGS: 00010246
[  434.268995] RAX: 0000000000000000 RBX: ffff8881348791a0 RCX: ffffffffc07fbb44
[  434.269247] RDX: 1ffff1102691001a RSI: 0000000000000002 RDI: ffff8881348800d1
[  434.269511] RBP: ffff888134880030 R08: 0000000000000000 R09: 0000000000000001
[  434.269775] R10: fffffbfff542dfac R11: 0000000000000001 R12: 0000000000000000
[  434.270030] R13: ffff888133ab8000 R14: 0000000000000000 R15: ffff8881165db400
[  434.270288] FS:  0000000000000000(0000) GS:ffff888225500000(0000) knlGS:0000000000000000
[  434.270632] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  434.270862] CR2: 0000558890b5d178 CR3: 0000000107844000 CR4: 00000000003506e0

faddr2line says:

$ ./scripts/faddr2line --list fs/nfsd/nfsd.ko nfsd4_open+0x1940
nfsd4_open+0x1940/0x1a30:

set_change_info at /home/jlayton/git/linux/fs/nfsd/xdr4.h:752
 747 	#define NFS4_SVC_XDRSIZE		sizeof(struct nfsd4_compoundargs)
 748 	
 749 	static inline void
 750 	set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
 751 	{
>752<		BUG_ON(!fhp->fh_pre_saved);
 753 		cinfo->atomic = (u32)(fhp->fh_post_saved && !fhp->fh_no_atomic_attr);
 754 	
 755 		cinfo->before_change = fhp->fh_pre_change;
 756 		cinfo->after_change = fhp->fh_post_change;
 757 	}

(inlined by) do_open_lookup at /home/jlayton/git/linux/fs/nfsd/nfs4proc.c:502
 497 		accmode = NFSD_MAY_NOP;
 498 		if (open->op_created ||
 499 				open->op_claim_type == NFS4_OPEN_CLAIM_DELEGATE_CUR)
 500 			accmode |= NFSD_MAY_OWNER_OVERRIDE;
 501 		status = do_open_permission(rqstp, *resfh, open, accmode);
>502<		set_change_info(&open->op_cinfo, current_fh);
 503 	out:
 504 		if (status)
 505 			inode_unlock(current_fh->fh_dentry->d_inode);
 506 		return status;
 507 	}

(inlined by) nfsd4_open at /home/jlayton/git/linux/fs/nfsd/nfs4proc.c:625
 620 			goto out;
 621 	
 622 		switch (open->op_claim_type) {
 623 		case NFS4_OPEN_CLAIM_DELEGATE_CUR:
 624 		case NFS4_OPEN_CLAIM_NULL:
>625<			status = do_open_lookup(rqstp, cstate, open, &resfh);
 626 			if (status)
 627 				goto out;
 628 			locked = true;
 629 			break;
 630 		case NFS4_OPEN_CLAIM_PREVIOUS:

I haven't yet dug down into the actual cause yet.




[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux