On Fri, 2022-07-15 at 12:36 +1000, NeilBrown wrote: > On Thu, 14 Jul 2022, Chuck Lever III wrote: > > > > > On Jul 13, 2022, at 12:32 AM, NeilBrown <neilb@xxxxxxx> wrote: > > > > > > On Wed, 13 Jul 2022, Chuck Lever III wrote: > > > > > > > > > On Jul 11, 2022, at 10:33 PM, NeilBrown <neilb@xxxxxxx> wrote: > > > > > > > > > > On Thu, 07 Jul 2022, Chuck Lever III wrote: > > > > > > > > > > > > > On Jul 6, 2022, at 12:18 AM, NeilBrown <neilb@xxxxxxx> wrote: > > > > > > > > > > > > > > This series prepares NFSD to be able to adjust to work with a proposed > > > > > > > patch which allows updates to directories to happen in parallel. > > > > > > > This patch set changes the way directories are locked, so the present > > > > > > > series cleans up some locking in nfsd. > > > > > > > > > > > > > > Specifically we remove fh_lock() and fh_unlock(). > > > > > > > These functions are problematic for a few reasons. > > > > > > > - they are deliberately idempotent - setting or clearing a flag > > > > > > > so that a second call does nothing. This makes locking errors harder, > > > > > > > but it results in code that looks wrong ... and maybe sometimes is a > > > > > > > little bit wrong. > > > > > > > Early patches clean up several places where this idempotent nature of > > > > > > > the functions is depended on, and so makes the code clearer. > > > > > > > > > > > > > > - they transparently call fh_fill_pre/post_attrs(), including at times > > > > > > > when this is not necessary. Having the calls only when necessary is > > > > > > > marginally more efficient, and arguably makes the code clearer. > > > > > > > > > > > > > > nfsd_lookup() currently always locks the directory, though often no lock > > > > > > > is needed. So a patch in this series reduces this locking. > > > > > > > > > > > > > > There is an awkward case that could still be further improved. > > > > > > > NFSv4 open needs to ensure the file is not renamed (or unlinked etc) > > > > > > > between the moment when the open succeeds, and a later moment when a > > > > > > > "lease" is attached to support a delegation. The handling of this lock > > > > > > > is currently untidy, particularly when creating a file. > > > > > > > It would probably be better to take a lease immediately after > > > > > > > opening the file, and then discarding if after deciding not to provide a > > > > > > > delegation. > > > > > > > > > > > > > > I have run fstests and cthon tests on this, but I wouldn't be surprised > > > > > > > if there is a corner case that I've missed. > > > > > > > > > > > > Hi Neil, thanks for (re)posting. > > > > > > > > > > > > Let me make a few general comments here before I send out specific > > > > > > review nits. > > > > > > > > > > > > I'm concerned mostly with how this series can be adequately tested. > > > > > > The two particular areas I'm worried about: > > > > > > > > > > > > - There are some changes to NFSv2 code, which is effectively > > > > > > fallow. I think I can run v2 tests, once we decide what tests > > > > > > should be run. > > > > > > > > > > I hope we can still test v2... I know it is disabled by default.. > > > > > If we can't test it, we should consider removing it. > > > > > > > > The work of deprecating and removing NFSv2 is already under way. > > > > I think what I'm asking is if /you/ have tested the NFSv2 changes. > > > > > > That's a question I can answer. I haven't. I will. > > > > Just in case it's not clear by now, NFSv2 (and NFSv3) > > stability is the reason I don't want to perturb the > > NFSD VFS API code in any significant way unless it's > > absolutely needed. > > "absolutely" seems like a rather high bar. It makes the goal of > "stability" seem more like "ossification". > Certainly we don't want to break things. Equally certainly we will. > I don't think "hold back useful changes out of fear" is the path to > success. Testing, review, and responding to bug reports is what we find > works best - and what we generally do. > > And I wouldn't class NFSv3 at all with NFSv2 (not even in parentheses). > NFSv2 has very few (if any) users in the broad community, so bugs are > likely to go unnoticed for extended periods. NFSv3 is, I believe, still > widely used, though not as widely as v4. There are use-cases where I > would recommend v3. > > e.g. a case could certainly be made to not extend my directory-locking > changes to v2-specific code, but v3 should get them. > > > > > > > > > > > Secondarily, the series adds more bells and whistles to the generic > > > > > > NFSD VFS APIs on behalf of NFSv4-specific requirements. In particular: > > > > > > > > > > > > - ("NFSD: change nfsd_create() to unlock directory before returning.") > > > > > > makes some big changes to nfsd_create(). But that helper itself > > > > > > is pretty small. Seems like cleaner code would result if NFSv4 > > > > > > had its own version of nfsd_create() to deal with the post-change > > > > > > cases. > > > > > > > > > > I would not like that approach. Duplicating code is rarely a good idea. > > > > > > > > De-duplicating code /can/ be a good idea, but isn't always a good > > > > idea. If the exceptional cases add a lot of logic, that can make the > > > > de-duplicated code difficult to read and reason about, and it can > > > > make it brittle, just as it does in this case. Modifications on > > > > behalf of NFSv4 in this common piece of code is possibly hazardous > > > > to NFSv3, and navigating around the exception logic makes it > > > > difficult to understand and review. > > > > > > Are we looking at the same code? > > > The sum total of extra code needed for v4 is: > > > - two extra parameters: > > > void (*post_create)(struct svc_fh *fh, void *data), > > > void *data) > > > - two lines of code: > > > if (!err && post_create) > > > post_create(resfhp, data); > > > > > > does that really make anything hard to follow? > > > > The synopsis of nfsd_create() is already pretty cluttered. > > Would it help to collect related details (name, type, attributes, acl, > label) into a struct and pass a reference to that around? > One awkwardness in my latest patch is that the acl and label are not set > in nfsd_create_setattr(). If they were passed around together with the > attributes, that would be a lot easier to achieve. > > > > > You're adding that extra code in nfsd_symlink() too IIRC? And, > > this change adds a virtual function call (which has significant > > overhead these days) for reasons of convenience rather than > > necessity. > > "significant"? In context of creation a file, I don't think one more > virtual function call is all that significant? > > I started writing code to avoid the function pointer, but the function > args list exploded. If you could be happy with e.g. a struct > nfs_create_args which contains attrs, acl, label, and was passed into > nfsd_create_setattr(), then I think that would cleanly avoid the desire > for a function pointer. > > > > > > > > > IMO code duplication is not an appropriate design pattern in this > > > > specific instance. > > > > > > I'm guessing there is a missing negation in there. > > > > Yes, thanks. > > > > > > > > > Maybe, rather than passing a function and void* to nfsd_create(), we > > > > > could pass an acl and a label and do the setting in vfs.c rather then > > > > > nfs4proc.c. The difficult part of that approach is getting back the > > > > > individual error statuses. That should be solvable though. > > > > > > > > The bulk of the work in nfsd_create() is done by lookup_one_len() > > > > and nfsd_create_locked(), both of which are public APIs. The rest > > > > of nfsd_create() is code that appears in several other places > > > > already. > > > > > > "several" == 1. The only other call site for nfsd_create_locked() is in > > > nfsd_proc_create() > > > > But there are 15 call sites under fs/nfsd/ for lookup_one_len(). > > It's a pretty common trope. > > > > > > > I would say that the "bulk" of the work is the interplay between > > > locking, error checking, and these two functions you mention. > > > > You're looking at the details of your particular change, and > > I'm concerned about how much technical debt is going to > > continue to accrue in an area shared with legacy protocol > > implementation. > > > > I'm still asking myself if I can live with the extra parameters > > and virtual function call. Maybe. IMO, there are three ways > > forward: > > > > 1. I can merge your patch and move on. > > 2. I can merge your patch as it is and follow up with clean-up. > > 3. I can drop your patch and write it the way I prefer. > > > > > > > > > > - ("NFSD: reduce locking in nfsd_lookup()") has a similar issue: > > > > > > nfsd_lookup() is being changed such that its semantics are > > > > > > substantially different for NFSv4 than for others. This is > > > > > > possibly an indication that nfsd_lookup() should also be > > > > > > duplicated into the NFSv4-specific code and the generic VFS > > > > > > version should be left alone. > > > > > > > > > > Again, I don't like duplication. In this case, I think the longer term > > > > > solution is to remove the NFSv4 specific locking differences and solve > > > > > the problem differently. i.e. don't hold the inode locked, but check > > > > > for any possible rename after getting a lease. Once that is done, > > > > > nfsd_lookup() can have saner semantics. > > > > > > > > Then, perhaps we should bite that bullet and do that work now. > > > > > > While this does have an appeal, it also looks like the start of a rabbit > > > hole where I find have to fix up a growing collection of problems before > > > my patch can land. > > > > That's kind of the nature of the beast. > > > > You are requesting changes that add technical debt with the > > promise that "sometime in the future" that debt will be > > erased. Meanwhile, nfsd_lookup() will be made harder to > > maintain, and it will continue to contain a real bug. > > > > I'm saying if there's a real bug here, addressing that should > > be the priority. > > > > > > > I think balance is needed - certainly asking for some preliminary > > > cleanup is appropriate. Expecting long standing and subtle issues that > > > are tangential to the main goal to be resolved first is possibly asking > > > too much. > > > > Fixing the bug seems to me (perhaps naively) to be directly > > related to the parallelism of directory operations. Why do > > you think it is tangential? > > > > The bug was exposed by the analysis required for the changes I want, but > I think that is as close as the connection goes. > To really see if it is tangential, we would need to come up with a > solution and see how easily it is ported across my patches. > > As I said in a reply to Jeff, I don't think locking of the parent > directory should be part of the solution. After getting a lease, we > repeat the lookup and see if dentry has changed. After my patches, that > would mean calling lookup_one_len_unlocked(). Before my patches, it > would mean calling fh_unlock() earlier to ensure the directory is no > longer locked but still calling lookup_one_len_unlocked() > > > > Asking for bugs to be addressed before new features go in > > seems sensible to me, and is a common practice to enable the > > resulting fixes to be backported. If you don't want to address > > the bug you mentioned, then someone else will need to, and > > that's fine. I think the priority should be bug fixes first. > > As a general principle I would agree. > In this case the bug is minor, long standing, and tangential. > And the series imposes enough review burden as it is. > > But if Jeff can produce a fix, either to be applied before or after, > then that would be an excellent solution. > > Thanks, > NeilBrown > > FWIW, I hit this while running fstests against the server with some draft changes. This crash is not in an area I touched, so I suspect something in Neil's locking cleanup: [ 434.257242] ------------[ cut here ]------------ [ 434.257278] kernel BUG at fs/nfsd/xdr4.h:752! [ 434.257755] invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 434.257937] CPU: 2 PID: 1202 Comm: nfsd Kdump: loaded Tainted: G OE 5.19.0-rc5+ #316 [ 434.258179] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.0-3.el9 04/01/2014 [ 434.258371] RIP: 0010:nfsd4_open+0x1940/0x1a30 [nfsd] [ 434.258615] Code: 40 e8 54 04 e2 e6 48 8b 7c 24 40 41 89 c4 e8 57 7e d4 e6 41 f7 d4 4c 8b 44 24 60 66 44 21 63 44 48 8b 54 24 40 e9 46 fc ff ff <0f> 0b 48 8d bd 88 00 00 00 e8 72 80 d4 e6 4c 8b ad 88 00 00 00 49 [ 434.259053] RSP: 0018:ffff888134897c10 EFLAGS: 00010246 [ 434.259211] RAX: 0000000000000000 RBX: ffff8881348791a0 RCX: ffffffffc07fbb44 [ 434.259408] RDX: 1ffff1102691001a RSI: 0000000000000002 RDI: ffff8881348800d1 [ 434.259606] RBP: ffff888134880030 R08: 0000000000000000 R09: 0000000000000001 [ 434.259802] R10: fffffbfff542dfac R11: 0000000000000001 R12: 0000000000000000 [ 434.260000] R13: ffff888133ab8000 R14: 0000000000000000 R15: ffff8881165db400 [ 434.260195] FS: 0000000000000000(0000) GS:ffff888225500000(0000) knlGS:0000000000000000 [ 434.260466] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 434.260697] CR2: 0000558890b5d178 CR3: 0000000107844000 CR4: 00000000003506e0 [ 434.260949] Call Trace: [ 434.261106] <TASK> [ 434.261260] ? nfsd4_decode_notsupp+0x10/0x10 [nfsd] [ 434.261549] ? nfsd4_interssc_connect+0x8e0/0x8e0 [nfsd] [ 434.261858] ? preempt_count_sub+0x14/0xc0 [ 434.262054] ? percpu_counter_add_batch+0x60/0xd0 [ 434.262261] nfsd4_proc_compound+0x8c6/0xe90 [nfsd] [ 434.262553] nfsd_dispatch+0x2a9/0x5c0 [nfsd] [ 434.262836] svc_process_common+0x6ab/0xc30 [sunrpc] [ 434.263162] ? svc_create_pooled+0x390/0x390 [sunrpc] [ 434.263484] ? nfsd_svc+0x830/0x830 [nfsd] [ 434.263755] ? svc_recv+0xabf/0x1310 [sunrpc] [ 434.264074] svc_process+0x1a3/0x200 [sunrpc] [ 434.264382] nfsd+0x1d7/0x390 [nfsd] [ 434.264640] ? nfsd_shutdown_threads+0x200/0x200 [nfsd] [ 434.264926] kthread+0x167/0x1a0 [ 434.265101] ? kthread_complete_and_exit+0x20/0x20 [ 434.265307] ret_from_fork+0x22/0x30 [ 434.265494] </TASK> [ 434.265652] Modules linked in: rpcsec_gss_krb5(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) rfkill(E) nf_tables(E) nfnetlink(E) iTCO_wdt(E) intel_rapl_msr(E) intel_pmc_bxt(E) iTCO_vendor_support(E) joydev(E) intel_rapl_common(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) lpc_ich(E) nfsd(OE) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) drm(E) fuse(E) sunrpc(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) virtio_blk(E) virtio_console(E) virtio_net(E) net_failover(E) failover(E) qemu_fw_cfg(E) [ 434.267660] ---[ end trace 0000000000000000 ]--- [ 434.267870] RIP: 0010:nfsd4_open+0x1940/0x1a30 [nfsd] [ 434.268161] Code: 40 e8 54 04 e2 e6 48 8b 7c 24 40 41 89 c4 e8 57 7e d4 e6 41 f7 d4 4c 8b 44 24 60 66 44 21 63 44 48 8b 54 24 40 e9 46 fc ff ff <0f> 0b 48 8d bd 88 00 00 00 e8 72 80 d4 e6 4c 8b ad 88 00 00 00 49 [ 434.268771] RSP: 0018:ffff888134897c10 EFLAGS: 00010246 [ 434.268995] RAX: 0000000000000000 RBX: ffff8881348791a0 RCX: ffffffffc07fbb44 [ 434.269247] RDX: 1ffff1102691001a RSI: 0000000000000002 RDI: ffff8881348800d1 [ 434.269511] RBP: ffff888134880030 R08: 0000000000000000 R09: 0000000000000001 [ 434.269775] R10: fffffbfff542dfac R11: 0000000000000001 R12: 0000000000000000 [ 434.270030] R13: ffff888133ab8000 R14: 0000000000000000 R15: ffff8881165db400 [ 434.270288] FS: 0000000000000000(0000) GS:ffff888225500000(0000) knlGS:0000000000000000 [ 434.270632] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 434.270862] CR2: 0000558890b5d178 CR3: 0000000107844000 CR4: 00000000003506e0 faddr2line says: $ ./scripts/faddr2line --list fs/nfsd/nfsd.ko nfsd4_open+0x1940 nfsd4_open+0x1940/0x1a30: set_change_info at /home/jlayton/git/linux/fs/nfsd/xdr4.h:752 747 #define NFS4_SVC_XDRSIZE sizeof(struct nfsd4_compoundargs) 748 749 static inline void 750 set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp) 751 { >752< BUG_ON(!fhp->fh_pre_saved); 753 cinfo->atomic = (u32)(fhp->fh_post_saved && !fhp->fh_no_atomic_attr); 754 755 cinfo->before_change = fhp->fh_pre_change; 756 cinfo->after_change = fhp->fh_post_change; 757 } (inlined by) do_open_lookup at /home/jlayton/git/linux/fs/nfsd/nfs4proc.c:502 497 accmode = NFSD_MAY_NOP; 498 if (open->op_created || 499 open->op_claim_type == NFS4_OPEN_CLAIM_DELEGATE_CUR) 500 accmode |= NFSD_MAY_OWNER_OVERRIDE; 501 status = do_open_permission(rqstp, *resfh, open, accmode); >502< set_change_info(&open->op_cinfo, current_fh); 503 out: 504 if (status) 505 inode_unlock(current_fh->fh_dentry->d_inode); 506 return status; 507 } (inlined by) nfsd4_open at /home/jlayton/git/linux/fs/nfsd/nfs4proc.c:625 620 goto out; 621 622 switch (open->op_claim_type) { 623 case NFS4_OPEN_CLAIM_DELEGATE_CUR: 624 case NFS4_OPEN_CLAIM_NULL: >625< status = do_open_lookup(rqstp, cstate, open, &resfh); 626 if (status) 627 goto out; 628 locked = true; 629 break; 630 case NFS4_OPEN_CLAIM_PREVIOUS: I haven't yet dug down into the actual cause yet.