On Fri, 08 Sep 2023, Chuck Lever III wrote: > > > On Sep 7, 2023, at 5:05 PM, NeilBrown <neilb@xxxxxxx> wrote: > > > > On Fri, 08 Sep 2023, Chuck Lever III wrote: > >> > >>> On Sep 5, 2023, at 11:01 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote: > >>> > >>>> On Sep 5, 2023, at 10:54 AM, Christian Brauner <brauner@xxxxxxxxxx> wrote: > >>>> > >>>> Hey, > >>>> > >>>> I just tried to test some changes which had commit > >>>> 99d99825fc07 ("Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs") > >>>> as base and when I booted with the appended config I saw a splat right at boot: > >>>> > >>>> [ 92.804377][ T5306] kernel BUG at net/sunrpc/svc.c:581! > > > > The most likely cause for this BUG_ON that I can see is if the > > svc_set_num_threads() call in nfsd_svc() fails. > > > > Either some listening sockets had previously been created via > > write_ports(), or they have just been created in nfsd_startup_net(). > > ->sv_nrthreads is 0 and this is an attempt to increase that. > > However the thread creation fails - presumably ENOMEM. > > So we goto out_shutdown, stepping right over the nfsd_shutdown_net() > > ca.. > > Then the svc_put() calls (probably two, as keep_active was probably > > set). result in the kref reaching zero and svc_destroy() being called > > even though there are still sockets present (because nfsd_shutdown_net() > > was skipped). > > > > I tried > > - error = svc_set_num_threads(serv, NULL, nrservs); > > + error = -ENOMEM; //svc_set_num_threads(serv, NULL, nrservs); > > if (error) > > goto out_shutdown; > > > > and I get exactly the same BUG_ON() as you got. > > Christian, can you provide the arguments your system uses for > rpc.nfsd? I'm not sure which distribution you're using, so I > can't provide the exact steps. /etc/nfs.conf is one place to > look. I'd also be interested in what the changes you were testing were, and whether they might cause a memory allocation failure. If the BUG_ON is reproducible, can you check if the is a memory allocation failure, and which one? > > Neil, do we really need a BUG_ON for this assertion? I'm > considering making it a simple pr_warn(). Interested in > opinions. Maybe replace the BUG_ON() will a call to svc_xprt_destroy_all() and maybe svc_rpcb_cleanup(). > > Past that, obviously input checking is missing here, so the > error flows in nfsd_svc() need improvement. Like: diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c index 5e455ced0711..f2c4e62e4591 100644 --- a/fs/nfsd/nfssvc.c +++ b/fs/nfsd/nfssvc.c @@ -819,12 +819,12 @@ nfsd_svc(int nrservs, struct net *net, const struct cred *cred) if (error) goto out_shutdown; error = serv->sv_nrthreads; - if (error == 0) - nfsd_last_thread(net); out_shutdown: if (error < 0 && !nfsd_up_before) nfsd_shutdown_net(net); out_put: + if (serv->sv_nrthreads == 0) + nfsd_last_thread(net); /* Threads now hold service active */ if (xchg(&nn->keep_active, 0)) svc_put(serv); ?? NeilBrown > > > > NeilBrown > > > > > >>>> [ 92.811194][ T5306] invalid opcode: 0000 [#1] PREEMPT SMP KASAN > >>>> [ 92.821472][ T5306] CPU: 6 PID: 5306 Comm: rpc.nfsd Tainted: G > >>>> [ 92.828578][ T5306] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS unknown 2/2/2022 > >>>> [ 92.836319][ T5306] RIP: 0010:svc_destroy+0x206/0x270 > >>>> [ 92.852006][ T5306] Code: 72 49 8b bc 24 a0 00 00 00 e8 a6 a3 5e f8 48 8b 3c 24 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f e9 8f a3 5e f8 e8 aa df 1c f8 <0f> 0b e8 a3 df 1c f8 0f 0b 4c 89 ff e8 39 03 79 f8 e9 ae fe ff ff > >>>> [ 92.867075][ T5306] RSP: 0018:ffffc9000a347b60 EFLAGS: 00010293 > >>>> [ 92.872714][ T5306] RAX: 0000000000000000 RBX: ffff88813abf5c68 RCX: 0000000000000000 > >>>> [ 92.884809][ T5306] RDX: ffff888126c38000 RSI: ffffffff896bcf46 RDI: 0000000000000005 > >>>> [ 92.894190][ T5306] RBP: 00000000fffffff4 R08: 0000000000000005 R09: 0000000000000000 > >>>> [ 92.900512][ T5306] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88813abf5c50 > >>>> [ 92.907935][ T5306] R13: ffff88813abf5c50 R14: ffff88813abf5c00 R15: ffff8881183c8000 > >>>> [ 92.917264][ T5306] FS: 00007fabf0bba740(0000) GS:ffff8883a9100000(0000) knlGS:0000000000000000 > >>>> [ 92.924880][ T5306] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>>> [ 92.930358][ T5306] CR2: 00005568a27d60e8 CR3: 00000001737c3000 CR4: 0000000000750ee0 > >>>> [ 92.937465][ T5306] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >>>> [ 92.943057][ T5306] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > >>>> [ 92.948673][ T5306] PKRU: 55555554 > >>>> [ 92.953452][ T5306] Call Trace: > >>>> [ 92.958082][ T5306] <TASK> > >>>> [ 92.962546][ T5306] ? show_regs+0x94/0xa0 > >>>> [ 92.967221][ T5306] ? die+0x3b/0xb0 > >>>> [ 92.971702][ T5306] ? do_trap+0x231/0x410 > >>>> [ 92.976275][ T5306] ? svc_destroy+0x206/0x270 > >>>> [ 92.980717][ T5306] ? do_error_trap+0xf9/0x230 > >>>> [ 92.985287][ T5306] ? svc_destroy+0x206/0x270 > >>>> [ 92.989693][ T5306] ? handle_invalid_op+0x34/0x40 > >>>> [ 92.994044][ T5306] ? svc_destroy+0x206/0x270 > >>>> [ 92.998317][ T5306] ? exc_invalid_op+0x2d/0x40 > >>>> [ 93.002503][ T5306] ? asm_exc_invalid_op+0x1a/0x20 > >>>> [ 93.006701][ T5306] ? svc_destroy+0x206/0x270 > >>>> [ 93.010766][ T5306] ? svc_destroy+0x206/0x270 > >>>> [ 93.014727][ T5306] nfsd_svc+0x6d4/0xac0 > >>>> [ 93.018510][ T5306] write_threads+0x296/0x4e0 > >>>> [ 93.022298][ T5306] ? write_filehandle+0x760/0x760 > >>>> [ 93.026072][ T5306] ? simple_transaction_get+0xf8/0x140 > >>>> [ 93.029819][ T5306] ? preempt_count_sub+0x150/0x150 > >>>> [ 93.033456][ T5306] ? do_raw_spin_lock+0x133/0x2c0 > >>>> [ 93.037013][ T5306] ? _copy_from_user+0x5d/0xf0 > >>>> [ 93.040385][ T5306] ? write_filehandle+0x760/0x760 > >>>> [ 93.043610][ T5306] nfsctl_transaction_write+0x100/0x180 > >>>> [ 93.046900][ T5306] vfs_write+0x2a9/0xe40 > >>>> [ 93.049930][ T5306] ? export_features_open+0x60/0x60 > >>>> [ 93.053124][ T5306] ? kernel_write+0x6c0/0x6c0 > >>>> [ 93.056116][ T5306] ? do_sys_openat2+0xb6/0x1e0 > >>>> [ 93.059167][ T5306] ? build_open_flags+0x690/0x690 > >>>> [ 93.062197][ T5306] ? __fget_light+0x201/0x270 > >>>> [ 93.065020][ T5306] ksys_write+0x134/0x260 > >>>> [ 93.067775][ T5306] ? __ia32_sys_read+0xb0/0xb0 > >>>> [ 93.070501][ T5306] ? rcu_is_watching+0x12/0xb0 > >>>> [ 93.073073][ T5306] ? trace_irq_enable.constprop.0+0xd0/0x100 > >>>> [ 93.075937][ T5306] do_syscall_64+0x38/0xb0 > >>>> [ 93.078394][ T5306] entry_SYSCALL_64_after_hwframe+0x63/0xcd > >>>> > >>>> I haven't spent time debugging this further. Maybe you see the issue right > >>>> away. > >>> > >>> I don't, unfortunately. A bisect would be appropriate. > >>> > >>> I will pull today's master branch and see if I can reproduce. > >> > >> I wasn't able to reproduce this with yesterday's master. I don't > >> recall anything in Anna's NFS client PR that might account for > >> this crash. > >> > >> Neil, I think you were the last person to touch the code in and > >> around svc_destroy(). Can you have a look at this? > >> > >> > >>>> This problem is only happening after the nfs merges afaict. I'm > >>>> currently using commit 3ef96fcfd50b ("Merge tag 'ext4_for_linus-6.6-rc1' > >>>> of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4") as base > >>>> and that splat doesn't appear. > >>>> > >>>> Hopefully this is not a red herring. > >>>> Christian > >>>> <.config.txt> > >>> > >>> -- > >>> Chuck Lever > >> > >> > >> -- > >> Chuck Lever > >> > >> > >> > > > > -- > Chuck Lever > > >