Thorston, Thanks, but I should just say that I'm not certain this is a regression yet - it could just be a change in our workload that is triggering something I haven't seen before. I am slowly working back through kernel versions to verify that - but it's really hard to trigger and does not happen often so it is slow going. Also my workload and configuration is quite unique (NFS re-exporting) so I may be the only one seeing this... Cheers, Daire On Sun, 16 Oct 2022 at 12:21, Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> wrote: > > Hi, this is your Linux kernel regression tracker speaking. > > I noticed a regression report in bugzilla.kernel.org. As many (most?) > kernel developer don't keep an eye on it, I decided to forward it by > mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=216582 : > > > Daire Byrne 2022-10-13 22:04:19 UTC > > > > Hi, > > > > I've started seeing this crash at least once or twice a week with our > > NFS re-export workloads (re-exporting a Linux NFsv3 server as > > NFSv3). > > > > We have been stepping through kernel versions a bit on the server > > recently so it feels like something new introduced somewhere around > > v5.17 but I also can't rule out that our clients are doing something > > "different" with their workloads to stress this code in some new way. > > It still occurs in v6.0 too. > > > > [106412.314663] BUG: kernel NULL pointer dereference, address: 0000000000000020 > > [106412.321879] #PF: supervisor read access in kernel mode > > [106412.327237] #PF: error_code(0x0000) - not-present page > > [106412.332599] PGD 0 P4D 0 > > [106412.335353] Oops: 0000 [#1] PREEMPT SMP NOPTI > > [106412.339935] CPU: 34 PID: 2382 Comm: lockd Tainted: G E 5.18.10-1.dneg.x86_64 #1 > > [106412.348773] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 > > [106412.358223] RIP: 0010:nlmclnt_setlockargs+0x4a/0x100 [lockd] > > [106412.364116] Code: 00 00 49 81 c0 88 00 00 00 f0 0f c1 05 bf 06 01 00 83 c0 01 c7 47 30 04 00 00 00 48 8d 4f 44 48 8d 7f 4c 89 47 c4 48 8b 46 78 <48> 8b 40 20 48 8b 90 60 fe ff ff 48 8d b0 60 fe ff ff 48 89 57 f8 > > [106412.383117] RSP: 0018:ffffb3db50cdfa80 EFLAGS: 00010202 > > [106412.388569] RAX: 0000000000000000 RBX: ffff8a36749c9400 RCX: ffff8a36749c9444 > > [106412.395924] RDX: ffff8a37f8696300 RSI: ffffb3db50cdfbd8 RDI: ffff8a36749c944c > > [106412.403277] RBP: ffffb3db50cdfa90 R08: ffff8a750b49bc88 R09: ffff8a37f8696300 > > [106412.410634] R10: 0000000000000230 R11: ffffffffffffffff R12: ffffb3db50cdfbd8 > > [106412.417984] R13: ffff8a7508beac00 R14: ffffb3db50cdfca0 R15: ffffb3db50cdfbd8 > > [106412.425338] FS: 0000000000000000(0000) GS:ffff8a73ffa80000(0000) knlGS:0000000000000000 > > [106412.433649] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [106412.439611] CR2: 0000000000000020 CR3: 00000001118e6006 CR4: 00000000003706e0 > > [106412.446984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [106412.454346] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > [106412.461696] Call Trace: > > [106412.464361] <TASK> > > [106412.466689] nlmclnt_proc+0x1c6/0x5b0 [lockd] > > [106412.471272] nfs3_proc_lock+0x33/0xb0 [nfsv3] > > [106412.475848] ? nfs_put_lock_context+0x86/0x90 [nfs] > > [106412.481008] do_unlk+0x8f/0xd0 [nfs] > > [106412.484837] nfs_lock+0xcd/0x180 [nfs] > > [106412.488815] ? nlmsvc_mark_host+0x30/0x30 [lockd] > > [106412.493752] vfs_lock_file+0x1e/0x40 > > [106412.497547] nlm_unlock_files.isra.0+0x6d/0xc0 [lockd] > > [106412.502905] nlm_traverse_files+0x163/0x2a0 [lockd] > > [106412.508020] nlmsvc_free_host_resources+0x2b/0x40 [lockd] > > [106412.513648] nlm_host_rebooted+0x2c/0x90 [lockd] > > [106412.518483] nlmsvc_proc_sm_notify+0xc0/0x130 [lockd] > > [106412.523759] ? nlmsvc_decode_reboot+0x7d/0xa0 [lockd] > > [106412.529027] nlmsvc_dispatch+0x8e/0x1a0 [lockd] > > [106412.534312] svc_process_common+0x484/0x620 [sunrpc] > > [106412.539521] ? lockd+0x1d0/0x1d0 [lockd] > > [106412.543661] ? set_grace_period+0xa0/0xa0 [lockd] > > [106412.548582] svc_process+0xbc/0xf0 [sunrpc] > > [106412.553008] lockd+0xd2/0x1d0 [lockd] > > [106412.556906] ? set_grace_period+0xa0/0xa0 [lockd] > > [106412.561849] kthread+0xee/0x120 > > [106412.565228] ? kthread_complete_and_exit+0x20/0x20 > > [106412.570239] ret_from_fork+0x1f/0x30 > > [106412.574033] </TASK> > > [106412.576436] Modules linked in: tcp_diag(E) inet_diag(E) nfsv3(E) nfs(E) cachefiles(E) fscache(E) netfs(E) ext4(E) mbcache(E) jbd2(E) intel_uncore_frequency_common(E) isst_if_common(E) sg(E) nfit(E) virtio_rng(E) rapl(E) i2c_piix4(E) input_leds(E) nfsd(E) sch_fq(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) tcp_bbr(E) binfmt_misc(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E) t10_pi(E) crc64_rocksoft_generic(E) crc64_rocksoft(E) crc64(E) crct10dif_pclmul(E) crc32_pclmul(E) virtio_scsi(E) crc32c_intel(E) ghash_clmulni_intel(E) 8021q(E) garp(E) mrp(E) virtio_pci(E) scsi_transport_iscsi(E) virtio_pci_legacy_dev(E) aesni_intel(E) virtio_pci_modern_dev(E) crypto_simd(E) virtio_ring(E) cryptd(E) gve(E) serio_raw(E) virtio(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) fuse(E) > > [106412.646242] CR2: 0000000000000020 > > [106412.649780] ---[ end trace 0000000000000000 ]--- > > [106412.654617] RIP: 0010:nlmclnt_setlockargs+0x4a/0x100 [lockd] > > [106412.660495] Code: 00 00 49 81 c0 88 00 00 00 f0 0f c1 05 bf 06 01 00 83 c0 01 c7 47 30 04 00 00 00 48 8d 4f 44 48 8d 7f 4c 89 47 c4 48 8b 46 78 <48> 8b 40 20 48 8b 90 60 fe ff ff 48 8d b0 60 fe ff ff 48 89 57 f8 > > [106412.679481] RSP: 0018:ffffb3db50cdfa80 EFLAGS: 00010202 > > [106412.684922] RAX: 0000000000000000 RBX: ffff8a36749c9400 RCX: ffff8a36749c9444 > > [106412.692269] RDX: ffff8a37f8696300 RSI: ffffb3db50cdfbd8 RDI: ffff8a36749c944c > > [106412.699617] RBP: ffffb3db50cdfa90 R08: ffff8a750b49bc88 R09: ffff8a37f8696300 > > [106412.706969] R10: 0000000000000230 R11: ffffffffffffffff R12: ffffb3db50cdfbd8 > > [106412.714329] R13: ffff8a7508beac00 R14: ffffb3db50cdfca0 R15: ffffb3db50cdfbd8 > > [106412.721676] FS: 0000000000000000(0000) GS:ffff8a73ffa80000(0000) knlGS:0000000000000000 > > [106412.729981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [106412.736472] CR2: 0000000000000020 CR3: 00000001118e6006 CR4: 00000000003706e0 > > [106412.743821] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [106412.751171] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > [106412.758520] Kernel panic - not syncing: Fatal exception > > [106412.764850] Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > > [106412.775850] ---[ end Kernel panic - not syncing: Fatal exception ]--- > > > > > > All I know is that I didn't notice this crash from v5.12 to v5.16 but > > I have not been able to test this qualitatively yet. The crash is > > rare enough that it makes A/B testing quite tricky. > > > > It's somewhat similar to > > https://bugzilla.kernel.org/show_bug.cgi?id=213273 but that was for a > > NFv4.2 re-export of NFSv3 and this is for a NFSv3 re-export of NFSv3 > > (for WAN caching). > > > > We are using nfs-utils-2.5.4. > > > > Daire > > See the ticket for more details. > > BTW, let me use this mail to also add the report to the list of tracked > regressions to ensure it's doesn't fall through the cracks: > > #regzbot introduced: v5.17..v5.18 > #regzbot ignore-activity > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > > P.S.: As the Linux kernel's regression tracker I deal with a lot of > reports and sometimes miss something important when writing mails like > this. If that's the case here, don't hesitate to tell me in a public > reply, it's in everyone's interest to set the public record straight.