Re: BUG: kernel NULL pointer dereference, address: 0000000000000042

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



(resending as text so it doesn't get rejected as spam by the list;
since I responded on Feb 12 we've seen it happen one more time)

On Thu, 2023-02-09 at 21:50 +0000, Matthew Wilcox wrote:
> On Fri, Feb 10, 2023 at 08:30:02AM +1100, Dave Chinner wrote:
> > [cc willy, linux-mm, as it crashed walking the page cache in the
> > generic fault code]
> 
> I've seen this one occasionally, and I'm not sure what's going on.
> I've never been able to reproduce it myself, and it seems to
> disappear
> for the people who have been able to reproduce it ;-(
> 
> It is 100% my fault and definitely caused by large folios.  In the
> XArray, large folios are represented by a folio pointer in the lowest
> index occupied by that folio and sibling entries in every other
> index,
> which redirect lookups to the canonical (ie lowest) entry.  This 0x42
> that you've managed to find in the XArray is a sibling entry.  It
> says that the entry we're actually looking for is at offset 0x10 of
> the node we're in.  
> 
> Something similar was fixed in commit 63b1898fffcd, but that was a
> sibling entry that ended up pointing to a node.  You've *presumably*
> hit some kind of temporary situation where the original sibling entry
> is no
> longer pointing to the folio entry that it should be.  However,
> there's
> another possibility, which is that this is not a temporary RCU-
> induced
> state, but we have corruption in the tree.  If we do have corruption,
> then you'll see an infinite loop instead of a crash.
> 
> If it's a temporary situation, this will fix it.
> 

I'm unfortunately not in a position to test a fix. 

> diff --git a/lib/xarray.c b/lib/xarray.c
> index ea9ce1f0b386..4237a9647a6a 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -207,7 +207,8 @@ static void *xas_descend(struct xa_state *xas,
> struct xa_node *node)
>         if (xa_is_sibling(entry)) {
>                 offset = xa_to_sibling(entry);
>                 entry = xa_entry(xas->xa, node, offset);
> -               if (node->shift && xa_is_node(entry))
> +               if (xa_is_sibling(entry) ||
> +                   (node->shift && xa_is_node(entry)))
>                         entry = XA_RETRY_ENTRY;
>         }
>  
> Please do let me know ... you say it's happened twice, but how many
> machine-hours did it take to hit twice?


That's hard to say. There are ~5 machines doing this work, the kernel
was installed in early February, so around 1000 machine-hours, but what
part of the time they were busy and how much of that they were running
the triggering workload, I can't say.

> 
> > On Thu, Feb 09, 2023 at 10:43:10AM +0200, Avi Kivity wrote:
> > > Workload: compilation and running unit tests. The task that
> > > crashed is
> > > a unit test.
> > > 
> > > Kernel: 6.1.8-200.fc37.x86_64
> > > 
> > > Previously known stable on 5.8.9-200.fc32.x86_64. Two crashes
> > > seen so
> > > far.
> > > 
> > > 
> > > Feb  7 17:19:33 localhost kernel: BUG: kernel NULL pointer
> > > dereference,
> > > address: 0000000000000042
> > > Feb  7 17:19:33 localhost kernel: #PF: supervisor read access in
> > > kernel
> > > mode
> > > Feb  7 17:19:33 localhost kernel: #PF: error_code(0x0000) - not-
> > > present
> > > page
> > > Feb  7 17:19:33 localhost kernel: PGD 80000001cbb1f067 P4D
> > > 80000001cbb1f067 PUD 9cbb75067 PMD 0 
> > > Feb  7 17:19:33 localhost kernel: Oops: 0000 [#1] PREEMPT SMP PTI
> > > Feb  7 17:19:33 localhost kernel: CPU: 24 PID: 3718328 Comm:
> > > transport_test Tainted: G S                 6.1.8-200.fc37.x86_64
> > > #1
> > > Feb  7 17:19:33 localhost kernel: Hardware name: Dell Inc.
> > > PowerEdge
> > > R730/0599V5, BIOS 2.9.1 12/04/2018
> > > Feb  7 17:19:33 localhost kernel: RIP:
> > > 0010:next_uptodate_page+0x46/0x200
> > > Feb  7 17:19:33 localhost kernel: Code: 0f 84 3f 01 00 00 48 81
> > > ff 06
> > > 04 00 00 0f 84 b3 00 00 00 48 81 ff 02 04 00 00 0f 84 37 01 00 00
> > > 40 f6
> > > c7 01 0f 85 9c 00 00 00 <48> 8b 07 a8 01 0f 85 91 00 00 00 8b 47
> > > 34 85
> > > c0 0f 84 86 00 00 00
> > > Feb  7 17:19:33 localhost kernel: RSP: 0000:ffffa83e4ed67cc8
> > > EFLAGS:
> > > 00010246
> > > Feb  7 17:19:33 localhost kernel: RAX: 0000000000000042 RBX:
> > > ffffa83e4ed67e00 RCX: 000000000000146e
> > > Feb  7 17:19:33 localhost kernel: RDX: ffffa83e4ed67d20 RSI:
> > > ffff94a9046316b0 RDI: 0000000000000042
> > > Feb  7 17:19:33 localhost kernel: RBP: ffffa83e4ed67d20 R08:
> > > 000000000000146e R09: 0000000000dfd000
> > > Feb  7 17:19:33 localhost kernel: R10: 000000000000145f R11:
> > > ffff94978b85960c R12: ffff94a9046316b0
> > > Feb  7 17:19:33 localhost kernel: R13: 000000000000146e R14:
> > > ffff94a9046316b0 R15: ffff948f8bb1f000
> > > Feb  7 17:19:33 localhost kernel: FS:  00007fd68fcb9d40(0000)
> > > GS:ffff949dffd00000(0000) knlGS:0000000000000000
> > > Feb  7 17:19:33 localhost kernel: CS:  0010 DS: 0000 ES: 0000
> > > CR0:
> > > 0000000080050033
> > > Feb  7 17:19:33 localhost kernel: CR2: 0000000000000042 CR3:
> > > 00000001dc1be005 CR4: 00000000001706e0
> > > Feb  7 17:19:33 localhost kernel: Call Trace:
> > > Feb  7 17:19:33 localhost kernel: <TASK>
> > > Feb  7 17:19:33 localhost kernel: filemap_map_pages+0x9f/0x7b0
> > > Feb  7 17:19:33 localhost kernel: xfs_filemap_map_pages+0x41/0x60
> > > [xfs]
> > > Feb  7 17:19:33 localhost kernel: do_fault+0x1bf/0x430
> > > Feb  7 17:19:33 localhost kernel: __handle_mm_fault+0x63d/0xe40
> > > Feb  7 17:19:33 localhost kernel: ? do_sigaction+0x11a/0x240
> > > Feb  7 17:19:33 localhost kernel: handle_mm_fault+0xdb/0x2d0
> > > Feb  7 17:19:33 localhost kernel: do_user_addr_fault+0x1cd/0x690
> > > Feb  7 17:19:33 localhost kernel: exc_page_fault+0x70/0x170
> > > Feb  7 17:19:33 localhost kernel: asm_exc_page_fault+0x22/0x30
> > > Feb  7 17:19:33 localhost kernel: RIP: 0033:0x1666350
> > > Feb  7 17:19:33 localhost kernel: Code: Unable to access opcode
> > > bytes
> > > at 0x1666326.
> > > Feb  7 17:19:33 localhost kernel: RSP: 002b:00007ffde7fa86d8
> > > EFLAGS:
> > > 00010212
> > > Feb  7 17:19:33 localhost kernel: RAX: 0000000000000000 RBX:
> > > 00007ffde7fa8748 RCX: 0000000002ed4468
> > > Feb  7 17:19:33 localhost kernel: RDX: 00006000000c4f50 RSI:
> > > 00007ffde7fa8748 RDI: 0000000000000012
> > > Feb  7 17:19:33 localhost kernel: RBP: 0000000000000012 R08:
> > > 0000000000000001 R09: 0000000002f46860
> > > Feb  7 17:19:33 localhost kernel: R10: 00007fd69219cac0 R11:
> > > 00007fd69224e670 R12: 0000000000000000
> > > Feb  7 17:19:33 localhost kernel: R13: 00006000000c4f50 R14:
> > > 0000000002ed4470 R15: 00007fd693be0000
> > > Feb  7 17:19:33 localhost kernel: </TASK>
> > > Feb  7 17:19:33 localhost kernel: Modules linked in: xsk_diag
> > > veth tls
> > > xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype
> > > nft_compat
> > > br_netfilter bridge stp llc intel_rapl_msr dell_wmi iTCO_wdt
> > > dell_smbios intel_pmc_bxt iTCO_vendor_support dell_wmi_descriptor
> > > ledtrig_audio sparse_keymap video dcdbas intel_rapl_common
> > > sb_edac
> > > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> > > ipmi_ssif
> > > irqbypass rapl intel_cstate intel_uncore ipmi_si ipmi_devintf
> > > ipmi_msghandler nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> > > nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> > > nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > > rfkill
> > > overlay ip_set nf_tables nfnetlink qrtr acpi_power_meter mxm_wmi
> > > mei_me
> > > mei lpc_ich auth_rpcgss ip6_tables ip_tables sunrpc zram xfs
> > > crct10dif_pclmul crc32_pclmul nvme crc32c_intel polyval_clmulni
> > > polyval_generic ixgbe ghash_clmulni_intel nvme_core sha512_ssse3
> > > megaraid_sas tg3 mgag200 mdio nvme_common dca wmi scsi_dh_rdac
> > > scsi_dh_emc scsi_dh_alua
> > > Feb  7 17:19:33 localhost kernel: dm_multipath fuse
> > > Feb  7 17:19:33 localhost kernel: CR2: 0000000000000042
> > > Feb  7 17:19:33 localhost kernel: ---[ end trace 0000000000000000
> > > ]---
> > > Feb  7 17:19:33 localhost kernel: RIP:
> > > 0010:next_uptodate_page+0x46/0x200
> > > Feb  7 17:19:33 localhost kernel: Code: 0f 84 3f 01 00 00 48 81
> > > ff 06
> > > 04 00 00 0f 84 b3 00 00 00 48 81 ff 02 04 00 00 0f 84 37 01 00 00
> > > 40 f6
> > > c7 01 0f 85 9c 00 00 00 <48> 8b 07 a8 01 0f 85 91 00 00 00 8b 47
> > > 34 85
> > > c0 0f 84 86 00 00 00
> > > Feb  7 17:19:33 localhost kernel: RSP: 0000:ffffa83e4ed67cc8
> > > EFLAGS:
> > > 00010246
> > > Feb  7 17:19:33 localhost kernel: RAX: 0000000000000042 RBX:
> > > ffffa83e4ed67e00 RCX: 000000000000146e
> > > Feb  7 17:19:33 localhost kernel: RDX: ffffa83e4ed67d20 RSI:
> > > ffff94a9046316b0 RDI: 0000000000000042
> > > Feb  7 17:19:33 localhost kernel: RBP: ffffa83e4ed67d20 R08:
> > > 000000000000146e R09: 0000000000dfd000
> > > Feb  7 17:19:33 localhost kernel: R10: 000000000000145f R11:
> > > ffff94978b85960c R12: ffff94a9046316b0
> > > Feb  7 17:19:33 localhost kernel: R13: 000000000000146e R14:
> > > ffff94a9046316b0 R15: ffff948f8bb1f000
> > > Feb  7 17:19:33 localhost kernel: FS:  00007fd68fcb9d40(0000)
> > > GS:ffff949dffd00000(0000) knlGS:0000000000000000
> > > 
> > 





[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux