---------- Original e-mail ----------
From: John David Anglin
To: linux-parisc@xxxxxxxxxxxxxxx
CC: Helge Deller
Date: 5. 5. 2024 19:07:17
Subject: [PATCH] parisc: Try to fix random segmentation faults in package builds

> The majority of random segmentation faults that I have looked at
> appear to be memory corruption in memory allocated using mmap and
> malloc. This got me thinking that there might be issues with the
> parisc implementation of flush_anon_page.
>
> [...]
>
> Lightly tested on rp3440 and c8000.

Hello,

thank you very much for working on the issue and for the patch! I tested it on my C8000 with the 6.8.9 kernel with Gentoo distribution patches.

My machine is affected heavily by the segfaults – with some kernel configurations, I get several per hour when compiling Gentoo packages on all four cores.

This patch doesn't fix them, though. On the patched kernel, it happened after ~8h of uptime during installation of the perl-core/Test-Simple package. I got no error output from the running program, but an HPMC was logged to the serial console:

[30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
<Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
<Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
<Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
<Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
<Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
[30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
[30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
[30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
[30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
[30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
[30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
[30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58

A longer excerpt of the logs is attached.
The error happened at kernel timestamp 30007 (seconds since boot); the preceding unaligned accesses seem to be unrelated.

The patch didn't apply cleanly, but all hunks succeeded with some offsets and fuzz. This may also be part of the problem, since I didn't check the code for merge conflicts manually.

If you want me to provide you with more logs (such as the HPMC dumps) or run some experiments, let me know.

Some speculation about the cause of the errors follows:

I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on the same machine.

The errors seem to be more frequent under a heavy IO load, so they might be system-bus or PCI-bus-related. Using X11 causes lockups rather quickly, but that could be caused by unrelated errors in the graphics subsystem and/or the Radeon drivers.

Limiting the machine to a single socket (2 cores) by disabling the other socket in firmware, or even booting on a single core using the maxcpus=1 kernel cmdline option, decreases the error frequency but doesn't prevent the errors completely, at least on an (unpatched) 6.1 kernel. So it's probably not an SMP bug. If it's related to cache coherency, it's coherency between the CPUs and bus IO.

The errors typically manifest as an access to a very low address in the null page, so probably a null pointer dereference. I think the kernel accidentally maps a zeroed page in place of one that the program was using previously, making it load (and subsequently dereference) a null pointer instead of a valid one. There are two problems with this theory, though:

1. It would mean the program could also load zeroed /data/ instead of a zeroed /pointer/, causing data corruption. I never conclusively observed this, although I am getting GCC ICEs from time to time, which could be explained by data corruption.

2. The segfault is sometimes preceded by an unaligned access, which I believe is also caused by a corrupted machine state rather than by a coding error in the program. Sometimes a bunch of unaligned accesses show up in the logs just prior to a segfault or lockup, even from unrelated programs such as random bash processes. Sometimes the machine keeps working afterwards (although I typically reboot it immediately to limit the consequences of potential kernel data structure damage), sometimes it HPMCs or LPMCs. This is difficult to explain by a stray zeroed page alone. But this typically happens when running X11, so again, it might be caused by another bug, such as the GPU randomly writing to memory via misconfigured DMA.
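
One way the zeroed-page theory could be probed from user space is sketched below. This is only an illustration, not something that was run on the affected machine: the function names (fill, find_bad_pages, run) and the page pattern are hypothetical, and it assumes a Unix system where Python's mmap module accepts the flags argument. It fills a private anonymous mapping with a non-zero pattern and then re-reads it, separating pages that came back entirely zero from pages that were corrupted in some other way.

```python
import mmap

PAGE = mmap.PAGESIZE


def expected_page(index):
    """Pattern for one page: 8-byte (index + 1) header, then 0xA5 filler."""
    return (index + 1).to_bytes(8, "little") + b"\xa5" * (PAGE - 8)


def fill(buf, npages):
    """Write a non-zero, page-specific pattern into every page."""
    for i in range(npages):
        buf[i * PAGE:(i + 1) * PAGE] = expected_page(i)


def find_bad_pages(buf, npages):
    """Return (page index, kind) for each page that lost its pattern."""
    zero = bytes(PAGE)
    bad = []
    for i in range(npages):
        page = buf[i * PAGE:(i + 1) * PAGE]
        if page == zero:
            bad.append((i, "zeroed"))      # matches the zeroed-page theory
        elif page != expected_page(i):
            bad.append((i, "corrupted"))   # some other kind of damage
    return bad


def run(npages=256, rounds=3):
    """Fill a private anonymous mapping once, then verify it repeatedly."""
    # Assumption: Unix-only call; fileno -1 requests anonymous memory.
    buf = mmap.mmap(-1, npages * PAGE, flags=mmap.MAP_PRIVATE)
    fill(buf, npages)
    bad = []
    for _ in range(rounds):
        bad.extend(find_bad_pages(buf, npages))
    buf.close()
    return bad


if __name__ == "__main__":
    print(run())
```

To be meaningful, such a check would have to run for hours alongside a heavy IO load with much larger npages and rounds. A clean run proves nothing, and a C version would be closer to the malloc/mmap usage of the programs that actually crash.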
Attachment:
parisc-hpmc-6.8.9-patched.log