On Thu, 20 Jun 2024 00:42:37 +0200, Erhard Furtner wrote:
>> On 29/02/2024 at 02:09, Erhard Furtner wrote:
>> >
>> > Revisited the issue on kernel v6.8-rc6 and I can still reproduce it.
>> >
>> > Short summary as my last post was over a year ago:
>> > (x) I get this memory corruption only when CONFIG_VMAP_STACK=y and CONFIG_SMP=y is enabled.
>> > (x) I don't get this memory corruption when only one of the above is enabled. ^^
>> > (x) memtester says the 2 GiB RAM in my G4 DP are fine.
>> > (x) I don't get this issue on my G5 11,2 or Talos II.
>> > (x) "stress -m 2 --vm-bytes 965M" provokes the issue in < 10 secs. (https://salsa.debian.org/debian/stress)
>> >
> The "pagealloc: memory corruption" remains however as of kernel v6.10-rc4.

I've reproduced the bug on similar hardware, also a dual-processor Power Mac G4 with 2 GiB RAM.

With the 6.6.30 kernel without extra debugging options, the system was stable and could e.g. compile GCC or the kernel without an issue. That doesn't mean there wasn't silent corruption going on, of course. :-)

Running the `stress` program as listed above did, however, put the system into an unstable state in which heavier workloads, such as compiling the kernel, would randomly fail.

I updated the kernel to 6.10.3, enabled SLUB_DEBUG, PAGE_POISONING and DEBUG_PAGEALLOC, and turned them on at boot time with slub_debug=FZ page_poison=on debug_pagealloc=on.

The updated kernel exhibits the same symptoms as described by Erhard: running `stress -m 2 --vm-bytes 965M` almost immediately causes memory corruption, with the following messages in dmesg:

```
pagealloc: memory corruption
fffcfff0: 00 00 00 00 ....
CPU: 1 PID: 1845 Comm: stress Tainted: G T 6.10.3-gentoo #1
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[f2d05ca0] [c08ff18c] dump_stack_lvl+0x60/0xbc (unreliable)
[f2d05cc0] [c01db7e0] __kernel_unpoison_pages+0x128/0x1f0
[f2d05d10] [c01bc6c4] get_page_from_freelist+0xeb0/0xf6c
[f2d05db0] [c01bcf7c] __alloc_pages_noprof+0x160/0xdf0
[f2d05e70] [c01be388] __folio_alloc_noprof+0x14/0x44
[f2d05e80] [c0199690] handle_mm_fault+0x99c/0xdac
[f2d05f00] [c00218c8] do_page_fault+0x264/0x73c
[f2d05f30] [c000433c] DataAccess_virt+0x124/0x17c
--- interrupt: 300 at 0x7c2db0
NIP: 007c2db0 LR: 007c2d90 CTR: 00000000
REGS: f2d05f40 TRAP: 0300 Tainted: G T (6.10.3-gentoo)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 20882004 XER: 00000000
DAR: 8fe18020 DSISR: 42000000
GPR00: 007c2d90 afb6a160 a7a00100 6b416020 ffffffa0 00000000 a7916ffc 00000000
GPR08: 24a03000 24a02000 00000000 404347fa 404344c7 00000000 00000000 0000005a
GPR16: 6b416020 00000002 00000000 00000000 ffffffff 00000000 40882002 007e0004
GPR24: 00000001 ffffffff ffffffff 3c500000 00000000 66b7cd68 007e7cf8 00001000
NIP [007c2db0] 0x7c2db0
LR [007c2d90] 0x7c2d90
--- interrupt: 300
page: refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x31069
flags: 0x80000000(zone=2)
raw: 80000000 00000100 00000122 00000000 00000000 00000000 ffffffff 00000001
page dumped because: pagealloc: corrupted page details
```

Other activity can also trigger it: compiling larger programs with `make -j2` does so within an hour, typically resulting in an ICE.

When booted with the `maxcpus=0` kernel parameter, which disables SMP, the corruptions do not occur.
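
For anyone who wants to automate the check while running the reproducer, here is a minimal sketch in Python that scans captured dmesg text for the report shown above. The function name `find_corruption_reports` and the choice to pick up the `pfn:` field from the page dump are illustrative assumptions; this is not part of the test setup described in this thread, and it only matches the message format quoted here.

```python
import re
from typing import Dict, List

# Marker string taken from the dmesg output quoted in this mail.
MARKER = "pagealloc: memory corruption"
# The page dump that follows a report contains a "pfn:0x..." field.
PFN_RE = re.compile(r"\bpfn:(0x[0-9a-fA-F]+)")


def find_corruption_reports(dmesg_text: str) -> List[Dict[str, object]]:
    """Return one dict per corruption report found in dmesg_text.

    Each dict holds the 1-based line number of the marker and the pfn
    from the first page dump after it (None if no pfn line follows).
    """
    reports: List[Dict[str, object]] = []
    for lineno, line in enumerate(dmesg_text.splitlines(), start=1):
        if MARKER in line:
            # Start a new report; the pfn is filled in when we see it.
            reports.append({"line": lineno, "pfn": None})
            continue
        match = PFN_RE.search(line)
        if match and reports and reports[-1]["pfn"] is None:
            reports[-1]["pfn"] = int(match.group(1), 16)
    return reports
```

The function takes the log as a plain string, so it can be fed the saved output of `dmesg` after a `stress` run. An empty result only means that no report was logged, not that no corruption happened.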