Re: BUG: Bad page map in process init pte:c0ab684c pmd:01182000 (on a PowerMac G4 DP)

Erhard Furtner <erhard_f@xxxxxxxxxxx> · Thu, 29 Feb 2024 02:09:41 +0100

On Mon, 12 Dec 2022 14:31:35 +1000
"Nicholas Piggin" <npiggin@xxxxxxxxx> wrote:

> On Thu Dec 1, 2022 at 7:44 AM AEST, Erhard F. wrote:
> > Getting this at boot sometimes, but not always (PowerMac G4 DP, kernel 6.0.9):
> >
> > [...]
> > Freeing unused kernel image (initmem) memory: 1328K
> > Checked W+X mappings: passed, no W+X pages found
> > rodata_test: all tests were successful
> > Run /sbin/init as init process
> > _swap_info_get: Bad swap file entry 24c0ab68
> > BUG: Bad page map in process init  pte:c0ab684c pmd:01182000  
> 
> Have you run memtest on the system? Are the messages related to a
> kernel upgrade? This and your KASAN bugs look possibly like random
> corruption.
> 
> Although with that KASAN one it's strange that kernfs_node_cache
> was involved both times, it's strange that page tables are pointing
> to that same slab memory. It could be a page table page use-after
> -free maybe? Maybe with the page table fragment code. I'm sure other
> people would have hit that before though, so I don't know what to
> suggest.
> 
> Thanks,
> Nick

Revisited the issue on kernel v6.8-rc6 and I can still reproduce it.

Short summary as my last post was over a year ago:
 (x) I get this memory corruption only when CONFIG_VMAP_STACK=y and CONFIG_SMP=y is enabled.
 (x) I don't get this memory corruption when only one of the above is enabled. ^^
 (x) memtester says the 2 GiB RAM in my G4 DP are fine.
 (x) I don't get this issue on my G5 11,2 or Talos II.
 (x) "stress -m 2 --vm-bytes 965M" provokes the issue in < 10 secs. (https://salsa.debian.org/debian/stress)

For the test I used CONFIG_KASAN_INLINE=y for v6.8-rc6 and debug_pagealloc=on, page_owner=on and got this dmesg:

[...]
pagealloc: memory corruption
f5fcfff0: 00 00 00 00                                      ....
CPU: 1 PID: 1788 Comm: stress Tainted: G    B              6.8.0-rc6-PMacG4 #15
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[f3bfbac0] [c162a8e8] dump_stack_lvl+0x60/0x94 (unreliable)
[f3bfbae0] [c04edf9c] __kernel_unpoison_pages+0x1e0/0x1f0
[f3bfbb30] [c04a8aa0] post_alloc_hook+0xe0/0x174
[f3bfbb60] [c04a8b58] prep_new_page+0x24/0xbc
[f3bfbb80] [c04abcc4] get_page_from_freelist+0xcd0/0xf10
[f3bfbc50] [c04aecd8] __alloc_pages+0x204/0xe2c
[f3bfbda0] [c04b07a8] __folio_alloc+0x18/0x88
[f3bfbdc0] [c0461a10] vma_alloc_zeroed_movable_folio.isra.0+0x2c/0x6c
[f3bfbde0] [c046bb90] handle_mm_fault+0x91c/0x19ac
[f3bfbec0] [c0047b8c] ___do_page_fault+0x93c/0xc14
[f3bfbf10] [c0048278] do_page_fault+0x28/0x60
[f3bfbf30] [c000433c] DataAccess_virt+0x124/0x17c
--- interrupt: 300 at 0xbe30d8
NIP:  00be30d8 LR: 00be30b4 CTR: 00000000
REGS: f3bfbf40 TRAP: 0300   Tainted: G    B               (6.8.0-rc6-PMacG4)
MSR:  0000d032 <EE,PR,ME,IR,DR,RI>  CR: 20882464  XER: 00000000
DAR: 88c7a010 DSISR: 42000000 
GPR00: 00be30b4 af8397d0 a78436c0 6b2ee010 3c500000 20224462 fe77f7e1 00b00264 
GPR08: 1d98d000 1d98c000 00000000 40ae256a 20882262 00bffff4 00000000 00000000 
GPR16: 00000000 00000002 00000000 0000005a 40802262 80002262 40002262 00c000a4 
GPR24: ffffffff ffffffff 3c500000 00000000 00000000 6b2ee010 00c07d64 00001000 
NIP [00be30d8] 0xbe30d8
LR [00be30b4] 0xbe30b4
--- interrupt: 300
page:ef4bd92c refcount:1 mapcount:0 mapping:00000000 index:0x1 pfn:0x310b3
flags: 0x80000000(zone=2)
page_type: 0xffffffff()
raw: 80000000 00000100 00000122 00000000 00000001 00000000 ffffffff 00000001
raw: 00000000
page dumped because: pagealloc: corrupted page details
page_owner info is not present (never set?)
swapper/1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G    B              6.8.0-rc6-PMacG4 #15
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[f101b9d0] [c162a8e8] dump_stack_lvl+0x60/0x94 (unreliable)
[f101b9f0] [c04ae948] warn_alloc+0x154/0x2e0
[f101bab0] [c04af030] __alloc_pages+0x55c/0xe2c
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
  cache: skbuff_head_cache, object size: 176, buffer size: 288, default order: 0, min order: 0
  node 0: slabs: 509, objs: 7126, free: 0
[...] 

New findings:
 (x) The page corruption only shows up the 1st time I run "stress -m 2 --vm-bytes 965M". When I quit and restart stress no additional page corruption shows up.
 (x) The page corruption shows up shortly after I run "stress -m 2 --vm-bytes 965M" but no additional page corruption shows up afterwards, even if left running for 30min.

For additional testing I thought it would be a good idea to try "modprobe test_vmalloc" but this remained inconclusive. Sometimes a 'BUG: Unable to handle kernel data access on read at 0xe0000000' like this shows up but not always:

[...]
vmap allocation for size 8192 failed: use vmalloc=<size> to increase size
vmalloc_test/0: vmalloc error: size 4096, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 1 PID: 1792 Comm: vmalloc_test/0 Tainted: G    B              6.8.0-rc6-PMacG4 #12
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[f3bdbcc0] [c162a8c8] dump_stack_lvl+0x60/0x94 (unreliable)
[f3bdbce0] [c04ae928] warn_alloc+0x154/0x2e0
[f3bdbda0] [c04a02d0] __vmalloc_node_range+0x168/0x96c
[f3bdbe60] [c04a0bc0] __vmalloc_node+0xec/0x114
[f3bdbea0] [bfa3d308] kvfree_rcu_2_arg_vmalloc_test+0x50/0xb4 [test_vmalloc]
[f3bdbec0] [bfa3e254] test_func+0x490/0x568 [test_vmalloc]
[f3bdbfc0] [c00ff1fc] kthread+0x384/0x394
[f3bdbff0] [c002a304] start_kernel_thread+0x10/0x14
Mem-Info:
active_anon:13098 inactive_anon:0 isolated_anon:0
 active_file:31626 inactive_file:9834 isolated_file:0
 unevictable:0 dirty:1 writeback:0
 slab_reclaimable:1935 slab_unreclaimable:23175
 mapped:15912 shmem:148 pagetables:260
 sec_pagetables:0 bounce:0
 kernel_misc_reclaimable:0
 free:379850 free_pcp:188 free_cma:0
Node 0 active_anon:52392kB inactive_anon:0kB active_file:126504kB inactive_file:39336kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:63648kB dirty:4kB writeback:0kB shmem:592kB writeback_tmp:0kB kernel_stack:1984kB pagetables:1040kB sec_pagetables:0kB all_unreclaimable? no
DMA free:545332kB boost:0kB min:3312kB low:4140kB high:4968kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:20864kB unevictable:0kB writepending:0kB present:786432kB managed:687812kB mlocked:0kB bounce:0kB free_pcp:752kB local_pcp:192kB free_cma:0kB
lowmem_reserve[]: 0 0 1184 1184
DMA: 3*4kB (ME) 15*8kB (UME) 5*16kB (ME) 7*32kB (UME) 2*64kB (UE) 2*128kB (ME) 1*256kB (M) 1*512kB (U) 5*1024kB (UM) 5*2048kB (UM) 129*4096kB (M) = 545332kB
41608 total pagecache pages
0 pages in swap cache
Free swap  = 8388604kB
Total swap = 8388604kB
524288 pages RAM
327680 pages HighMem/MovableOnly
49231 pages reserved
vmap allocation for size 8192 failed: use vmalloc=<size> to increase size
vmap allocation for size 8192 failed: use vmalloc=<size> to increase size
vmap allocation for size 147456 failed: use vmalloc=<size> to increase size
vmap allocation for size 8192 failed: use vmalloc=<size> to increase size
vmap allocation for size 4194304 failed: use vmalloc=<size> to increase size
BUG: Unable to handle kernel data access on read at 0xe0000000
Faulting instruction address: 0xbfa3dc58
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 DEBUG_PAGEALLOC PowerMac
Modules linked in: test_vmalloc(+) af_packet ipv6 b43legacy mac80211 libarc4 cfg80211 snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa rfkill therm_windtunnel firewire_ohci firewire_core crc_itu_t snd_aoa_i2sbus snd_aoa_soundbus snd_pcm snd_timer snd soundcore ssb mac_hid pcmcia pcmcia_core radeon hwmon drm_suballoc_helper i2c_algo_bit drm_ttm_helper ttm drm_display_helper backlight ohci_pci
CPU: 1 PID: 1792 Comm: vmalloc_test/0 Tainted: G    B              6.8.0-rc6-PMacG4 #12
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
NIP:  bfa3dc58 LR: bfa3dc48 CTR: 00000000
REGS: f3bdbdd0 TRAP: 0300   Tainted: G    B               (6.8.0-rc6-PMacG4)
MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 44000004  XER: 00000000
DAR: e0000000 DSISR: 40000000 
GPR00: 00000000 f3bdbe90 c3fe1820 00000000 00000000 00000000 00000000 00000000 
GPR08: 00000000 e0000000 00000000 f3bdbdf0 00000000 00000000 18baab26 00000000 
GPR16: c5d55930 f227d040 00000000 0000001e 00000007 f2276260 f22762b4 000000a0 
GPR24: c5d55928 00000061 f2280000 00000000 00000000 00000001 c4832ea8 00000001 
NIP [bfa3dc58] vm_map_ram_test+0xdc/0x194 [test_vmalloc]
LR [bfa3dc48] vm_map_ram_test+0xcc/0x194 [test_vmalloc]
Call Trace:
[f3bdbe90] [bfa3dc48] vm_map_ram_test+0xcc/0x194 [test_vmalloc] (unreliable)
[f3bdbec0] [bfa3e254] test_func+0x490/0x568 [test_vmalloc]
[f3bdbfc0] [c00ff1fc] kthread+0x384/0x394
[f3bdbff0] [c002a304] start_kernel_thread+0x10/0x14
Code: 48a7bc49 307fffff 7c63f910 4bffff7c 38a0ffff 7fa4eb78 7fc3f378 48a608c9 5469e8fe 5468077e 7c7c1b78 3d29e000 <89290000> 7d2a0774 5529cffe 7d0a4010 
---[ end trace 0000000000000000 ]---

note: vmalloc_test/0[1792] exited with irqs disabled
[...]

So this may be a different issue as the memory corruption or just the test gone wrong?

Full dmesg + kernel .config attached. The KASAN hit in the dmesg earlier 'BUG: KASAN: vmalloc-out-of-bounds in memset32' Christophe already reported as https://lore.kernel.org/all/2000a30f-214a-4b20-b0b5-348e987d6a0e@xxxxxxxxxx/

Regards,
Erhard
Attachment:
config_68-rc6_g4--

Description: Binary data
Attachment:
dmesg_68-rc6_g4--

Description: Binary data