* Sven Schnelle <svens@xxxxxxxxxxxxx> [220515 16:02]:
> Liam Howlett <liam.howlett@xxxxxxxxxx> writes:
>
> > * Sven Schnelle <svens@xxxxxxxxxxxxx> [220513 10:46]:
> >> Starting today we're still seeing the same crash with linux-next from
> >> (next-20220513):
> >>
> >> [ 211.937897] CPU: 7 PID: 535 Comm: pt_upgrade Not tainted 5.18.0-rc6-11648-g76535d42eb53-dirty #732
> >> [ 211.937902] Unable to handle kernel pointer dereference in virtual kernel address space
> >> [ 211.937903] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >> [ 211.937906] Failing address: 0e00000000000000 TEID: 0e00000000000803
> >> [ 211.937909] Krnl PSW : 0704c00180000000 0000001ca52f06d6
> >> [ 211.937910] Fault in home space mode while using kernel ASCE.
> >> [ 211.937917] AS:0000001ca6e24007 R3:0000001fffff0007 S:0000001ffffef800 P:000000000000003d
> >> [ 211.937914]  (mmap_region+0x19e/0x848)
> >> [ 211.937929]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> >> [ 211.937939] Krnl GPRS: 0000000000000000 0e00000000000000 0000000000000000 0000000000000000
> >> [ 211.937942]            ffffffff00000f0f ffffffffffffffff 0e00000000000000 0000040000001000
> >> [ 211.937945]            0000000083551900 0000040000000000 00000000000000fb 000003800070fc58
> >> [ 211.937947]            000000008f490000 0000000000000000 0000001ca52f06b6 000003800070fb48
> >> [ 211.937959] Krnl Code: 0000001ca52f06c6: a7740021            brc     7,0000001ca52f0708
> >> [ 211.937959]            0000001ca52f06ca: ec6801b3007c        cgij    %r6,0,8,0000001ca52f0a30
> >> [ 211.937959]           #0000001ca52f06d0: e310f0f80004        lg      %r1,248(%r15)
> >> [ 211.937959]           >0000001ca52f06d6: e37010000020        cg      %r7,0(%r1)
> >> [ 211.937959]            0000001ca52f06dc: a78400ea            brc     8,0000001ca52f08b0
> >> [ 211.937959]            0000001ca52f06e0: e310f0f00004        lg      %r1,240(%r15)
> >> [ 211.937959]            0000001ca52f06e6: ec180008007c        cgij    %r1,0,8,0000001ca52f06f6
> >> [ 211.937959]            0000001ca52f06ec: e39010080020        cg      %r9,8(%r1)
> >> [ 211.937973] Call Trace:
> >> [ 211.937975]  [<0000001ca52f06d6>] mmap_region+0x19e/0x848
> >> [ 211.937978] ([<0000001ca52f06b6>] mmap_region+0x17e/0x848)
> >> [ 211.937981]  [<0000001ca52f116a>] do_mmap+0x3ea/0x4c8
> >> [ 211.937983]  [<0000001ca52bed12>] vm_mmap_pgoff+0xda/0x178
> >> [ 211.937987]  [<0000001ca52ed5ea>] ksys_mmap_pgoff+0x62/0x238
> >> [ 211.937989]  [<0000001ca52ed992>] __s390x_sys_old_mmap+0x7a/0xa0
> >> [ 211.937993]  [<0000001ca5c4ef5c>] __do_syscall+0x1d4/0x200
> >> [ 211.937999]  [<0000001ca5c5d572>] system_call+0x82/0xb0
> >> [ 211.938002] Last Breaking-Event-Address:
> >> [ 211.938003]  [<0000001ca5888616>] mas_prev+0xb6/0xc0
> >> [ 211.938010] Oops: 0038 ilc:3 [#2]
> >> [ 211.938011] Kernel panic - not syncing: Fatal exception: panic_on_oops
> >> [ 211.938012] SMP
> >> [ 211.938014] Modules linked in:
> >> 07: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 0000001C A50679A6
> >>
> >> Is that issue supposed to be fixed? git bisect pointed me to
> >>
> >> # bad: [76535d42eb53485775a8c54ea85725812b75543f] Merge branch
> >> 'mm-everything' of
> >> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >>
> >> which isn't really helpful.
> >>
> >> Anything we could help with debugging this?
> >
> > I tested the maple tree on top of the s390 as it was the same crash and
> > it was okay. I haven't tested the mm-everything branch though. Can you
> > test mm-unstable?
>
> Yes, I tested mm-unstable but wasn't able to reproduce the issue.
>
> > I'll continue setting up a sparc VM for testing here and test
> > mm-everything on that and the s390
>
> One thing that is different compared to x86 is that both sparc and s390
> are big endian. Not sure whether and where that would make a difference.
>
> The code to trigger the crash on s390 is rather simple: Just force a
> paging level upgrade to 5 levels by calling mmap() with an address that
> doesn't fit in 3 levels. Haven't tested whether an upgrade to 4 levels
> would be sufficient.
> I've condensed our test case that triggers this, and
> basically all that is required is:
>
> --------------------------------8<---------------------------------------
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/wait.h>
> #include <stdio.h>
>
> #define PAGE_SIZE 0x1000
> #define _REGION1_SIZE (1UL << 54)
>
> int main(int argc, char *argv[])
> {
>         int pid, status;
>         void *addr;
>
>         pid = fork();
>         if (pid == 0) {
>                 /*
>                  * Trigger page table level upgrade
>                  */
>                 addr = mmap((void *)_REGION1_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE,
>                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>                 if (addr == MAP_FAILED)
>                         return 1;
>                 *(int *)addr = 1;
>                 return 0;
>         }
>         wait(&status);
>         return 0;
> }
> --------------------------------8<---------------------------------------
>

I tried the above on my qemu s390 with kernel 5.18.0-rc6-next-20220513,
but it runs without issue, return code is 0.  Is there something the VM
needs to have for this to trigger?

> I've added a few debug statements to the maple tree code:
>
> [ 27.769641] mas_next_entry: offset=14
> [ 27.769642] mas_next_nentry: entry = 0e00000000000000, slots=0000000090249f80, mas->offset=15 count=14

Where exactly are you printing this?

>
> I see in mas_next_nentry() that there's a while loop that iterates over
> the (used?) slots until count is reached.

Yes, mas_next_nentry() looks for the next non-null entry in the current
node.

> After that loop mas_next_entry()
> just picks the next (unused?) entry, which is slot 15 in that case.

mas_next_entry() returns the next non-null entry.  If there isn't one
returned by mas_next_nentry(), then it will advance to the next node by
calling mas_next_node().  There are checks in there for detecting dead
nodes for RCU use and limit checking as well.
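
To make the intended behaviour described above concrete, here is a rough
sketch.  It is not the maple tree code: struct sketch_node, struct
sketch_state, sketch_next_nentry() and sketch_next_entry() are
hypothetical names, the node is a plain array with a "next" pointer, and
the dead-node and limit checks are left out.  It only shows the two-step
shape: scan the current node up to its last used slot, and move to the
next node if nothing is found.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SLOTS 16	/* hypothetical node width, illustration only */

/* Hypothetical, simplified leaf node; NOT struct maple_range_64. */
struct sketch_node {
	void *slot[SKETCH_SLOTS];
	unsigned char end;		/* index of the last slot in use */
	struct sketch_node *next;	/* hypothetical link to the next leaf */
};

/* Hypothetical cursor, loosely playing the role of a ma_state. */
struct sketch_state {
	struct sketch_node *node;
	int offset;			/* last slot returned, -1 for "none yet" */
};

/* Scan only the current node, from offset + 1 up to and including end. */
static void *sketch_next_nentry(struct sketch_state *s)
{
	assert(s->node != NULL && s->node->end < SKETCH_SLOTS);

	for (int i = s->offset + 1; i <= s->node->end; i++) {
		if (s->node->slot[i] != NULL) {
			s->offset = i;
			return s->node->slot[i];
		}
	}
	return NULL;	/* nothing left in this node */
}

/* Return the next non-NULL entry, advancing to later nodes when needed. */
static void *sketch_next_entry(struct sketch_state *s)
{
	while (s->node != NULL) {
		void *entry = sketch_next_nentry(s);

		if (entry != NULL)
			return entry;

		/* Current node exhausted: restart the scan in the next one. */
		s->node = s->node->next;
		s->offset = -1;
	}
	return NULL;	/* ran off the end of the (sketched) tree */
}
```

In this sketch a slot index above "end" is never read, so whatever is
stored in the tail of the slot array cannot be handed back as an entry.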
>
> What I noticed while scanning over include/linux/maple_tree.h is:
>
> struct maple_range_64 {
>         struct maple_pnode *parent;
>         unsigned long pivot[MAPLE_RANGE64_SLOTS - 1];
>         union {
>                 void __rcu *slot[MAPLE_RANGE64_SLOTS];
>                 struct {
>                         void __rcu *pad[MAPLE_RANGE64_SLOTS - 1];
>                         struct maple_metadata meta;
>                 };
>         };
> };
>
> and struct maple_metadata is:
>
> struct maple_metadata {
>         unsigned char end;
>         unsigned char gap;
> };
>
> If I swap the gap and end members 0x0e00000000000000 becomes
> 0x000e000000000000. And 0xe matches our mas->offset 14 above.
> So it looks like mas_next() in mmap_region returns the meta
> data for the node.

If this is the case, then I think any task that has more than 14 VMAs
would have issues.  I also use mas_next_entry() in mas_find() which is
used for the mas_for_each() macro/iterator.

Can you please enable CONFIG_DEBUG_VM_MAPLE_TREE ?  mmap.c tests the
tree after pretty much any change and will dump useful information if
there is an issue - including the entire tree.  See validate_mm_mt() for
details.

You can find CONFIG_DEBUG_VM_MAPLE_TREE in the config:
kernel hacking -> Memory debugging -> Debug VM -> Debug VM maple trees

>
> So from the lines above you likely already guessed that I have no clue
> how maple tree works, and I didn't have enough time today to read all
> the magic and understand it. But I thought I just drop my observation
> here in case someone has an idea.

Thanks for sharing.  I'm having a hard time recreating the issue so I
cannot fully dig in myself.

I was able to boot sparc64 with mm-unstable.  I did get an error:
[ 5.002625] Kernel unaligned access at TPC[59bae8] mmap_region+0x168/0xb00

faddr2line is less than useful though with reported line "at ??:?"

I'll keep digging into that.

Thanks,
Liam
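
For reference, the byte-order arithmetic behind the quoted
0x0e00000000000000 observation can be checked with a small user-space
sketch.  It assumes an 8-byte slot whose first two bytes hold "end" and
"gap" in the order shown in the quoted struct maple_metadata, and it
builds the big-endian value by hand so it gives the same answer on any
host.  struct sketch_metadata, load_be64() and metadata_as_slot_be() are
made-up names, not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same member order as the struct maple_metadata quoted in the thread. */
struct sketch_metadata {
	unsigned char end;
	unsigned char gap;
};

/* Interpret 8 raw bytes the way a big-endian CPU loads a 64-bit word. */
static uint64_t load_be64(const unsigned char bytes[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | bytes[i];
	return v;
}

/*
 * Value seen when a slot-sized word that really holds the metadata is
 * read as if it were a pointer (assumption: metadata starts at byte 0
 * of the slot and the remaining bytes are zero).
 */
static uint64_t metadata_as_slot_be(unsigned char end, unsigned char gap)
{
	unsigned char raw[8] = { 0 };
	struct sketch_metadata meta = { .end = end, .gap = gap };

	assert(sizeof(meta) <= sizeof(raw));
	memcpy(raw, &meta, sizeof(meta));
	return load_be64(raw);
}
```

With end = 14 and gap = 0, metadata_as_slot_be(14, 0) evaluates to
0x0e00000000000000, the failing address in the oops.  Placing the 14 in
the second byte instead, metadata_as_slot_be(0, 14), gives
0x000e000000000000, which is the effect of swapping the two members.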