Re: [PATCH] mapletree-vs-khugepaged

Liam Howlett <liam.howlett@xxxxxxxxxx> · Mon, 16 May 2022 14:02:09 +0000

* Sven Schnelle <svens@xxxxxxxxxxxxx> [220515 16:02]:
> Liam Howlett <liam.howlett@xxxxxxxxxx> writes:
> 
> > * Sven Schnelle <svens@xxxxxxxxxxxxx> [220513 10:46]:
> >> Starting today we're still seeing the same crash with linux-next from
> >> (next-20220513):
> >>
> >> [  211.937897] CPU: 7 PID: 535 Comm: pt_upgrade Not tainted 5.18.0-rc6-11648-g76535d42eb53-dirty #732
> >> [  211.937902] Unable to handle kernel pointer dereference in virtual kernel address space
> >> [  211.937903] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >> [  211.937906] Failing address: 0e00000000000000 TEID: 0e00000000000803
> >> [  211.937909] Krnl PSW : 0704c00180000000 0000001ca52f06d6
> >> [  211.937910] Fault in home space mode while using kernel ASCE.
> >> [  211.937917] AS:0000001ca6e24007 R3:0000001fffff0007 S:0000001ffffef800 P:000000000000003d
> >> [  211.937914]  (mmap_region+0x19e/0x848)
> >> [  211.937929]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> >> [  211.937939] Krnl GPRS: 0000000000000000 0e00000000000000 0000000000000000 0000000000000000
> >> [  211.937942]            ffffffff00000f0f ffffffffffffffff 0e00000000000000 0000040000001000
> >> [  211.937945]            0000000083551900 0000040000000000 00000000000000fb 000003800070fc58
> >> [  211.937947]            000000008f490000 0000000000000000 0000001ca52f06b6 000003800070fb48
> >> [  211.937959] Krnl Code: 0000001ca52f06c6: a7740021            brc     7,0000001ca52f0708
> >> [  211.937959]            0000001ca52f06ca: ec6801b3007c        cgij    %r6,0,8,0000001ca52f0a30
> >> [  211.937959]           #0000001ca52f06d0: e310f0f80004        lg      %r1,248(%r15)
> >> [  211.937959]           >0000001ca52f06d6: e37010000020        cg      %r7,0(%r1)
> >> [  211.937959]            0000001ca52f06dc: a78400ea            brc     8,0000001ca52f08b0
> >> [  211.937959]            0000001ca52f06e0: e310f0f00004        lg      %r1,240(%r15)
> >> [  211.937959]            0000001ca52f06e6: ec180008007c        cgij    %r1,0,8,0000001ca52f06f6
> >> [  211.937959]            0000001ca52f06ec: e39010080020        cg      %r9,8(%r1)
> >> [  211.937973] Call Trace:
> >> [  211.937975]  [<0000001ca52f06d6>] mmap_region+0x19e/0x848
> >> [  211.937978] ([<0000001ca52f06b6>] mmap_region+0x17e/0x848)
> >> [  211.937981]  [<0000001ca52f116a>] do_mmap+0x3ea/0x4c8
> >> [  211.937983]  [<0000001ca52bed12>] vm_mmap_pgoff+0xda/0x178
> >> [  211.937987]  [<0000001ca52ed5ea>] ksys_mmap_pgoff+0x62/0x238
> >> [  211.937989]  [<0000001ca52ed992>] __s390x_sys_old_mmap+0x7a/0xa0
> >> [  211.937993]  [<0000001ca5c4ef5c>] __do_syscall+0x1d4/0x200
> >> [  211.937999]  [<0000001ca5c5d572>] system_call+0x82/0xb0
> >> [  211.938002] Last Breaking-Event-Address:
> >> [  211.938003]  [<0000001ca5888616>] mas_prev+0xb6/0xc0
> >> [  211.938010] Oops: 0038 ilc:3 [#2]
> >> [  211.938011] Kernel panic - not syncing: Fatal exception: panic_on_oops
> >> [  211.938012] SMP
> >> [  211.938014] Modules linked in:
> >> 07: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 0000001C A50679A6
> >>
> >> Is that issue supposed to be fixed? git bisect pointed me to
> >>
> >> # bad: [76535d42eb53485775a8c54ea85725812b75543f] Merge branch
> >>   'mm-everything' of
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >>
> >> which isn't really helpful.
> >>
> >> Anything we could help with debugging this?
> >
> > I tested the maple tree on top of the s390 as it was the same crash and
> > it was okay.  I haven't tested the mm-everything branch though.  Can you
> > test mm-unstable?
> 
> Yes, I tested mm-unstable but wasn't able to reproduce the issue.
> 
> > I'll continue setting up a sparc VM for testing here and test
> > mm-everything on that and the s390
> 
> One thing that is different compared to x86 is that both sparc and s390
> are big endian. Not sure whether and where that would make a difference.
> 
> The code to trigger the crash on s390 is rather simple: Just force a
> paging level upgrade to 5 levels by calling mmap() with an address that
> doesn't fit in 3 levels. Haven't tested whether an upgrade to 4 levels
> would be sufficient. I've condensed our test case that triggers this, and
> basically all that is required is:
> 
> --------------------------------8<---------------------------------------
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/wait.h>
> #include <stdio.h>
> 
> #define PAGE_SIZE       0x1000
> #define _REGION1_SIZE   (1UL << 54)
> 
> int main(int argc, char *argv[])
> {
>         int pid, status;
>         void *addr;
> 
>         pid = fork();
>         if (pid == 0) {
>                 /*
>                  * Trigger page table level upgrade
>                  */
>                 addr = mmap((void *)_REGION1_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE,
>                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>                 if (addr == MAP_FAILED)
>                         return 1;
>                 *(int *)addr = 1;
>                 return 0;
>         }
>         wait(&status);
>         return 0;
> }
> --------------------------------8<---------------------------------------
> 

I tried the above on my qemu s390 with kernel 5.18.0-rc6-next-20220513,
but it runs without issue, return code is 0.  Is there something the VM
needs to have for this to trigger?

> I've added a few debug statements to the maple tree code:
> 
> [   27.769641] mas_next_entry: offset=14
> [   27.769642] mas_next_nentry: entry = 0e00000000000000, slots=0000000090249f80, mas->offset=15 count=14

Where exactly are you printing this?

> 
> I see in mas_next_nentry() that there's a while that iterates over the
> (used?) slots until count is reached.

Yes, mas_next_nentry() looks for the next non-null entry in the current
node.

> After that loop mas_next_entry()
> just picks the next (unused?) entry, which is slot 15 in that case.

mas_next_entry() returns the next non-null entry.  If there isn't one
returned by mas_next_nentry(), then it will advance to the next node by
calling mas_next_node().  There are checks in there for detecting dead
nodes for RCU use and limit checking as well.
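
To make the intended behaviour concrete, here is a small userspace toy
model of that walk.  It is only a sketch: struct toy_node, struct
toy_state and the toy_*() helpers are made up for illustration and are
not the kernel's maple tree code.  It assumes each node records the
index of its last used slot and that anything past that index must
never be returned as an entry.

```c
#include <stddef.h>

#define TOY_SLOTS 16

/* Hypothetical toy node, NOT the kernel's struct maple_node. */
struct toy_node {
	void *slot[TOY_SLOTS];
	unsigned char end;	/* index of the last slot in use */
	struct toy_node *next;	/* stands in for mas_next_node() */
};

/* Hypothetical walk state, loosely modelled on a ma_state. */
struct toy_state {
	struct toy_node *node;
	int offset;		/* slot returned last, -1 for "none yet" */
};

/* Next non-NULL entry in the current node, never looking past ->end. */
static void *toy_next_nentry(struct toy_state *st)
{
	int i;

	for (i = st->offset + 1; i <= st->node->end; i++) {
		if (st->node->slot[i]) {
			st->offset = i;
			return st->node->slot[i];
		}
	}
	return NULL;
}

/* Next non-NULL entry overall, moving to the next node when needed. */
static void *toy_next_entry(struct toy_state *st)
{
	void *entry;

	while (st->node) {
		entry = toy_next_nentry(st);
		if (entry)
			return entry;
		/* Nothing left in this node: restart before slot 0 of the next. */
		st->node = st->node->next;
		st->offset = -1;
	}
	return NULL;
}
```

In this model a node with end = 14 and some non-NULL bytes sitting in
slot 15 never hands slot 15 back; the walk goes on to the next node
instead.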

> 
> What I noticed while scanning over include/linux/maple_tree.h is:
> 
> struct maple_range_64 {
> 	struct maple_pnode *parent;
> 	unsigned long pivot[MAPLE_RANGE64_SLOTS - 1];
> 	union {
> 		void __rcu *slot[MAPLE_RANGE64_SLOTS];
> 		struct {
> 			void __rcu *pad[MAPLE_RANGE64_SLOTS - 1];
> 			struct maple_metadata meta;
> 		};
> 	};
> };
> 
> and struct maple_metadata is:
> 
> struct maple_metadata {
> 	unsigned char end;
> 	unsigned char gap;
> };
> 
> If I swap the gap and end members 0x0e00000000000000 becomes
> 0x000e000000000000. And 0xe matches our mas->offset 14 above.
> So it looks like mas_next() in mmap_region returns the meta
> data for the node.
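
The byte arithmetic behind that observation can be checked in
userspace.  The sketch below is an illustration only: struct
toy_metadata copies the two-byte layout quoted above, while
load_be64() and meta_as_slot_be() are hypothetical helpers that model
how a big-endian machine would read those bytes back as one 64-bit
slot value.

```c
#include <stdint.h>
#include <string.h>

/* Same two single-byte fields as the struct quoted above. */
struct toy_metadata {
	unsigned char end;
	unsigned char gap;
};

/* Read 8 raw bytes the way a big-endian CPU loads a 64-bit word. */
static uint64_t load_be64(const unsigned char b[8])
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | b[i];
	return v;
}

/*
 * Place the metadata at the start of an otherwise zeroed slot-sized
 * word and return what a big-endian load of that word would see.
 */
static uint64_t meta_as_slot_be(unsigned char end, unsigned char gap)
{
	unsigned char raw[8] = { 0 };
	struct toy_metadata m = { .end = end, .gap = gap };

	memcpy(raw, &m, sizeof(m));
	return load_be64(raw);
}
```

With end = 14 and gap = 0, meta_as_slot_be() gives
0x0e00000000000000, the failing address in the oops.  Passing the
values the other way round (which is what swapping the two members
amounts to) gives 0x000e000000000000.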

If this is the case, then I think any task that has more than 14 VMAs
would have issues.  I also use mas_next_entry() in mas_find() which is
used for the mas_for_each() macro/iterator.  Can you please enable
CONFIG_DEBUG_VM_MAPLE_TREE ?  mmap.c tests the tree after pretty much
any change and will dump useful information if there is an issue -
including the entire tree. See validate_mm_mt() for details.

You can find CONFIG_DEBUG_VM_MAPLE_TREE in the config:
kernel hacking -> Memory debugging -> Debug VM -> Debug VM maple trees

> 
> So from the lines above you likely already guessed that I have no clue
> how the maple tree works, and I didn't have enough time today to read all
> the magic and understand it. But I thought I'd just drop my observation
> here in case someone has an idea.

Thanks for sharing.  I'm having a hard time recreating the issue so I
cannot fully dig in myself.

I was able to boot sparc64 with mm-unstable.  I did get an error:
[    5.002625] Kernel unaligned access at TPC[59bae8]
mmap_region+0x168/0xb00

faddr2line is not much help though; it reports the line as "at ??:?".

I'll keep digging into that.

Thanks,
Liam