On Fri, 10 Jul 2020 at 10:55, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Jul 9, 2020 at 9:29 PM Naresh Kamboju
> <naresh.kamboju@xxxxxxxxxx> wrote:
> >
> > Your patch applied and re-tested.
> > warning triggered 10 times.
> >
> > old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
>
> Hmm.. It's not even the overlapping case, it's literally just "move
> exactly 2MB of page tables exactly one pmd down". Which should be the
> nice efficient case where we can do it without modifying the lower
> page tables at all, we just move the PMD entry.
>
> There shouldn't be anything in the new address space from bfa00000-bfdfffff.
>
> That PMD value obviously says differently, but it looks like a nice
> normal PMD value, nothing bad there.
>
> I'm starting to think that the issue might be that this is because the
> stack segment is special. Not only does it have the growsdown flag,
> but that whole thing has the magic guard page logic.
>
> So I wonder if we have installed a guard page _just_ below the old
> stack, so that we have populated that pmd because of that.
>
> We used to have an _actual_ guard page and then play nasty games with
> vm_start logic. We've gotten rid of that, though, and now we have that
> "stack_guard_gap" logic that _should_ mean that vm_start is always
> exact and proper (and that pgtables_free() should have emptied it, but
> maybe we have some case we forgot about).
>
> > [  741.511684] WARNING: CPU: 1 PID: 15173 at mm/mremap.c:211 move_page_tables.cold+0x0/0x2b
> > [  741.598159] Call Trace:
> > [  741.600694]  setup_arg_pages+0x22b/0x310
> > [  741.621687]  load_elf_binary+0x31e/0x10f0
> > [  741.633839]  __do_execve_file+0x5a8/0xbf0
> > [  741.637893]  __ia32_sys_execve+0x2a/0x40
> > [  741.641875]  do_syscall_32_irqs_on+0x3d/0x2c0
> > [  741.657660]  do_fast_syscall_32+0x60/0xf0
> > [  741.661691]  do_SYSENTER_32+0x15/0x20
> > [  741.665373]  entry_SYSENTER_32+0x9f/0xf2
> > [  741.734151] old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
>
> Nothing looks bad, and the ELF loading phase memory map should be
> really quite simple.
>
> The only half-way unusual thing is that you have basically exactly 2MB
> of stack at execve time (easy enough to tune by just setting argv/env
> right), and it's moved down by exactly 2MB.
>
> And that latter thing is just due to randomization, see
> arch_align_stack() in arch/x86/kernel/process.c.
>
> So that would explain why it doesn't happen every time.
>
> What happens if you apply the attached patch to *always* force the 2MB
> shift (rather than moving the stack by a random amount), and then run
> the other program (t.c -> compiled to "a.out").

I applied your patch and started the test in a loop for a million
iterations, but it only ran 35 times: the test job timed out after
1 hour.

Kernel messages printed while testing a.out:

  a.out (480) used greatest stack depth: 4872 bytes left

On another device:

  kworker/dying (172) used greatest stack depth: 5044 bytes left

I am re-running the test with a longer timeout of 4 hours and will
share the findings.

ref: https://lkft.validation.linaro.org/scheduler/job/1555132#L1515

- Naresh