On Sat, Jul 11, 2020 at 10:27 AM Naresh Kamboju <naresh.kamboju@xxxxxxxxxx> wrote:
>
> I have started bisecting this problem and found the first bad commit

Thanks for the effort. Bisection is often a great tool to figure out
what's wrong. Sadly, in this case:

> commit 9f132f7e145506efc0744426cb338b18a54afc3b
> Author: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> Date:   Thu Jan 3 15:28:41 2019 -0800
>
>     mm: select HAVE_MOVE_PMD on x86 for faster mremap

Yeah, that's just the commit that enables the code, not the commit
that introduces the fundamental problem.

That said, this is a prime example of why I absolutely detest patch
series that do this kind of thing: several patches that create new
functionality, followed by one patch to enable it.

If you can't get things working incrementally, maybe you shouldn't do
them at all. Doing a big series of "hidden work" and then enabling it
later is wrong.

In this case, though, the real patch that did the code isn't that kind
of "big series of hidden work" patch series, it's just the (single)
previous commit 2c91bd4a4e2e ("mm: speed up mremap by 20x on large
regions").

So your bisection is useful, it's just that it really points to that
previous commit, which is where this code was introduced.

It's also worth noting that that commit doesn't really *break*
anything, since it just falls back to the old behavior when it warns.
So to "fix" your test case, we could just remove the WARN_ON.

But the WARN_ON() is still worrisome, because the thing it tests for
really _should_ be true.

Now, we actually have a known bug in this area that is talked about
elsewhere: the way munmap does the pgtable_free() is

        /* Detach vmas from rbtree */
        detach_vmas_to_be_unmapped(mm, vma, prev, end);

        if (downgrade)
                mmap_write_downgrade(mm);

        unmap_region(mm, vma, prev, start, end);

(and unmap_region() is what does the pgtable_free() that should have
cleared the PMD).
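[ Side note for anyone following along: the "falls back to the old
behavior when it warns" structure is just the usual fast-path/slow-path
pattern. Here's a small userspace analogy of that shape - this is NOT
the kernel code, and the function names are made up for illustration:
the real check is the WARN_ON(!pmd_none(*new_pmd)) in move_normal_pmd(),
after which move_page_tables() moves entries page by page. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Analogue of move_normal_pmd(): attempt the bulk move, but refuse
 * (returning false, like the kernel path does after its WARN_ON)
 * if the destination is unexpectedly populated. */
static bool try_bulk_move(int *dst, const int *src, size_t n)
{
        for (size_t i = 0; i < n; i++)
                if (dst[i] != 0)
                        return false;   /* caller must fall back */
        memcpy(dst, src, n * sizeof(*dst));
        return true;
}

/* Analogue of move_page_tables(): fast path first, then the
 * entry-by-entry slow path if the fast path bailed out. */
static void move_entries(int *dst, const int *src, size_t n)
{
        if (try_bulk_move(dst, src, n))
                return;
        for (size_t i = 0; i < n; i++)
                dst[i] = src[i];
}
```

Either way the entries end up moved, which is why the warning is
harmless for correctness; it only signals that the invariant the fast
path relies on was violated.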
And the problem with the "downgrade" is that another thread can change
the beginning of the next vma when it's a grow-down region (or the end
of the previous one if it's a grow-up one). See commit dd2283f2605e
("mm: mmap: zap pages with read mmap_sem in munmap") for the source of
that downgrade.

But that requires an _actual_ munmap() system call (the other callers
set "downgrade" to false - I'm not entirely sure why), and it requires
another thread to be attached to that VM in order to actually do that
racy concurrent stack size change.

And neither of those two cases will be true for the execve() path.
It's a new VM, with just one thread attached, so no threaded munmap()
going on there.

The fact that it seems to happen with

  https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/thp/thp01.c

makes me think it's somehow related to THP mappings, but I don't see
why those would matter. All the same pmd freeing should still have
happened, afaik.

And the printout I asked for a few days back for when it triggered
clearly showed a normal non-huge pmd ("val: 7d530067" is just
"accessed, dirty, rw, user and present", which is a perfectly normal
page directory entry for 4kB pages, so we could move the whole thing
and move 2MB (or 4MB) of aligned virtual memory in one go).

Some race with THP splitting and pgtable_free()? I can't see how
anything would race in execve(), or how anything would have touched
that address below the stack in the first place..

Kirill, Oleg, any reaction to this? Neither of you were on the
original email, I think; it's this one:

  https://lore.kernel.org/lkml/CA+G9fYt+6OeibZMD0fP=O3nqFbcN3O4xcLkjq0mpQbZJ2uxB9w@xxxxxxxxxxxxxx/

and I really think it is harmless, in that when the warning triggers
we just go back to the page-by-page code. But while I think the
WARN_ON() should probably be downgraded to a WARN_ON_ONCE(), I do
think it's showing _something_. I just can't see how this would
trigger for execve().
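[ For reference, that "val: 7d530067" decoding can be checked
mechanically against the standard x86 page-table entry flag bits. The
macro names below are mine (the kernel's are in
arch/x86/include/asm/pgtable_types.h); the bit positions are the
architectural ones from the x86 paging format: ]

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86 page-table/page-directory entry. */
#define X86_PG_PRESENT  (1ULL << 0)
#define X86_PG_RW       (1ULL << 1)
#define X86_PG_USER     (1ULL << 2)
#define X86_PG_ACCESSED (1ULL << 5)
#define X86_PG_DIRTY    (1ULL << 6)
#define X86_PG_PSE      (1ULL << 7)    /* set on a huge-page PMD */

/* A "normal" PMD for 4kB pages: present, and not a huge (PSE)
 * mapping - i.e. it points to a page table, so the whole aligned
 * 2MB range could be moved by moving this single entry. */
static bool is_normal_4k_pmd(uint64_t val)
{
        return (val & X86_PG_PRESENT) && !(val & X86_PG_PSE);
}
```

Plugging in 0x7d530067: the low byte 0x67 has present, rw, user,
accessed and dirty set, and the PSE bit clear - exactly the
"accessed, dirty, rw, user and present", non-huge entry described
above.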
That's just about the _simplest_ case for us: single-threaded, with
mappings set up purely by load_elf_binary() etc.

I'm clearly missing something.

               Linus