On 04.07.23 09:34, Suren Baghdasaryan wrote:
On Tue, Jul 4, 2023 at 12:18 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 04.07.23 08:50, Suren Baghdasaryan wrote:
On Mon, Jul 3, 2023 at 10:39 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
On Mon, Jul 3, 2023 at 8:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 03.07.23 20:21, Suren Baghdasaryan wrote:
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86.
Disable per-VMA locks config to prevent this issue while the problem is
being investigated. This is expected to be a temporary measure.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@xxxxxxxxxx
Reported-by: Jiri Slaby <jirislaby@xxxxxxxxxx>
Reported-by: Jacob Young <jacobly.alt@xxxxxxxxx>
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
---
mm/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..de94b2497600 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
	def_bool n

 config PER_VMA_LOCK
-	def_bool y
+	bool "Enable per-vma locking during page fault handling."
 	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
 	help
 	  Allow per-vma locking during page fault handling.
As raised at LSF/MM, I was "surprised" that we can now handle page faults
concurrently with fork() and was expecting something to be broken already.

What probably happens is that we write-protect the page in the parent process and
COW-share the anon page with the child in copy_present_pte(), but we only flush
the parent MM TLB right before we drop the parent mmap lock at the end of
dup_mmap().

If we get a write fault in the parent before that TLB flush, and we end up
replacing that anon page in the parent process in do_wp_page() [because it is
COW-shared with the child], this might be problematic: stale writable TLB
entries can still target the wrong (old) page.
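Roughly, the problematic interleaving is (spelling the above out, simplified):

	parent thread A (fork -> dup_mmap)        parent thread B (takes a fault)
	----------------------------------        -------------------------------
	copy_present_pte():
	  write-protect the parent PTE,
	  COW-share the anon page with the child
	                                          write fault on that page
	                                          do_wp_page() replaces the anon
	                                            page in the parent (it is
	                                            COW-shared with the child)
	                                          ... but other CPUs may still
	                                            hold stale writable TLB
	                                            entries for the old page
	flush_tlb_mm(oldmm) happens only here,
	at the end of dup_mmap(), right before
	mmap_write_unlock(oldmm)

Writes going through such a stale TLB entry land in the old page, which the
parent no longer maps, and that shows up as memory corruption.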
Hi David,
Thanks for the detailed explanation. Let me check if this is indeed
what's happening here. If that is the cause, I think we can write-lock
the VMAs being dup'ed and keep them locked until the TLB is flushed;
mmap_write_unlock(oldmm) will then unlock them all and let page faults
proceed. If that works, we will at least know the reason for the memory
corruption.
Yep, locking the VMAs being copied inside dup_mmap() seems to fix the issue:
	for_each_vma(old_vmi, mpnt) {
		struct file *file;

+		vma_start_write(mpnt);
		if (mpnt->vm_flags & VM_DONTCOPY) {
			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
			continue;
		}
At least the reproducer at
https://bugzilla.kernel.org/show_bug.cgi?id=217624 no longer triggers the
memory corruption. But I wonder if that's the best way to fix this. It is
surely simple, but locking every VMA is not free, and doing that on every
fork might regress performance.
That would mean that we can possibly still get page faults concurrent with
fork(), on the VMAs that have not been processed yet. While that fixes the
issue at hand, I cannot reliably tell whether it might still mess with some
other fork() corner case.
I'd suggest write-locking all VMAs upfront, before doing any kind of
fork-mm operation. Just like the old code did. See below.
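Very roughly, something along these lines (just a sketch of the idea against
the current dup_mmap() loop, untested, and the iterator rewind would need
double-checking):

	/* Sketch: write-lock every parent VMA before copying anything at all. */
	for_each_vma(old_vmi, mpnt)
		vma_start_write(mpnt);
	/* Rewind the iterator so the existing copy loop starts from the top. */
	vma_iter_init(&old_vmi, oldmm, 0);

	for_each_vma(old_vmi, mpnt) {
		struct file *file;
		...

That way every parent VMA is already write-locked before the first PTE gets
copied, so per-VMA-lock page faults cannot run concurrently with any part of
the copy; they are only allowed again once mmap_write_unlock(oldmm) drops all
the VMA locks, after the TLB flush.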
Maybe we could get away with not locking VM_MAYSHARE or VM_DONTCOPY VMAs.
Possibly also when there are no other threads.
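For instance (again, untested and just to illustrate the possible filter):

	/* Hypothetical: only write-lock VMAs whose anon pages can end up COW-shared. */
	if (!(mpnt->vm_flags & (VM_MAYSHARE | VM_DONTCOPY)))
		vma_start_write(mpnt);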
But at least to me it feels safer to defer any such optimizations and see
whether they are really required.

If there are no other threads, at least there will be no contention on the
VMA locks. And if there are other threads, we used to have contention on the
mmap lock already.
--
Cheers,
David / dhildenb