Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sat, 8 Jul 2023 15:53:40 -0700

On Sat, 8 Jul 2023 at 15:36, Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds
> >
> > Again - maybe I messed up, but it really feels like the missing
> > vma_start_write() was more fundamental, and not some "TLB coherency"
> > issue.
>
> Sounds plausible. I'll try to use the reproducer to verify if that's
> indeed happening here.

I really don't think that's what people are reporting, I was just
trying to make up a completely different case that has nothing to do
with any TLB issues.

My real point was simply this one:

> It's likely there are multiple problematic
> scenarios due to this missing lock though.

Right. That's my issue. I felt your explanation was *too* targeted at
some TLB non-coherency thing, when I think the problem was actually a
much larger "page faults simply must not happen while we're copying
the page tables because data isn't coherent".

The anon_vma case was just meant as another random example of the
other kinds of things I suspect can go wrong, because we're simply not
able to do this whole "copy vma while it's being modified by page
faults".

Now, I agree that the PTE problem is real, and probably the main
thing, i.e. when we as part of fork() do this:

        /*
         * If it's a COW mapping, write protect it both
         * in the parent and the child
         */
        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }

and the thing that can go wrong before the TLB flush happens is that -
because the TLBs haven't been flushed yet - some threads in the
parent happily continue to write to the page and don't see the
wrprotect happening.

And then you get into the situation where *some* threads see the page
protections change (maybe they had a TLB flush event on that CPU for
random reasons), and they will take a page fault and do the COW thing
and create a new page.

And all the while *other* threads still see the old writeable TLB
state, and continue to write to the old page.

So now you have a page that gets its data copied *while* somebody is
still writing to it, and the end result is that some write easily gets
lost, and so when that new copy is installed, you see it as data
corruption.
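
To make that interleaving concrete, here is a minimal userspace sketch.
It is not kernel code: the two "threads" are just sequential steps in
one function, and MODEL_PAGE_SIZE and model_cow_loses_write() are
made-up names for illustration. It only models a COW copy that is
interrupted by a write through a stale writable mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny stand-in for a page; the real size does not matter here. */
#define MODEL_PAGE_SIZE 8

/*
 * Hypothetical model of the race. Returns 1 if the stale writer's
 * store is missing from the new page, 0 if it happened to be copied.
 */
int model_cow_loses_write(size_t stale_write_off)
{
        char old_page[MODEL_PAGE_SIZE];
        char new_page[MODEL_PAGE_SIZE];
        const size_t half = MODEL_PAGE_SIZE / 2;

        assert(stale_write_off < MODEL_PAGE_SIZE);
        memset(old_page, 'a', sizeof(old_page));

        /* Thread A saw the wrprotect, faulted, and starts the COW copy. */
        memcpy(new_page, old_page, half);

        /* Thread B still has a stale writable TLB entry: no fault. */
        old_page[stale_write_off] = 'B';

        /* Thread A finishes the copy; new_page would now be installed. */
        memcpy(new_page + half, old_page + half, MODEL_PAGE_SIZE - half);

        return new_page[stale_write_off] != 'B';
}
```

In this model a stale write to the half that was already copied is
lost, while a write to the not-yet-copied half survives purely by luck
of timing.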

And I agree completely that that is probably the thing that most
people actually saw and reacted to as corruption.

But the reason I didn't like the explanation was that I think this is
just one random example of the more fundamental issue of "we simply
must not take page faults while copying".
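
As a rough illustration of that rule, the sketch below models "mark
every parent VMA as locked before copying, and have the fault path
refuse to touch a locked VMA". It is a userspace toy under stated
assumptions: struct model_vma, model_handle_fault() and
model_fork_copy() are hypothetical names, not the kernel's API, and the
"concurrent" fault is simulated by a plain call in the middle of the
copy loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a VMA; none of this is a kernel structure. */
struct model_vma {
        bool write_locked;   /* set while fork is copying this VMA */
        int  faults_handled; /* faults that were allowed to modify it */
};

enum fault_result { FAULT_HANDLED, FAULT_RETRY };

/* Fault path: backs off instead of modifying a VMA that fork holds. */
enum fault_result model_handle_fault(struct model_vma *vma)
{
        if (vma->write_locked)
                return FAULT_RETRY;
        vma->faults_handled++;
        return FAULT_HANDLED;
}

/*
 * Fork model: lock all parent VMAs, copy, then unlock.
 * Returns how many faults slipped in while the copy was running.
 */
int model_fork_copy(struct model_vma *vmas, size_t n)
{
        int slipped = 0;
        size_t i;

        for (i = 0; i < n; i++)
                vmas[i].write_locked = true;

        for (i = 0; i < n; i++) {
                /* Simulated fault arriving in the middle of the copy. */
                if (model_handle_fault(&vmas[i]) == FAULT_HANDLED)
                        slipped++;
                /* ... the page tables for vmas[i] would be copied here ... */
        }

        for (i = 0; i < n; i++)
                vmas[i].write_locked = false;

        return slipped;
}
```

With the lock taken up front, no fault modifies a VMA during the copy;
once the copy is done, faults are handled normally again. The point of
the model is that the exclusion covers everything a fault might touch,
not just the PTE write-protect window.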

Your explanation made me think "stale TLB is the problem", and *that*
was what I objected to. The stale TLB was just one random sign of the
much larger problem.

It might even have been the most common symptom, but I think it was
just a *symptom*, not the *cause* of the problem.

And I must have been bad at explaining that, because David Hildenbrand
also reacted negatively to my change.

So I'll happily take a patch that adds more commentary about this, and
gives several examples of the things that go wrong.

                Linus