On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <namit@xxxxxxxxxx> wrote:
>
> I understand the discussion mainly revolves around correctness, which is
> obviously the most important property, but I would like to mention
> that having transient get_page() calls causing unnecessary COWs can
> cause hard-to-analyze and hard-to-avoid performance degradation.

Note that the COW itself is pretty cheap. Yes, there's the page
allocation and copy, but it's mostly a local thing. So that falls under
the "good to avoid" heading, but in the end it's not an immense deal.

In contrast, the page lock has been an actual big user-visible latency
issue, to the point of correctness. A couple of years ago, we literally
had NMI watchdog timeouts due to the page wait-queues growing basically
boundlessly. This was some customer internal benchmark code that I never
saw, so it wasn't *quite* clear exactly what was going on, but we ended
up having to split up the page wait list traversal using bookmark
entries, because it was such a huge latency issue.

That was mostly NUMA balancing faults, I think, but the point I'm making
is that avoiding the page lock can be a *much* bigger deal than avoiding
some local allocation and copying of a page of data. There are real
loads where the page lock gets insanely bad, and I think it's because we
use it much too much.

See commit 2554db916586 ("sched/wait: Break up long wake list walk") for
some of that saga.

So I really think that having to serialize with the page lock in order
to do some "exact page use counting" is a false economy. Yes, maybe
you'd be able to avoid a COW or two, but at what locking cost?

                Linus
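[Editor's note: the "bookmark" trick referenced above can be sketched in
user space. This is a simplified illustration of the idea behind commit
2554db916586, not the kernel's actual implementation; all names
(`struct waiter`, `wake_all`, `WAKE_BATCH`) are invented for the sketch.
The point is that the walker never holds the list lock for more than a
bounded number of entries: it parks a bookmark node in the list, drops
the lock so other CPUs can make progress, then resumes from the
bookmark.]

```c
/* Simplified, user-space sketch of bounding a long wake-list walk with
 * a bookmark entry.  Hypothetical names; the real kernel code lives in
 * __wake_up_common() and friends.  The lock/unlock points are shown as
 * comments since this sketch is single-threaded. */
#include <assert.h>
#include <stddef.h>

struct waiter {
    int is_bookmark;          /* bookmark entries are skipped by walkers */
    int woken;
    struct waiter *next;
};

enum { WAKE_BATCH = 64 };     /* max entries processed per lock hold */

/* Wake every real waiter on the list, but never "hold the lock" for
 * more than WAKE_BATCH entries at a time.  A bookmark node records
 * where to resume after the lock is re-taken. */
static int wake_all(struct waiter **head)
{
    struct waiter bookmark = { .is_bookmark = 1, .woken = 0, .next = NULL };
    struct waiter **pos = head;
    int woken = 0, batch = 0;

    /* lock(list) */
    while (*pos) {
        struct waiter *w = *pos;
        if (w->is_bookmark) {         /* another walker's bookmark */
            pos = &w->next;
            continue;
        }
        w->woken = 1;
        woken++;
        *pos = w->next;               /* unlink the woken waiter */
        if (++batch == WAKE_BATCH && *pos) {
            /* Park a bookmark at the current position and drop the
             * lock, so a boundless wait list can't keep us (and the
             * NMI watchdog) waiting for the whole traversal. */
            bookmark.next = *pos;
            *pos = &bookmark;
            /* unlock(list); ... others run ...; lock(list) */
            *pos = bookmark.next;     /* remove bookmark, resume here */
            batch = 0;
        }
    }
    /* unlock(list) */
    return woken;
}
```

With real locking, the unlock/relock window between batches is what
caps the per-hold latency: waiters added or removed concurrently are
handled on the next batch, and the bookmark keeps the walker's place.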