Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

Nadav Amit <namit@xxxxxxxxxx> · Sat, 18 Dec 2021 04:52:13 +0000

> On Dec 17, 2021, at 8:02 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> 
> On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <namit@xxxxxxxxxx> wrote:
>> 
>> I understand the discussion mainly revolves around correctness, which is
>> obviously the most important property, but I would like to mention
>> that having transient get_page() calls causing unnecessary COWs can
>> cause hard-to-analyze and hard-to-avoid performance degradation.
> 
> Note that the COW itself is pretty cheap. Yes, there's the page
> allocation and copy, but it's mostly a local thing.

I don’t know about the page-lock overhead, but I understand your argument.

Having said that, I do know a bit about TLB flushes, which you did not
mention as an overhead of COW. Such flushes can be quite expensive on
multithreaded workloads (specifically on VMs, but let's put those aside).

Take for instance memcached and assume you overcommit memory with a very fast
swap (e.g., pmem, zram, perhaps even something slower). Now, it turns out
memcached often accesses a page first for read and shortly after for write.
In a similar scenario, I found that the page reference that lru_cache_add()
takes during the first fault-in event (for read) causes a COW on a write
page fault that happens shortly after [1]. So on memcached I assume this
would also trigger frequent unnecessary COWs.
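
To make the sequence concrete, here is a toy userspace model, not kernel
code. It assumes the write-fault path reuses a page only when it holds the
sole reference. struct toy_page and all three helpers are hypothetical names
made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page; only the reference count is modeled. */
struct toy_page {
	int refcount;
};

/*
 * Read fault-in: the mapping takes one reference, and the LRU-add batch
 * (the role lru_cache_add() plays in the text) transiently holds another.
 */
static void toy_read_faultin(struct toy_page *page)
{
	page->refcount = 1;	/* reference held by the mapping */
	page->refcount++;	/* transient reference from LRU batching */
}

/* Draining the batch drops the transient reference. */
static void toy_lru_drain(struct toy_page *page)
{
	assert(page->refcount > 1);
	page->refcount--;
}

/* Write fault: assumed policy is "copy unless we hold the only reference". */
static bool toy_write_fault_copies(const struct toy_page *page)
{
	return page->refcount != 1;
}
```

In this model, a write fault that arrives before the drain copies the page
even though nobody else will ever use the old one; after the drain, the same
write fault would reuse the page in place.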

Besides page allocation and copy, COW would then require a TLB flush, which,
when performed locally, might not be too bad (~200 cycles). But if memcached
has many threads, as it usually does, then you need a TLB shootdown and this
one can be expensive (microseconds). If you start getting a TLB shootdown
storm, you may avoid some IPIs since you see that other CPUs already queued
IPIs for the target CPU. But then the kernel would flush the entire TLB on
the target CPU, as it realizes that multiple TLB flushes were queued,
and as it assumes that a full TLB flush would be cheaper.
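
The fallback to a full flush can be sketched as follows. This is a
simplified model that only loosely follows a generation-counter scheme. It
is not the actual kernel code, and toy_pick_flush() and its parameters are
hypothetical names.

```c
#include <assert.h>

enum toy_flush { TOY_FLUSH_NONE, TOY_FLUSH_RANGE, TOY_FLUSH_FULL };

/*
 * cpu_gen: last flush generation this CPU has processed.
 * mm_gen:  latest generation requested for the address space.
 * req_gen: generation carried by the request being handled.
 */
static enum toy_flush toy_pick_flush(unsigned long cpu_gen,
				     unsigned long mm_gen,
				     unsigned long req_gen)
{
	assert(cpu_gen <= mm_gen && req_gen <= mm_gen);

	if (cpu_gen == mm_gen)
		return TOY_FLUSH_NONE;	/* already caught up */

	/* Exactly one request is outstanding, and it is this one. */
	if (req_gen == cpu_gen + 1 && req_gen == mm_gen)
		return TOY_FLUSH_RANGE;

	/* Several requests piled up: flush everything in one go. */
	return TOY_FLUSH_FULL;
}
```

With a single pending request the target CPU can flush just the affected
range. Once several requests have accumulated, the model gives up on ranges
and drops the whole TLB, which is the case the shootdown storm runs into.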

[ I can try to run a benchmark during the weekend to measure the impact, as I
  did not really measure the impact on memcached before/after 5.8. ]

So I am in no position to prioritize one overhead over the other, but I do
not think that COW can be characterized as mostly-local and cheap in the
case of multithreaded workloads.

[1] https://lore.kernel.org/linux-mm/0480D692-D9B2-429A-9A88-9BBA1331AC3A@xxxxxxxxx/