Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table

David Hildenbrand <david@xxxxxxxxxx> · Tue, 14 Feb 2023 17:58:45 +0100

Honestly, for improving the fork(), I have an idea to skip the per-page
operation without breaking the logic. However, this will introduce the
complicated mechanism and may has the overhead for other features. It
might not be worth it. It's hard to strike a balance between the
over-complicated mechanism with (probably) better performance and data
consistency with the page status. So, I would focus on the safety and
stable approach at first.
Yes, it is most probably possible, but complexity, robustness and
maintainability have to be considered as well.

Thanks for implementing this approach (only deduplication without other
optimizations) and evaluating it accordingly. It's certainly "cleaner", such
that we only have to mess with unsharing and not with other
accounting/pinning/mapcount thingies. But it also highlights how intrusive
even this basic deduplication approach already is -- and that most benefits
of the original approach requires even more complexity on top.

I am not quite sure if the benefit is worth the price (I am not to decide
and I would like to hear other options).
I'm looking at the discussion of page table sharing in 2002 [1].
It looks like in 2002 ~ 2006, there also have some patches try to
improve fork().

After that, I also saw one thread which is about another shared page
table patch's benchmark. I can't find the original patch though [2].
But, I found the probably same patch in 2005 [3], it also mentioned
the previous benchmark discussion:

"
For those familiar with the shared page table patch I did a couple of years
ago, this patch does not implement copy-on-write page tables for private
mappings.  Analysis showed the cost and complexity far outweighed any
potential benefit.
"
Thanks for the pointer, interesting read. And my personal opinion is 
that part of that statement still hold true :)
However, it might be different right now. For example, the implemetation
. We have split page table lock now, so we don't have to consider the
page_table_share_lock thing. Also, presently, we have different use
cases (shells [2] v.s. VM cloning and fuzzing) to consider.

Nonetheless, I still think the discussion can provide some of the mind
to us.

BTW, It seems like the 2002 patch [1] is different from the 2002 [2]
and 2005 [3].

[1] https://lkml.iu.edu/hypermail/linux/kernel/0202.2/0102.html
[2] https://lore.kernel.org/linux-mm/3E02FACD.5B300794@xxxxxxxxx/
[3] https://lore.kernel.org/linux-mm/7C49DFF721CB4E671DB260F9@%5B10.1.1.4%5D/T/#u

My quick thoughts after skimming over the core parts of this series

(1) forgetting to break COW on a PTE in some pgtable walker feels quite
     likely (meaning that it might be fairly error-prone) and forgetting
     to break COW on a PTE table, accidentally modifying the shared
     table.
Maybe I should also handle arch/ and others parts.
I will keep looking at where I missed.
One could add sanity checks when modifying a PTE while the PTE table is 
still marked shared ... but I guess there are some valid reasons where 
we might want to modify shared PTE tables (rmap).
(2) break_cow_pte() can fail, which means that we can fail some
     operations (possibly silently halfway through) now. For example,
     looking at your change_pte_range() change, I suspect it's wrong.
Maybe I should add WARN_ON() and skip the failed COW PTE.
One way or the other we'll have to handle it. WARN_ON() sounds wrong for 
handling OOM situations (e.g., if only that cgroup is OOM).
(3) handle_cow_pte_fault() looks quite complicated and needs quite some
     double-checking: we temporarily clear the PMD, to reset it
     afterwards. I am not sure if that is correct. For example, what
     stops another page fault stumbling over that pmd_none() and
     allocating an empty page table? Maybe there are some locking details
     missing or they are very subtle such that we better document them. I
    recall that THP played quite some tricks to make such cases work ...
I think that holding mmap_write_lock may be enough (I added
mmap_assert_write_locked() in the fault function btw). But, I might
be wrong. I will look at the THP stuff to see how they work. Thanks.

Ehm, but page faults don't hold the mmap lock writable? And so are other 
callers, like MADV_DONTNEED or MADV_FREE.
handle_pte_fault()->handle_pte_fault()->mmap_assert_write_locked() 
should bail out.
Either I am missing something or you didn't test with lockdep enabled :)

Note that there are upstream efforts to use only a VMA lock (and some 
people even want to perform some page faults only protected by RCU).
--
Thanks,

David / dhildenb