On 31.08.24 11:55, Barry Song wrote:
On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 31.08.24 11:23, Barry Song wrote:
From: Barry Song <v-songbaohua@xxxxxxxx>
On a physical phone, it's sometimes observed that deferred_split
mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
indicates that the majority of these originate from the typical fork
scenario.
When the child process either execs or exits, the parent process should
ideally be able to reuse the entire mTHP. However, the current kernel
lacks this capability and instead places the mTHP into split_deferred,
performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
main()
{
#define SIZE (1024 * 1024UL)
        void *p = malloc(SIZE);
        memset(p, 0x11, SIZE);
        if (fork() == 0)
                exec(....);
        /*
         * this will trigger cow one subpage from
         * mTHP and put mTHP into split_deferred
         * list
         */
        *(int *)((char *)p + 10) = 10;
        printf("done\n");
        while(1);
}
This leads to two significant issues:
* Memory Waste: Before the mTHP is fully split by the shrinker,
it wastes memory. In extreme cases, such as with a 64KB mTHP,
the memory usage could be 64KB + 60KB until the last subpage
is written, at which point the mTHP is freed.
* Fragmentation and Performance Loss: It destroys large folios
(negating the performance benefits of CONT-PTE) and fragments memory.
To address this, we should aim to reuse the entire mTHP in such cases.
Hi David,
I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
entirely_reuse argument because I’m not sure if there are still cases
where we reuse a subpage within an mTHP. For now, I’m setting
entirely_reuse to true only for the newly supported case, while all
other cases still get false. Please let me know if this is incorrect—if
we don’t reuse subpages at all, we could remove the argument.
See [1] I sent out this week, that is able to reuse even without
scanning page tables. If we find the folio is exclusive we could try
processing surrounding PTEs that map the same folio.
[1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@xxxxxxxxxx
Great! It looks like I missed your patch again. Since you've implemented this
in a better way, I’d prefer to use your patchset.
I wouldn't say better, just more universal. And while taking care of
properly sync'ing the mapcount vs. refcount :P
I’m curious about how you're handling ptep_set_access_flags_nr() or similar
things because I couldn’t find the related code in your patch 10/17:
[PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
Am I missing something?
The idea is to keep individual write faults as fast as possible. So the
patch set keeps it simple and only reuses a single PTE at a time,
setting PAE (PageAnonExclusive) for that one page and mapping it writable.
As the patch states, it might be reasonable to optimize some cases,
maybe also only on some architectures. For example to fault-around and
map the other ones writable as well. It might not always be desirable
though, especially not for larger folios.
--
Cheers,
David / dhildenb