Re: [PATCH RFC] mm: entirely reuse the whole anon mTHP in do_wp_page

On 31.08.24 12:49, Barry Song wrote:
On Sat, Aug 31, 2024 at 10:29 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 31.08.24 12:21, Barry Song wrote:
On Sat, Aug 31, 2024 at 10:07 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 31.08.24 11:55, Barry Song wrote:
On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 31.08.24 11:23, Barry Song wrote:
From: Barry Song <v-songbaohua@xxxxxxxx>

On a physical phone, it's sometimes observed that deferred_split
mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
indicates that the majority of these originate from the typical fork
scenario.
When the child process either execs or exits, the parent process should
ideally be able to reuse the entire mTHP. However, the current kernel
lacks this capability and instead places the mTHP on the deferred_split
list, performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.

     #define SIZE (1024 * 1024UL)

     int main(void)
     {
             void *p = malloc(SIZE);

             memset(p, 0x11, SIZE);
             if (fork() == 0)
                     exec(....);
             /*
              * This triggers CoW of a single subpage of the
              * mTHP and puts the mTHP on the deferred_split
              * list.
              */
             *(int *)(p + 10) = 10;
             printf("done\n");
             while (1);
     }

This leads to two significant issues:

* Memory Waste: Before the mTHP is fully split by the shrinker,
it wastes memory. In extreme cases, such as with a 64KB mTHP,
the memory usage could be 64KB + 60KB until the last subpage
is written, at which point the mTHP is freed.

* Fragmentation and Performance Loss: It destroys large folios
(negating the performance benefits of CONT-PTE) and fragments memory.
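
The 64KB + 60KB figure above can be reproduced with a small userspace calculation. This is only a sketch of the arithmetic, assuming 4KB base pages and assuming the original folio stays allocated until its last subpage has been copied; cow_peak_bytes() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not kernel code): peak memory tied up by one mTHP
 * that is CoW'ed one subpage at a time. Assumes the original folio is
 * only freed once its last subpage has been copied.
 */
static size_t cow_peak_bytes(size_t folio_bytes, size_t page_bytes)
{
	assert(page_bytes > 0);
	assert(folio_bytes >= page_bytes && folio_bytes % page_bytes == 0);

	size_t nr = folio_bytes / page_bytes;

	/* The old folio plus (nr - 1) small copies, just before the last write. */
	return folio_bytes + (nr - 1) * page_bytes;
}
```

For a 64KB mTHP with 4KB pages, nr is 16, so the peak is 64KB plus 15 * 4KB = 60KB of copies.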

To address this, we should aim to reuse the entire mTHP in such cases.

Hi David,

I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
entirely_reuse argument because I’m not sure if there are still cases
where we reuse a subpage within an mTHP. For now, I’m setting
entirely_reuse to true only for the newly supported case, while all
other cases still get false. Please let me know if this is incorrect—if
we don’t reuse subpages at all, we could remove the argument.

See [1], which I sent out this week; it is able to reuse even without
scanning page tables. If we find that the folio is exclusive, we could try
processing surrounding PTEs that map the same folio.

[1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@xxxxxxxxxx

Great! It looks like I missed your patch again. Since you've implemented this
in a better way, I’d prefer to use your patchset.

I wouldn't say better, just more universal. And while taking care of
properly sync'ing the mapcount vs. refcount :P


I’m curious about how you're handling ptep_set_access_flags_nr() or similar
things because I couldn’t find the related code in your patch 10/17:

[PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID

Am I missing something?

The idea is to keep individual write faults as fast as possible. So the
patch set keeps it simple and only reuses a single PTE at a time,
setting that one PAE and mapping it writable.

I get your point, thanks! Since the mTHP is already exclusive, the
following nr - 1 minor page faults will each set their own PTE
writable, one by one.

Yes, assuming you would get these page faults, and assuming you would
get them in the near future.



As the patch states, it might be reasonable to optimize some cases,
maybe also only on some architectures. For example to fault-around and
map the other ones writable as well. It might not always be desirable
though, especially not for larger folios.

Since the mTHP is entirely exclusive anyway, setting all PTEs writable
directly should avoid nr - 1 minor page faults and ideally also
reduce CONTPTE unfolding and folding?

Yes, doing that on CONTPTE granularity would very likely make sense. For
anything bigger than that, I am not sure.

Assuming we have a 1M folio mapped by PTEs. Trying to fault-around in
aligned CONTPTE granularity likely makes sense. Bigger than that, I am
not convinced.


I see. Maybe we can have something like:

static bool pte_fault_around_estimate(int nr)
{
        if (nr / arch_batched_ptes_nr() < 16)
                return true;

        return false;
}

if (pte_fault_around_estimate(folio_nr_pages(folio)))
        set all ptes;

For arm64, arch_batched_ptes_nr() == 16. For architectures
without CONT-PTE or something similar,
arch_batched_ptes_nr() == 1.
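
To see what this estimate would decide for concrete folio sizes, here is a userspace model of it. It is only a sketch: the per-arch batch size is passed in as a parameter (a hypothetical stand-in for arch_batched_ptes_nr(), which does not exist in the kernel), and 4KB base pages are assumed when translating folio sizes to nr.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the proposed heuristic. arch_batched_ptes is a
 * hypothetical stand-in for arch_batched_ptes_nr(): 16 for arm64
 * CONT-PTE, 1 for architectures without such batching.
 */
static bool pte_fault_around_estimate(int nr, int arch_batched_ptes)
{
	assert(nr > 0 && arch_batched_ptes > 0);

	/* Map everything writable only if the folio spans < 16 batches. */
	return nr / arch_batched_ptes < 16;
}
```

With these numbers, a 64KB folio (nr = 16) on arm64 passes the check, while a 1M folio (nr = 256) does not, since 256 / 16 is not below 16. With a batch size of 1, only folios smaller than 16 pages pass.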

Yes, something like that would be my take.

After we know that we can reuse the large folio, we'll try scanning starting from the aligned PTE. If we find that we can batch, we'll batch that part. Otherwise we'll simply fall back to a single one.

Handling batching across VMAs is a bit harder. We might be able to batch, or might not ... We could have the CONT_PTE bit set across VMAs, but might not necessarily be able to batch (e.g., some VMAs are read-only).
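
The "scan from the aligned PTE, else fall back to a single one" idea could look roughly like the userspace model below. This is a sketch under stated assumptions, not kernel code: struct pte_range, pick_reuse_range() and the can_batch[] array are all hypothetical, with can_batch[i] standing in for whatever per-PTE checks would be needed (same folio, expected page, VMA allows writes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: which PTEs to map writable on this fault. */
struct pte_range {
	size_t start;
	size_t nr;
};

/*
 * can_batch[i] is a hypothetical per-PTE flag: true if PTE i maps the
 * expected page of the same exclusive folio and may be made writable.
 */
static struct pte_range pick_reuse_range(const bool *can_batch, size_t nr_ptes,
					 size_t fault_idx, size_t batch)
{
	assert(batch > 0 && fault_idx < nr_ptes);

	size_t start = fault_idx - fault_idx % batch;	/* align down */
	struct pte_range single = { fault_idx, 1 };

	/* The aligned block must lie entirely inside the scanned table. */
	if (start + batch > nr_ptes)
		return single;

	/* Any PTE that cannot be batched forces the single-PTE fallback. */
	for (size_t i = start; i < start + batch; i++)
		if (!can_batch[i])
			return single;

	return (struct pte_range){ start, batch };
}
```

A read-only VMA covering part of the aligned block would show up here as a false entry in can_batch[], so the fault would only touch its own PTE.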


Just some rough ideas; all the naming might be quite messy.

At least we won't lose the benefit of reduced TLB misses
on aarch64 before all nr_pages are written :-)


What is the downside to doing that? I also don't think mapping them
all together will waste memory?

No, it's all about increasing the latency of individual write faults.


I see. I assume it won't be worse than the current case, where we have to
allocate small folios and copy? And folio allocation can additionally incur
direct reclaim?

Yes, it would certainly be better than what we currently have. Almost everything would likely be better than what we currently have. :)

--
Cheers,

David / dhildenb




