Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

Further, we have to be a bit careful regarding replacing ranges that are backed
by different anon pages (for example, due to fork() deciding to copy some
sub-pages of a PTE-mapped folio instead of sharing all sub-pages).

I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy, and as part of
that scan I make sure that I only cover a single folio. So I think I conform
here - the scan gives up as soon as it hits a hole.
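To make that concrete, the scan is shaped roughly like the kernel-style sketch
below; the function name and bounds are illustrative, not the exact helper from
the series:

	/*
	 * Illustrative sketch only: count how many consecutive PTEs, starting
	 * at @pte/@addr, map consecutive sub-pages of the same folio as @page.
	 * The fault path then only reuses/copies that range.
	 */
	static int anon_folio_nr_contig_ptes(struct vm_area_struct *vma,
					     unsigned long addr, pte_t *pte,
					     struct page *page, int max_nr)
	{
		int nr = 1;

		while (nr < max_nr) {
			pte_t ptent = ptep_get(pte + nr);

			if (!pte_present(ptent))
				break;
			/* Stop at the first PTE that doesn't map the next
			 * contiguous sub-page of the same folio. */
			if (vm_normal_page(vma, addr + nr * PAGE_SIZE, ptent) !=
			    page + nr)
				break;
			nr++;
		}

		return nr;
	}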

During fork(), what could happen (a page temporarily detected as pinned,
resulting in a copy) is something weird like:

PTE 0: subpage0 of anon page #1 (maybe shared)
PTE 1: subpage1 of anon page #1 (maybe shared)
PTE 2: anon page #2 (exclusive)
PTE 3: subpage2 of anon page #1 (maybe shared)

Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap
something new in PTE2. But I don't see how it happens at fork. For PTE3, did you
mean subpage _3_?


Yes, fat fingers :) Thanks for paying attention!

The above could be optimized by processing all consecutive PTEs at once:
meaning, we check whether the page may be pinned only once, and then either
copy all PTEs or share all PTEs. I guess it's unlikely to happen in practice,
though.
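A hand-wavy sketch of what I mean, as copy_pte_range()-style pseudocode; the
batching helpers here are made up for illustration and don't exist:

	/*
	 * Pseudocode only: nr_contig_ptes(), copy_ptes() and share_ptes() are
	 * hypothetical helpers. The idea is that the "maybe pinned" decision
	 * is taken once per folio span instead of once per PTE, so fork()
	 * either copies or shares the whole span and can't produce the mixed
	 * layout above.
	 */
	nr = nr_contig_ptes(src_pte, page);	/* PTEs mapping the same folio */
	if (page_needs_cow_for_dma(src_vma, page))	/* the existing pinned check */
		copy_ptes(dst_vma, src_vma, dst_pte, src_pte, addr, nr);
	else
		share_ptes(dst_vma, src_vma, dst_pte, src_pte, addr, nr);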



Of course, any combination of the above.

Further, with mremap() we might get completely crazy layouts, randomly mapping
sub-pages of anon pages, mixed with other sub-pages or base-page folios.

Maybe it's all handled already by your code; I'm just pointing out what kind of
mess we might get :)

Yep, this is already handled; the scan to expand the range ensures that all the
PTEs map to the expected contiguous pages in the same folio.

Okay, great.

So what should be safe is replacing all sub-pages of a folio that are marked
"maybe shared" by a new folio under PT lock. However, I wonder if it's really
worth the complexity. For THP we were happy so far to *not* optimize this,
implying that maybe we shouldn't worry too heavily about optimizing the fork()
case for now.
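For the record, roughly what I have in mind, heavily simplified; the data-copy
helper is hypothetical, and the error handling, rmap, TLB flushing and memcg
charging I'm omitting here are exactly the complexity I'm not sure is worth it:

	/*
	 * Sketch only, in the write-fault path with the PT lock held:
	 * allocate a fresh large folio, copy the old data, and replace every
	 * PTE in the range that still maps a "maybe shared" sub-page.
	 */
	new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma, addr, true);
	copy_folio_data(new_folio, old_folio);	/* hypothetical data-copy helper */

	for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
		pte_t entry = mk_pte(folio_page(new_folio, i), vma->vm_page_prot);

		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		ptep_clear_flush(vma, addr, pte + i);
		set_pte_at(vma->vm_mm, addr, pte + i, entry);
	}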

I don't have the exact numbers to hand, but I'm pretty sure I remember that
enabling large copies contributed a measurable amount to the performance
improvement. (Certainly, the zero-page copy case is a big contributor.) I don't
have access to the HW at the moment but can rerun later with and without to
double-check.

In which test exactly? Some micro-benchmark?

The kernel compile benchmark that I quoted numbers for in the cover letter. I
have some tracepoints (not part of the submitted series) that tell me how many
mappings of each order we get for each code path. I'm pretty sure I remember
all four of these code paths contributing non-negligible amounts.

Interesting! It would be great to see whether there is an actual difference
with patch #10 applied but without the other COW replacement.

--
Thanks,

David / dhildenb




