huge zero page confusion

Ryan Roberts <ryan.roberts@xxxxxxx> · Wed, 3 Jul 2024 18:37:48 +0100

Hi Kirill, Hugh, Mel,

We recently had a problem reported at [1]: because the aarch64 architecture
requires that atomic RMW instructions raise a read fault followed by a write
fault, a huge zero page is faulted in during the read fault, and the write fault
then shatters that huge zero page, installing small zero pages for every PTE in
the PMD region except the faulting address, which gets a writable private page.

A number of ways to solve that problem were discussed, but it got me wondering
why we have this behaviour for the huge zero page in general. It seems odd to
me. Surely it would be less effort, and more aligned with the app's
expectations, to notice the huge zero page in the PMD, remove it, and install a
THP, as would have been done if pmd_none() were true? Or, if there is a reason
to shatter on write, why not do away with the huge zero page, save some memory,
and just install a PMD's worth of small zero pages on fault?

Perhaps replacing the huge zero page with a THP on write fault would have been
the better behaviour at the time, but perhaps changing that behaviour now risks
a memory bloat regression in some workloads?

I had some brief discussion with David H starting at [2].

Would appreciate your thoughts!

[1]
https://lore.kernel.org/all/20240626191830.3819324-1-yang@xxxxxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/3743d7e1-0b79-4eaf-82d5-d1ca29fe347d@xxxxxxx/

Thanks,
Ryan