Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages

Frank van der Linden <fvdl@xxxxxxxxxx> · Mon, 2 Dec 2024 14:50:49 -0800

On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
> > Fresh hugetlb pages are zeroed out when they are faulted in,
> > just like with all other page types. This can take up a good
> > amount of time for larger page sizes (e.g. around 40
> > milliseconds for a 1G page on a recent AMD-based system).
> >
> > This normally isn't a problem, since hugetlb pages are typically
> > mapped by the application for a long time, and the initial
> > delay when touching them isn't much of an issue.
> >
> > However, there are some use cases where a large number of hugetlb
> > pages are touched when an application (such as a VM backed by these
> > pages) starts. For 256 1G pages and 40ms per page, this would take
> > 10 seconds, a noticeable delay.
>
> The current huge page zeroing code is not that great to begin with.
>
> There was a patchset posted some time ago to remedy at least some of it:
> https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@xxxxxxxxxx/
>
> but it apparently fell through the cracks.

Hi Mateusz, thanks for your reply.

I am aware of that patch set, yes. The discussion around it evolved in
to one about kernel preemption and the evilness of cond_resched().

You can certainly improve the time it takes to zero out a 1G page by
optimizing the code that does it. See also, for example,
https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@xxxxxxxxxx/

However, while, say, a 50% improvement in zero out time, at the max,
is nice, this still leaves the faulting process spending considerable
time doing it. Like you say, that's cost that needs to be paid - but
it would be good to avoid paying it inline. This patch avoids doing
that altogether, leading to a basically 100% improvement under
reasonably good circumstances.

>
> Any games with "background zeroing" are notoriously crappy and I would
> argue one should exhaust other avenues before going there -- at the end
> of the day the cost of zeroing will have to get paid.

I understand that the concept of background prezeroing has been, and
will be, met with some resistance. But, do you have any specific
concerns with the patch I posted? It's pretty well isolated from the
rest of the code, and optional.

>
> To that end I would suggest picking up the patchset and experimenting
> with more variants of the zeroing code (for example for 1G it may be it
> is faster to employ SIMD usage in the routine).

See above - happy to pick up older patch(es) as a separate effort, but
they won't fully solve the issue for the scenario I'm describing.

>
> If this is really such a problem I wonder if this could start as a
> series of 2MB pages instead faulted as needed, eventually promoted to
> 1G after passing some threshold?

This idea sounds similar to HGM (high granularity mapping), an idea
which was originally posted for the purpose of live migration of VMs
(but never made it in). It's not trivial, and seems like overkill.
Again, my patch is non-invasive and optional, so I think it's better
in that regard.

- Frank