On Tue, Dec 3, 2024 at 6:26 AM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
> >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >>> Any games with "background zeroing" are notoriously crappy and I would
> >>> argue one should exhaust other avenues before going there -- at the end
> >>> of the day the cost of zeroing will have to get paid.
> >>
> >> I understand that the concept of background prezeroing has been, and
> >> will be, met with some resistance. But, do you have any specific
> >> concerns with the patch I posted? It's pretty well isolated from the
> >> rest of the code, and optional.
> >
> > The biggest concern I have is that the overhead is paid by everybody on
> > the system - it is considered to be a system overhead even if only
> > part of the workload benefits from hugetlb pages. In other words, the
> > workload using those pages is not fully accounted for their use.
> >
> > If the startup latency is a real problem, is there a way to work around
> > that in userspace by preallocating hugetlb pages ahead of time, before
> > those VMs are launched, and handing over already pre-allocated pages?
>
> It should be relatively simple to actually do this. Mike and I experimented
> with it ourselves a couple of years back but never had the chance to send it
> over. IIRC, if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault path
>
> [somewhat similar to this series, I suspect]
>
> then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> behaviour. A daemon can just allocate/mmap+touch/etc. them read-only and free
> them back 'as zeroed' to implement a userspace scrubber. In principle,
> existing apps should see no difference. The amount of changes is consequently
> significantly smaller (or it looked that way in a quick PoC years back).

This would work, and is easy to do, but:

* You now have a userspace daemon that depends on kernel-internal behavior.
* It has no way to track how much work is left to do or what needs to be
  done (unless it is part of an application that is the only user of
  hugetlbfs on the system).
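
For reference, the scrubber side of this really is simple. A rough, untested
sketch -- assuming a hugetlbfs mount at /dev/hugepages, 2 MiB huge pages, and
the zeroed-page tracking proposed above -- could look something like this:

/*
 * Untested sketch of such a scrubber: map a hugetlbfs file read-only,
 * fault each huge page in (the kernel zeroes it on the fault), then
 * truncate the file so the pages return to the free pool - with the
 * proposed tracking they would come back already marked as zeroed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE      (2UL << 20)     /* assumes 2 MiB huge pages */
#define NR_PAGES        64UL            /* pages to scrub per pass */
#define SCRUB_FILE      "/dev/hugepages/scrubber"

int main(void)
{
        size_t len = NR_PAGES * HPAGE_SIZE;
        char *map;
        int fd;

        fd = open(SCRUB_FILE, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* hugetlbfs needs the file size set up front for read faults */
        if (ftruncate(fd, len)) {
                perror("ftruncate");
                return 1;
        }

        map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Read-fault each huge page; the kernel allocates and zeroes it. */
        for (size_t i = 0; i < NR_PAGES; i++)
                (void)*(volatile char *)(map + i * HPAGE_SIZE);

        munmap(map, len);
        /* Drop the pages again so they go back to the free pool. */
        ftruncate(fd, 0);
        unlink(SCRUB_FILE);
        close(fd);
        return 0;
}

Note that the daemon only accomplishes anything because of the kernel-internal
detail that the read-faulted pages would go back to the free pool still
flagged as zeroed, and it has no visibility into how much scrubbing is
actually pending -- which is the dependency and the blind spot I mean above.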

> Something extra on the top would perhaps be the ability to select a lookup
> heuristic such that we can pick the search method of
> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> UAPI) to allow a scrubber to easily coexist with a hugepage user (e.g. a VMM)
> without too much of a dance.

Again, that would probably work, but if you take a step back: you now have a
kernel behavior that can be guided in certain directions, but with no
guarantees and no stats to see if things are working out. Plus an explicit
allocation method option (basically: take from the head or the tail of the
freelist). The picture is getting murkier.

At least with the patch I sent you have a clearly defined, optional behavior
that can be switched on or off, and stats to see if it's working.

I do understand the argument that pre-zeroing is not accounted to the thread
that ends up using the pages. I would counter that benefiting from work done
by kernel threads is not unheard of in the kernel today. Also, the other
proposals so far have another thread doing the zeroing as well - it is just
explicitly started by userspace. So, the cost is still not paid by the user
of the pages; you just end up explicitly controlling who does pay it. Which
I suppose is better, but it's still not trivial to get completely right (you
could perhaps do it at the container level with some trickery).

What we have done so far is to bind the khzerod threads introduced in this
patch to CPUs in such a way that they don't interfere with the rest of the
system - something you would also have to do with any userspace solution.

Again, this is optional: if you are a system administrator who prefers the
resources used by zeroing hugetlb pages to be explicitly accounted to the
actual user, you can simply leave this behavior disabled (it's off by
default).

I guess I can summarize my thoughts like this: while I understand the
argument against doing this outside the context of the actual user of the
pages, it is 1) optional, and 2) the other solutions proposed so far either
introduce interfaces that I don't think are that great, or would require
maintaining a hugetlb 'shadow pool' in userspace through hugetlbfs files.

- Frank