On Tue, Dec 3, 2024 at 6:26 AM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > On Mon 02-12-24 14:50:49, Frank van der Linden wrote:
> >> On Mon, Dec 2, 2024 at 1:58 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >>> Any games with "background zeroing" are notoriously crappy and I would
> >>> argue one should exhaust other avenues before going there -- at the end
> >>> of the day the cost of zeroing will have to get paid.
> >>
> >> I understand that the concept of background prezeroing has been, and
> >> will be, met with some resistance. But, do you have any specific
> >> concerns with the patch I posted? It's pretty well isolated from the
> >> rest of the code, and optional.
> >
> > The biggest concern I have is that the overhead is paid by everybody on
> > the system - it is considered to be a system overhead even if only
> > part of the workload benefits from hugetlb pages. In other words, the
> > workload using those pages is not fully accounted for their use.
> >
> > If the startup latency is a real problem, is there a way to work around
> > that in userspace by preallocating hugetlb pages ahead of time, before
> > those VMs are launched, and handing over already pre-allocated pages?
>
> It should be relatively simple to actually do this. Mike and I experimented
> with it ourselves a couple of years back but never had the chance to send it
> over. IIRC, if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault path
>
> [somewhat similar to this series, I suspect]
>
> then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> behaviour. A daemon can just allocate/mmap+touch/etc. them read-only and free
> them back 'as zeroed' to implement a userspace scrubber. In principle,
> existing apps should see no difference. The amount of changes is consequently
> significantly smaller (or it looked that way in a quick PoC years back).

This would work, and is easy to do, but:

* You now have a userspace daemon that depends on kernel-internal behavior.
* It has no way to track how much work is left to do or what needs to be
  done (unless it is part of an application that is the only user of
  hugetlbfs on the system).
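
For reference, the scrubber side of this really is simple. A rough, untested
sketch -- assuming a hugetlbfs mount at /dev/hugepages, 2 MiB huge pages, and
the zeroed-page tracking proposed above -- could look something like this:

/*
 * Untested sketch of such a scrubber: map a hugetlbfs file read-only,
 * fault each huge page in (the kernel zeroes it on the fault), then
 * truncate the file so the pages return to the free pool - with the
 * proposed tracking they would come back already marked as zeroed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE      (2UL << 20)     /* assumes 2 MiB huge pages */
#define NR_PAGES        64UL            /* pages to scrub per pass */
#define SCRUB_FILE      "/dev/hugepages/scrubber"

int main(void)
{
        size_t len = NR_PAGES * HPAGE_SIZE;
        char *map;
        int fd;

        fd = open(SCRUB_FILE, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* hugetlbfs needs the file size set up front for read faults */
        if (ftruncate(fd, len)) {
                perror("ftruncate");
                return 1;
        }

        map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Read-fault each huge page; the kernel allocates and zeroes it. */
        for (size_t i = 0; i < NR_PAGES; i++)
                (void)*(volatile char *)(map + i * HPAGE_SIZE);

        munmap(map, len);
        /* Drop the pages again so they go back to the free pool. */
        ftruncate(fd, 0);
        unlink(SCRUB_FILE);
        close(fd);
        return 0;
}

Note that the daemon only accomplishes anything because of the kernel-internal
detail that the read-faulted pages would go back to the free pool still
flagged as zeroed, and it has no visibility into how much scrubbing is
actually pending -- which is the dependency and the blind spot I mean above.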

> Something extra on the top would perhaps be the ability to select a lookup
> heuristic such that we can pick the search method of
> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> UAPI) to allow a scrubber to easily coexist with a hugepage user (e.g. a VMM)
> without too much of a dance.

Again, that would probably work, but if you take a step back: you now have a
kernel behavior that can be guided in certain directions, but with no
guarantees and no stats to see if things are working out. Plus an explicit
allocation method option (basically: take from the head or the tail of the
freelist). The picture is getting murkier.

At least with the patch I sent you have a clearly defined, optional behavior
that can be switched on or off, and stats to see if it's working.

I do understand the argument that pre-zeroing is not accounted to the thread
that ends up using the pages. I would counter that benefiting from work done
by kernel threads is not unheard of in the kernel today. Also, the other
proposals so far have another thread doing the zeroing as well - it is just
explicitly started by userspace. So, the cost is still not paid by the user
of the pages; you just end up explicitly controlling who does pay it. Which
I suppose is better, but it's still not trivial to get completely right (you
could perhaps do it at the container level with some trickery).

What we have done so far is to bind the khzerod threads introduced in this
patch to CPUs in such a way that they don't interfere with the rest of the
system - something you would also have to do with any userspace solution.

Again, this is optional: if you are a system administrator who prefers the
resources used by zeroing hugetlb pages to be explicitly accounted to the
actual user, you can simply leave this behavior disabled (it's off by
default).

I guess I can summarize my thoughts like this: while I understand the
argument against doing this outside the context of the actual user of the
pages, it is 1) optional, and 2) the other solutions proposed so far either
introduce interfaces that I don't think are that great, or would require
maintaining a hugetlb 'shadow pool' in userspace through hugetlbfs files.

- Frank