On Tue, Dec 3, 2024 at 4:57 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
> >
> > On 03/12/2024 12:06, Michal Hocko wrote:
> > > If the startup latency is a real problem, is there a way to work around
> > > that in userspace by preallocating hugetlb pages ahead of time,
> > > before those VMs are launched, and hand over already pre-allocated pages?
> >
> > It should be relatively simple to actually do this. Mike and I experimented
> > with it ourselves a couple of years back but never had the chance to send it
> > over. IIRC, if we:
> >
> > - add the PageZeroed tracking bit when a page is zeroed
> > - clear it in the write (fixup/non-fixup) fault path
> >
> > [somewhat similar to this series, I suspect]
> >
> > then what's left is to change the lookup of free hugetlb pages
> > (dequeue_hugetlb_folio_node_exact(), I think) to search first for non-zeroed
> > pages. Provided we don't expose the 'cleared' state, there's no UAPI change in
> > behaviour. A daemon can just allocate/mmap+touch them read-only and
> > free them back 'as zeroed' to implement a userspace scrubber, and in principle
> > existing apps should see no difference. The amount of changes is consequently
> > significantly smaller (or it looked that way in a quick PoC years back).
> >
> > Something extra on top would perhaps be the ability to select a lookup
> > heuristic, so that the search method (non-zeroed-first/only-non-zeroed/only-zeroed
> > pages) can be picked behind an ioctl() (or a better generic UAPI) to allow a
> > scrubber to easily coexist with a hugepage user (e.g. a VMM) without too much
> > of a dance.
> >
>
> Yeah, after the qemu prefaulting got pointed out I started thinking about
> a userlevel daemon which would do the work proposed here.
>
> Except I got stuck at a good way to do it. The mmap + load from the
> area + munmap triple does work, but it also entails more overhead than
> necessary, and I only have some handwaving about how to avoid that. :)
>
> Suppose a daemon of the sort exists and there is a machine with 4 or
> more NUMA domains to deal with. Further suppose it spawns at least one
> thread per such domain and tasksets them accordingly.
>
> Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
> indicating how many pages to zero out (or even just accept one page).
> This would avoid the churn on munmap.
>
> This would still need the majority of the patch, but all the zeroing
> policy would be taken out. The key point is that whatever specific
> behavior one sees fit can be implemented in userspace, preventing
> future kernel patches from adding more tweaks.
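To be concrete, the mmap + load + munmap triple mentioned above is roughly
the following (just a sketch, untested, assuming a preallocated pool and a
2M default huge page size; a real daemon would additionally pin each worker
to its NUMA node):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)	/* assuming 2M default huge pages */

/*
 * Fault in 'count' huge pages read-only (hugetlb has no zero page, so
 * even a read fault allocates and clears a fresh huge page), then give
 * them back to the pool via munmap.  With zeroed-page tracking in the
 * kernel, a later consumer could skip clear_huge_page() entirely.
 */
static int scrub_pages(unsigned long count)
{
	size_t len = count * HPAGE_SIZE;
	volatile char sink;
	char *p;

	p = mmap(NULL, len, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* the read fault allocates and zeroes each huge page */
	for (unsigned long i = 0; i < count; i++)
		sink = p[i * HPAGE_SIZE];

	/* munmap hands the freshly zeroed pages back to the free pool */
	return munmap(p, len);
}

int main(void)
{
	if (scrub_pages(4))
		perror("scrub_pages");
	return 0;
}

With the PageZeroed-style tracking described above on top of this, the pages
freed here would stay marked as clean until someone actually dirties them.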
How about this for a rough sketch (which I have zero intention of
implementing myself):

/dev/hugepagectl or whatever is created with a bunch of ioctls, notably:
- something to query hugepage stats
- an event generated for epoll if the count in any domain goes below a threshold
- something to zero a page of a given size from the free list

Perhaps make it so that fds require an upfront ioctl to set a NUMA domain of
interest before poll works -- for example, if there is one thread per domain,
each of them sleeps on its own relevant fd. Or maybe someone still wants the
main thread to get the full view, so they poll on all of them.

Then a Google-internal tool can react however it sees fit without waking up
periodically. (Replace Google with any other company which may want to mess
with this.)

Optional:
- allocating and zeroing (but not mmapping!) a page

Then a party which shares the file descriptor could obtain it by passing the
fd to mmap. munmap would just free it as it does now. This would allow qemu
et al. to avoid the mmap/munmap dance just to zero, but I don't know how
useful it is for them.

-- 
Mateusz Guzik <mjguzik gmail.com>
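To make the sketch above concrete, here is a purely illustrative userspace
view of it; every name, struct and ioctl number is invented for the example
and nothing of the sort exists today. The shape is: bind an fd to a NUMA
domain, sleep in poll() until the zeroed-page count drops below a threshold,
then ask the kernel to zero a batch of free pages.

/*
 * Purely illustrative and untested: what the /dev/hugepagectl idea above
 * could look like from userspace.  Every name, struct and ioctl number
 * here is made up for the example; no such interface exists.
 */
#include <stdint.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct hugepagectl_stats {		/* hypothetical */
	uint32_t node;
	uint64_t free;
	uint64_t zeroed;
};

struct hugepagectl_zero {		/* hypothetical */
	uint64_t page_size;		/* e.g. 2M or 1G */
	uint64_t count;			/* how many free pages to zero */
};

/* made-up ioctl numbers in a made-up 'H' namespace */
#define HUGEPAGECTL_SET_NODE	_IOW('H', 0, uint32_t)
#define HUGEPAGECTL_SET_THRESH	_IOW('H', 1, uint64_t)
#define HUGEPAGECTL_GET_STATS	_IOR('H', 2, struct hugepagectl_stats)
#define HUGEPAGECTL_ZERO	_IOW('H', 3, struct hugepagectl_zero)

/* one of these per NUMA domain, pinned to that domain by the caller */
static int scrub_node(uint32_t node)
{
	struct hugepagectl_zero zero = { .page_size = 2UL << 20 };
	uint64_t threshold = 64;	/* wake me when zeroed pages run low */
	struct pollfd pfd;
	int fd;

	fd = open("/dev/hugepagectl", O_RDWR);
	if (fd < 0)
		return -1;

	/* bind this fd to one domain so poll only reports events for it */
	if (ioctl(fd, HUGEPAGECTL_SET_NODE, &node) ||
	    ioctl(fd, HUGEPAGECTL_SET_THRESH, &threshold))
		goto out;

	pfd.fd = fd;
	pfd.events = POLLIN;

	for (;;) {
		struct hugepagectl_stats st = { .node = node };

		/* sleep until the zeroed count for this domain drops below
		 * the threshold */
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (ioctl(fd, HUGEPAGECTL_GET_STATS, &st))
			break;
		/* top the pool back up; how much to zero is pure policy,
		 * which is exactly the part that stays in userspace */
		zero.count = threshold - st.zeroed;
		if (ioctl(fd, HUGEPAGECTL_ZERO, &zero))
			break;
	}
out:
	close(fd);
	return -1;
}

int main(void)
{
	return scrub_node(0);	/* in reality, one thread per node */
}

The optional allocate-and-zero-without-mmap part would presumably slot in as
one more ioctl on the same fd, with the consumer mapping the result by
passing that shared fd to mmap, as described above.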