On Tue, Dec 3, 2024 at 4:05 PM Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
>
> Mateusz Guzik <mjguzik@xxxxxxxxx> writes:
>
> > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
> >> Fresh hugetlb pages are zeroed out when they are faulted in,
> >> just like with all other page types. This can take up a good
> >> amount of time for larger page sizes (e.g. around 40
> >> milliseconds for a 1G page on a recent AMD-based system).
> >>
> >> This normally isn't a problem, since hugetlb pages are typically
> >> mapped by the application for a long time, and the initial
> >> delay when touching them isn't much of an issue.
> >>
> >> However, there are some use cases where a large number of hugetlb
> >> pages are touched when an application (such as a VM backed by these
> >> pages) starts. For 256 1G pages and 40ms per page, this would take
> >> 10 seconds, a noticeable delay.
> >
> > The current huge page zeroing code is not that great to begin with.
>
> Yeah definitely suboptimal. The current huge page zeroing code is
> both slow and it trashes the cache while zeroing.
>
> > There was a patchset posted some time ago to remedy at least some of it:
> > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@xxxxxxxxxx/
> >
> > but it apparently fell through the cracks.
>
> As Joao mentioned that got side tracked due to the preempt-lazy stuff.
> Now that lazy is in, I plan to follow up on the zeroing work.
>
> > Any games with "background zeroing" are notoriously crappy and I would
> > argue one should exhaust other avenues before going there -- at the end
> > of the day the cost of zeroing will have to get paid.
>
> Yeah and the background zeroing has dual cost: the cost in CPU time plus
> the indirect cost to other processes due to the trashing of L3 etc.

I'm not sure what you mean here - any caching side effects of zeroing
happen regardless of who does it, right? It doesn't matter if it's a
kthread or the calling thread.

If you're concerned about the caching side effects in general, using
non-temporal instructions helps (e.g. movnti on x86). See the link I
mentioned for a patch that was sent years ago
(https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@xxxxxxxxxx/).
Using movnti on x86 definitely helps performance (up to 50% in my
experiments). Which is great, but it still leaves a considerable delay
for the use case I mentioned.

- Frank
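
To illustrate the non-temporal approach mentioned above, here is a minimal
userspace sketch. It is not the code from either linked patch, and
zero_nontemporal() is a hypothetical helper name. It assumes an x86-64
compiler with SSE2 intrinsics, a buffer aligned to 8 bytes, and a length
that is a multiple of 8. The _mm_stream_si64() intrinsic is expected to
compile to movnti.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2 intrinsics: _mm_stream_si64, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace helper (not a kernel API): zero len bytes at buf
 * using non-temporal 64-bit stores (movnti on x86-64).
 * Assumption: buf is 8-byte aligned and len is a multiple of 8.
 */
static void zero_nontemporal(void *buf, size_t len)
{
	long long *p = buf;
	size_t n = len / sizeof(*p);

	assert(((uintptr_t)buf % 8) == 0 && (len % 8) == 0);

	/* Each store carries a hint to avoid allocating the line in cache. */
	for (size_t i = 0; i < n; i++)
		_mm_stream_si64(&p[i], 0);

	/* Non-temporal stores are weakly ordered; fence before others read. */
	_mm_sfence();
}
```

The fence after the loop is there because non-temporal stores are not
ordered like regular stores, so a reader on another CPU could otherwise
observe stale data. Whoever runs a loop like this, a kthread or the
faulting thread, the stores themselves are the same.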