On 1/10/2023 1:33 AM, David Hildenbrand wrote:
> On 09.01.23 08:22, Yin Fengwei wrote:
>> In a nutshell: 4k is too small and 2M is too big. We started
>> asking ourselves whether there was something in the middle that
>> we could do. This series shows what that middle ground might
>> look like. It provides some of the benefits of THP while
>> eliminating some of the downsides.
>>
>> This series uses "multiple consecutive pages" (mcpages) of
>> between 8K and 2M of base pages for anonymous user space mappings.
>> This will lead to less internal fragmentation versus 2M mappings
>> and thus less memory consumption and wasted CPU time zeroing
>> memory which will never be used.
>
> Hi,
>
> what I understand is that this is some form of faultaround for
> anonymous memory, with the special-case that we try to allocate the
> pages consecutively.

For this patchset, yes. But mcpage can be enabled for page cache,
swapping etc.

>
> Some thoughts:
>
> (1) Faultaround might be unexpected for some workloads and increase
>     memory consumption unnecessarily.

Compared to THP, the memory consumption and latency introduced by
mcpage are minor.

>
> Yes, something like that can happen with THP BUT
>
> (a) THP can be disabled or is frequently only enabled for madvised
>     regions -- for example, exactly for this reason.
> (b) Some workloads (especially memory ballooning) rely on memory not
>     suddenly re-appearing after MADV_DONTNEED. This works even with THP,
>     because the 4k MADV_DONTNEED will first PTE-map the THP. Because
>     there is a PTE page table, we won't suddenly get a THP populated
>     again (unless khugepaged is configured to fill holes).
>
>
> I strongly assume we will need something similar to force-disable,
> selectively-enable etc.

Agree.

>
>
> (2) This steals consecutive pages to immediately split them up
>
> I know, everybody thinks it might be valuable for their use case to
> grab all higher-order pages :) It will be "fun" once all these cases
> start competing.
> TBH, splitting them up immediately again smells like being the
> lowest priority among all higher-order users.
>

The motivations for splitting it immediately are:
1. All the sub-pages are just normal 4K pages. No other changes need
   to be added to handle them.
2. Splitting it before use doesn't involve complicated page lock
   handling.

>
> (3) All effort will be lost once page compaction gets active, compacts,
>     and simply migrates to random 4k pages. This is most probably the
>     biggest "issue" of the whole approach AFAIKS: it's only temporary
>     because there is no notion of these pages belonging together
>     anymore.

Yes. But I suppose page compaction could be updated to handle mcpage,
for example by always handling all sub-pages together. We did
experiment with this for reclaim.

>
>>
>> In the implementation, we allocate a high order page with the order
>> of mcpage (e.g., order 2 for 16KB mcpage). This makes sure that
>> physically contiguous memory is used and benefits sequential memory
>> access latency.
>>
>> Then split the high order page. By doing this, each sub-page of
>> mcpage is just a normal 4K page. The current kernel page
>> management is applied to "mc" pages without any changes. Batching
>> page faults is allowed with mcpage and reduces the number of page
>> faults.
>>
>> There are costs with mcpage. Besides lacking the TLB benefit THP
>> brings, it increases memory consumption and page allocation latency
>> compared to 4K base pages.
>>
>> This series is the first step of mcpage. Future work can
>> enable mcpage for more components like page cache, swapping etc.
>> Finally, most pages in the system will be allocated/freed/reclaimed
>> with mcpage order.
>
> I think avoiding new, hard-to-get terminology ("mcpage") might be
> better. I know, everybody wants to give their child a name, but the
> name is not really future proof: "multiple consecutive pages" might at
> one point be maybe just a folio.
>
> I'd summarize the ideas as "faultaround" whereby we try optimizing
> for locality.
>
> Note that a similar (but different) concept already exists (hidden)
> for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a
> sequence of PTEs that logically map a hugetlb page.

"cont-pte" on ARM64 has a fixed size that matches the silicon
definition. mcpage allows software/user to define the size, which does
not have to be exactly the same as the silicon-defined size.

Thanks.

Regards
Yin, Fengwei
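
To make the order arithmetic from the series description concrete
(e.g., order 2 for a 16KB mcpage), here is a minimal userspace sketch.
It assumes 4K base pages and power-of-two mcpage sizes between 8K and
2M. The helper names mcpage_order() and mcpage_nr_subpages() and the
macros are hypothetical, chosen for illustration only; they are not
taken from the patch series.

```c
#include <assert.h>

/* Assumption: 4K base pages, i.e. a page shift of 12. */
#define BASE_PAGE_SHIFT  12
/* Size range quoted in the series description: 8K to 2M. */
#define MCPAGE_MIN_SIZE  (8UL * 1024)
#define MCPAGE_MAX_SIZE  (2UL * 1024 * 1024)

/*
 * Hypothetical helper: map an mcpage size in bytes to the allocation
 * order of the high-order page that would back it. Returns -1 if the
 * size is outside the 8K..2M range or is not a power of two.
 */
static int mcpage_order(unsigned long size)
{
	int order = 0;

	if (size < MCPAGE_MIN_SIZE || size > MCPAGE_MAX_SIZE)
		return -1;
	if (size & (size - 1))		/* not a power of two */
		return -1;

	/* Smallest order whose block of base pages covers the size. */
	while ((1UL << (BASE_PAGE_SHIFT + order)) < size)
		order++;

	return order;
}

/*
 * Hypothetical helper: number of 4K sub-pages obtained once the
 * high-order page of the given order has been split.
 */
static unsigned long mcpage_nr_subpages(int order)
{
	assert(order >= 0);
	return 1UL << order;
}
```

With these assumptions, a 16KB mcpage maps to order 2 and is split
into four 4K sub-pages, while the 2M upper bound maps to order 9.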