On 1/10/2023 10:40 PM, David Hildenbrand wrote:
> On 10.01.23 04:57, Yin, Fengwei wrote:
>>
>>
>> On 1/10/2023 1:33 AM, David Hildenbrand wrote:
>>> On 09.01.23 08:22, Yin Fengwei wrote:
>>>> In a nutshell: 4k is too small and 2M is too big. We started
>>>> asking ourselves whether there was something in the middle that
>>>> we could do. This series shows what that middle ground might
>>>> look like. It provides some of the benefits of THP while
>>>> eliminating some of the downsides.
>>>>
>>>> This series uses "multiple consecutive pages" (mcpages) of
>>>> between 8K and 2M of base pages for anonymous user space
>>>> mappings. This will lead to less internal fragmentation versus
>>>> 2M mappings and thus less memory consumption and wasted CPU time
>>>> zeroing memory which will never be used.
>>>
>>> Hi,
>
> Hi,
>
>>>
>>> what I understand is that this is some form of faultaround for
>>> anonymous memory, with the special case that we try to allocate the
>>> pages consecutively.
>> For this patchset, yes. But mcpage can be enabled for page cache,
>> swapping etc.
>
> Right, PTE-mapping higher-order pages, in a faultaround fashion. But
> for pagecache etc. that doesn't require mcpage IMHO. I think it's the
> natural evolution of folios that Willy envisioned at some point.
Agree.

>
>>
>>> Some thoughts:
>>>
>>> (1) Faultaround might be unexpected for some workloads and increase
>>>     memory consumption unnecessarily.
>> Compared to THP, the memory consumption and latency introduced by
>> mcpage is minor.
>
> But it exists :)
Yes. There is extra memory consumption, even if it's minor.

>
>>
>>> Yes, something like that can happen with THP BUT
>>>
>>> (a) THP can be disabled or is frequently only enabled for madvised
>>>     regions -- for example, exactly for this reason.
>>> (b) Some workloads (especially memory ballooning) rely on memory not
>>>     suddenly re-appearing after MADV_DONTNEED. This works even with
>>>     THP, because the 4k MADV_DONTNEED will first PTE-map the THP.
>>>     Because there is a PTE page table, we won't suddenly get a THP
>>>     populated again (unless khugepaged is configured to fill holes).
>>>
>>>
>>> I strongly assume we will need something similar to force-disable,
>>> selectively-enable etc.
>> Agree.
>
> Thinking again, we might want to piggy-back on the THP
> machinery/config knobs completely, hmm. After all, it's a similar
> concept to a THP (once we properly handle folios), just that we are
> not able to PMD-map the folio because it is too small.
>
> Some applications that trigger MADV_NOHUGEPAGE don't want to get more
> pages populated than actually accessed. userfaultfd users come to
> mind, where we might not even have the guarantee to see a UFFD
> registration before enabling MADV_NOHUGEPAGE and filling out some
> pages ... if we'd populate too many PTEs, we could miss uffd faults
> later ...
This is a good point.

>
>>
>>>
>>> (2) This steals consecutive pages to immediately split them up
>>>
>>> I know, everybody thinks it might be valuable for their use case to
>>> grab all higher-order pages :) It will be "fun" once all these cases
>>> start competing. TBH, splitting them up immediately again smells
>>> like being the lowest priority among all higher-order users.
>>>
>> The motivations to split it immediately are:
>> 1. All the sub-pages are just normal 4K pages. No other changes need
>>    to be added to handle them.
>> 2. Splitting it before use doesn't involve complicated page lock
>>    handling.
>
> I think for an upstream version we really want to avoid these splits.
OK.
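To make the discussion concrete, the allocate-and-split path in this
series is conceptually like the sketch below (simplified, not the exact
patch code; MCPAGE_ORDER and mcpage_alloc() are illustrative names only,
and error handling is reduced to falling back to a single 4K page):

#include <linux/gfp.h>
#include <linux/mm.h>

#define MCPAGE_ORDER	2	/* e.g. a 16KB mcpage built from 4K base pages */

static struct page *mcpage_alloc(gfp_t gfp)
{
	struct page *page;

	/* Ask the buddy allocator for physically contiguous pages. */
	page = alloc_pages(gfp, MCPAGE_ORDER);
	if (!page)
		return NULL;	/* caller falls back to a single 4K page */

	/*
	 * Split immediately so that every sub-page is an ordinary
	 * order-0 page and the existing 4K page management applies
	 * without any changes.
	 */
	split_page(page, MCPAGE_ORDER);
	return page;
}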
>
>>>>
>>>> In the implementation, we allocate a high order page with the order
>>>> of mcpage (e.g., order 2 for a 16KB mcpage). This makes sure
>>>> physically contiguous memory is used and benefits sequential memory
>>>> access latency.
>>>>
>>>> Then split the high order page. By doing this, the sub-pages of an
>>>> mcpage are just normal 4K pages. The current kernel page management
>>>> is applied to "mc" pages without any changes. Batching page faults
>>>> is allowed with mcpage and reduces the number of page faults.
>>>>
>>>> There are costs with mcpage. Besides lacking the TLB benefit THP
>>>> brings, it increases memory consumption and the latency of page
>>>> allocation compared to a 4K base page.
>>>>
>>>> This series is the first step of mcpage. Future work can enable
>>>> mcpage for more components like page cache, swapping etc. Finally,
>>>> most pages in the system will be allocated/freed/reclaimed with
>>>> mcpage order.
>>>
>>> I think avoiding new, hard-to-get terminology ("mcpage") might be
>>> better. I know, everybody wants to give their child a name, but the
>>> name is not really future proof: "multiple consecutive pages" might
>>> at one point be maybe just a folio.
>>>
>>> I'd summarize the ideas as "faultaround" whereby we try optimizing
>>> for locality.
>>>
>>> Note that a similar (but different) concept already exists (hidden)
>>> for hugetlb, e.g., on arm64. The feature is called "cont-pte" -- a
>>> sequence of PTEs that logically map a hugetlb page.
>> "cont-pte" on ARM64 has a fixed size which matches the silicon
>> definition. mcpage allows software/the user to define the size, which
>> does not necessarily have to be exactly the same as the
>> silicon-defined one. Thanks.
>
> Yes. And the whole concept is abstracted away: it's logically a
> single, larger PTE, and we can only map/unmap in that PTE granularity.
David, thanks a lot for the comments.


Regards
Yin, Fengwei

>
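P.S. The batched page fault handling mentioned in the cover letter is
conceptually like the loop below (a simplified sketch only, not the code
in the series; rmap/LRU accounting, write permissions, locking and uffd
handling are omitted, and mcpage_map_ptes() is an illustrative name):

/* Install PTEs for all sub-pages of one mcpage in a single fault. */
static void mcpage_map_ptes(struct vm_area_struct *vma, unsigned long addr,
			    struct page *head, pte_t *pte)
{
	int i;

	for (i = 0; i < (1 << MCPAGE_ORDER); i++, addr += PAGE_SIZE, pte++) {
		struct page *page = head + i;
		pte_t entry = mk_pte(page, vma->vm_page_prot);

		page_add_new_anon_rmap(page, vma, addr);
		set_pte_at(vma->vm_mm, addr, pte, entry);
	}
}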