On Thu, Aug 11, 2022 at 2:55 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Thu, Aug 11, 2022 at 1:20 PM Alex Zhu (Kernel) <alexlzhu@xxxxxx> wrote:
> >
> > Hi Yu,
> >
> > I’ve updated your patch set from last year to work with folio and am testing it now. The functionality in split_huge_page() is the same as what I have. Was there any follow up work done later?
>
> Yes, but it won't change the landscape any time soon (see below). So
> please feel free to continue along your current direction.
>
> > If not, I would like to incorporate this into what I have, and then resubmit. Will reference the original patchset. We need this functionality for the shrinker, but even the changes to split_huge_page() by themselves should show some performance improvement when used by the existing deferred_split_huge_page().
>
> SGTM. Thanks!
>
> A side note:
>
> I'm working on a new mode: THP=auto, meaning the kernel will detect
> internal fragmentation of 2MB compound pages to decide whether to map
> them by PMDs or split them under memory pressure. The general workflow
> of this new mode is as follows.

I tend to agree that avoiding allocating THP in the first place is the
preferred way to avoid internal fragmentation. But I have some
questions about your design/implementation:

>
> In the page fault path:
> 1. Compound pages are allocated as usual.
> 2. Each is mapped by 512 consecutive PTEs rather than a PMD.
> 3. There will be more TLB misses but the same number of page faults.
> 4. TLB coalescing can mitigate the performance degradation.

Why not just allocate base pages in the first place? khugepaged has
the max_ptes_none tunable to detect internal fragmentation. If you are
worried about zero pages, you could add a max_pte_zero tunable.

Or did you investigate whether the new MADV_COLLAPSE would be helpful?
It leaves the decision to userspace.

>
> In khugepaged:
> 1. Check the dirty bit in the PTEs mapping a compound page, to
> determine its utilization.
> 2. Remap compound pages that meet a certain utilization threshold by
> PMDs in place, i.e., no migrations.
>
> In the reclaim path, e.g., MGLRU page table scanning:
> 1. Decide whether compound pages mapped by PTEs should be split based
> on their utilizations and memory pressure, e.g., reclaim priority.
> 2. Clean subpages should be freed directly after split, rather than swapped out.
>
> N.B.
> 1. This workflow relies on the dirty bit rather than examining the content of a page.
> 2. Sampling can be done by periodically switching between a PMD and
> 512 consecutive PTEs.
> 3. It only needs to hold mmap_lock for read because this special mode
> (512 consecutive PTEs) is not considered the split mode.
> 4. Don't hold your breath :)
>
> Other references:
> 1. https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
> 2. https://www.usenix.org/system/files/osdi21-hunter.pdf
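
To check that I am reading the khugepaged and reclaim steps quoted above correctly, here is a rough userspace model of the two decisions. It is only a sketch, not code from the patches: the function names, the min_dirty threshold and the split_priority cutoff are all made up for illustration, the PTEs are modelled as plain 64-bit words, and bit 6 is used as the dirty bit as on x86. It also assumes that a numerically lower reclaim priority means more memory pressure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_PMD 512          /* 2MB compound page / 4KB base pages */
#define PTE_DIRTY    (1ULL << 6)  /* dirty bit in this model (bit 6, as on x86) */

/* Count how many of the 512 PTEs mapping one compound page are dirty. */
static unsigned int count_dirty_ptes(const uint64_t *ptes)
{
	unsigned int dirty = 0;

	for (size_t i = 0; i < PTES_PER_PMD; i++)
		if (ptes[i] & PTE_DIRTY)
			dirty++;
	return dirty;
}

/*
 * khugepaged side (hypothetical helper): remap by a PMD in place once
 * at least min_dirty subpages have been written to.
 */
static bool should_remap_by_pmd(const uint64_t *ptes, unsigned int min_dirty)
{
	return count_dirty_ptes(ptes) >= min_dirty;
}

/*
 * Reclaim side (hypothetical helper): split only when utilization is
 * below the threshold and pressure is high enough, i.e. the reclaim
 * priority has dropped to split_priority or lower.
 */
static bool should_split(const uint64_t *ptes, unsigned int min_dirty,
			 int priority, int split_priority)
{
	return priority <= split_priority &&
	       count_dirty_ptes(ptes) < min_dirty;
}
```

In this model a compound page with 100 dirty subpages and min_dirty = 256 stays mapped by PTEs and becomes a split candidate once the priority drops to split_priority. The clean subpages would then be freed directly.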