On 30/07/2024 17:11, David Hildenbrand wrote:
> On 30.07.24 17:19, Usama Arif wrote:
>>
>>
>> On 30/07/2024 16:14, Usama Arif wrote:
>>>
>>>
>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>> The current upstream default policy for THP is always. However, Meta
>>>>> uses madvise in production as the current THP=always policy vastly
>>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>>> excessive memory pressure and premature OOM killing.
>>>>> Using madvise + relying on khugepaged has certain drawbacks over
>>>>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>>>>> require userspace changes. Waiting for khugepaged to scan memory and
>>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>>>> (i.e. you dont know when the collapse will happen), while production
>>>>> environments require predictable performance. If there is enough memory
>>>>> available, its better for both performance and predictability to have
>>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>>> running out of memory.
>>>>>
>>>>> This patch-series is an attempt to mitigate the issue of running out of
>>>>> memory when THP is always enabled. During runtime whenever a THP is being
>>>>> faulted in or collapsed by khugepaged, the THP is added to a list.
>>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>>> shrinker which goes through the list and checks if the THP was underutilized,
>>>>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>>>>> If this number goes above a certain threshold, the shrinker will attempt
>>>>> to split that THP. Then at remap time, the pages that were zero-filled are
>>>>> not remapped, hence saving memory.
>>>>> This method avoids the downside of
>>>>> wasting memory in areas where THP is sparsely filled when THP is always
>>>>> enabled, while still providing the upside THPs like reduced TLB misses without
>>>>> having to use madvise.
>>>>>
>>>>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>>>>> tested with THP shrinker. The results after 2 hours are as follows:
>>>>>
>>>>>                         | THP=madvise | THP=always    | THP=always
>>>>>                         |             |               | + shrinker series
>>>>>                         |             |               | + max_ptes_none=409
>>>>> -----------------------------------------------------------------------------
>>>>> Performance improvement | -           | +1.8%         | +1.7%
>>>>> (over THP=madvise)      |             |               |
>>>>> -----------------------------------------------------------------------------
>>>>> Memory usage            | 54.6G       | 58.8G (+7.7%) | 55.9G (+2.4%)
>>>>> -----------------------------------------------------------------------------
>>>>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>>>>> (80%) zero filled filled pages will be split.
>>>>>
>>>>> To test out the patches, the below commands without the shrinker will
>>>>> invoke OOM killer immediately and kill stress, but will not fail with
>>>>> the shrinker:
>>>>>
>>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>>> mkdir /sys/fs/cgroup/test
>>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>>> # each THP, i.e. vm-stride 50K.
>>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>>>>> # killer.
>>>>> # Without the shrinker, OOM killer is invoked immediately irrespective
>>>>> # of max_ptes_none value and kill stress.
>>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>>
>>>>> Patches 1-2 add back helper functions that were previously removed
>>>>> to operate on page lists (needed by patch 3).
>>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>>> waiting for page reclaim or migration.
>>>>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>>>>> subpages when splitting THP.
>>>>> Patches 6 adds support for THP shrinker.
>>>>>
>>>>> (This patch-series restarts the work on having a THP shrinker in kernel
>>>>> originally done in
>>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@xxxxxx/.
>>>>> The THP shrinker in this series is significantly different than the
>>>>> original one, hence its labelled v1 (although the prerequisite to not
>>>>> remap clean subpages is the same).)
>>>>
>>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>>
>>>> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>>>>
>>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>>
>>>
>>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>>
>>> 	if (userfaultfd_armed(pvmw->vma)) {
>>> 		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>> 					       pvmw->vma->vm_page_prot));
>>> 		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>> 		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>> 	}
>>
>>
>> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
>
> I remember one ugly case in QEMU with postcopy live-migration where we must not zap zero-filled pages. I am not 100% regarding THP (if it could be enabled at that point), but imagine the following
>
> 1) mmap(), enable THP
> 2) Migrate a bunch of pages from the source during precopy (writing to
>    the memory).
>    Might end up creating THPs (during fault/khugepaged)
> 3) Register UFFD on the VMA
> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
> 5) Discard any pages that have been re-dirtied or not migrated yet
> 6) Migrate-on-demand any holes using uffd
>
>
> If we discard zero-filled pages between 2) and 3) we might get wrong uffd notifications in 6 for pages that have already been migrated).
>
> I'll have to check if that actually happens in that sequence in QEMU: if QEMU would disable THP right before 2) we would be safe. But I recall that it is not the case :/
>

Thanks for the example! Just to understand the issue better, as I am not
very familiar with the live-migration code: the problem is only for
zero-filled pages that were migrated, right?

If a THP is created, one of its subpages is a zero-filled page that was
migrated, and the THP is split before the VMA is armed with uffd, then
userfaultfd_armed(pvmw->vma) will return false when splitting and the
entry will become pte_none. Afterwards, when the destination faults on
it, uffd will see that the pte is clear and will have to request the
zero-filled page from the source again.

If I understand the example correctly, the below diff over patch 6 should
be good? i.e. just point to the empty_zero_page instead of doing
pte_clear. This should still use the same amount of memory, although
ptep_clear_flush means it might be slightly more expensive.

diff --git a/mm/migrate.c b/mm/migrate.c
index 2731ac20ff33..52aa4770fbed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -206,14 +206,10 @@ static bool try_to_unmap_unused(struct page_vma_mapped_walk *pvmw,
 	if (dirty)
 		return false;
 
-	pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
-
-	if (userfaultfd_armed(pvmw->vma)) {
-		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
-					       pvmw->vma->vm_page_prot));
-		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
-		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
-	}
+	newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+				       pvmw->vma->vm_page_prot));
+	ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
 
 	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
 	return true;
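
For anyone following along, the split criterion from the quoted cover letter (a THP is a split candidate when more than max_ptes_none of its 512 base 4K pages are zero-filled) can be approximated in userspace as below. This is only a rough sketch and not the kernel code from the series: the helper names (subpage_is_zero_filled, count_zero_subpages, thp_is_underutilized, make_sparse_thp) are made up for illustration, it works on a plain 2M heap buffer rather than a folio, and it assumes 4K base pages with a 2M THP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SUBPAGE_SIZE     4096u	/* assumed base page size (4K) */
#define SUBPAGES_PER_THP 512u	/* assumed 2M THP / 4K base pages */

/* Hypothetical helper: true if one 4K subpage holds only zero bytes. */
static bool subpage_is_zero_filled(const unsigned char *subpage)
{
	for (size_t i = 0; i < SUBPAGE_SIZE; i++) {
		if (subpage[i] != 0)
			return false;
	}
	return true;
}

/* Count the zero-filled subpages of one THP-sized buffer. */
static unsigned int count_zero_subpages(const unsigned char *thp)
{
	unsigned int zero = 0;

	for (unsigned int i = 0; i < SUBPAGES_PER_THP; i++) {
		if (subpage_is_zero_filled(thp + (size_t)i * SUBPAGE_SIZE))
			zero++;
	}
	return zero;
}

/*
 * Split candidate if more than max_ptes_none subpages are zero-filled
 * ("more than" is taken from the cover letter's wording).
 */
static bool thp_is_underutilized(const unsigned char *thp,
				 unsigned int max_ptes_none)
{
	return count_zero_subpages(thp) > max_ptes_none;
}

/*
 * Illustration only: allocate a zeroed THP-sized buffer and dirty one
 * byte in each of the first `touched` subpages. Returns NULL on
 * allocation failure; the caller frees the buffer.
 */
static unsigned char *make_sparse_thp(unsigned int touched)
{
	unsigned char *thp;

	assert(touched <= SUBPAGES_PER_THP);
	thp = calloc(SUBPAGES_PER_THP, SUBPAGE_SIZE);
	if (!thp)
		return NULL;
	for (unsigned int i = 0; i < touched; i++)
		thp[(size_t)i * SUBPAGE_SIZE] = 1;
	return thp;
}
```

With 40 of 512 subpages touched, as in the stress example from the cover letter, 472 subpages are zero-filled. thp_is_underutilized() is then true for max_ptes_none values of 409 and 470 and false for 480, which lines up with the "470 and below" comment in the test commands.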