Hey David,

Thanks for taking the time to review!

David Hildenbrand <david@xxxxxxxxxx> wrote on Thu, Jan 18, 2024 at 02:41:
>
> On 17.01.24 18:10, Zach O'Keefe wrote:
> > [+linux-mm & others]
> >
> > On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >>
> >> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >>
> >> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> >> make a least-effort attempt at a synchronous collapse of memory at
> >> their own expense.
> >>
> >> The only difference from MADV_COLLAPSE is that the new hugepage allocation
> >> avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >>
> >> The benefits of this approach are:
> >>
> >> * CPU is charged to the process that wants to spend the cycles for the THP
> >> * Avoid unpredictable timing of khugepaged collapse
> >> * Prevent unpredictable stalls caused by direct reclaim and/or compaction
> >>
> >> Semantics
> >>
> >> This call is independent of the system-wide THP sysfs settings, but will
> >> fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
> >> multiple VMAs, the semantics of the collapse over each VMA is independent
> >> from the others. This implies a hugepage cannot cross a VMA boundary. If
> >> collapse of a given hugepage-aligned/sized region fails, the operation may
> >> continue to attempt collapsing the remainder of memory specified.
> >>
> >> The memory ranges provided must be page-aligned, but are not required to
> >> be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
> >> start/end of the range will be clamped to the first/last hugepage-aligned
> >> address covered by said range. The memory ranges must span at least one
> >> hugepage-sized region.
> >>
> >> All non-resident pages covered by the range will first be
> >> swapped/faulted-in, before being internally copied onto a freshly
> >> allocated hugepage.
> >> Unmapped pages will have their data directly
> >> initialized to 0 in the new hugepage. However, for every eligible
> >> hugepage aligned/sized region to-be collapsed, at least one page must
> >> currently be backed by memory (a PMD covering the address range must
> >> already exist).
> >>
> >> Allocation for the new hugepage will not enter direct reclaim and/or
> >> compaction, quickly failing if allocation fails. When the system has
> >> multiple NUMA nodes, the hugepage will be allocated from the node providing
> >> the most native pages. This operation operates on the current state of the
> >> specified process and makes no persistent changes or guarantees on how pages
> >> will be mapped, constructed, or faulted in the future.
> >>
> >> Return Value
> >>
> >> If all hugepage-sized/aligned regions covered by the provided range were
> >> either successfully collapsed, or were already PMD-mapped THPs, this
> >> operation will be deemed successful. On success, madvise(2) returns 0.
> >> Else, -1 is returned and errno is set to indicate the error for the
> >> most-recently attempted hugepage collapse. Note that many failures might
> >> have occurred, since the operation may continue to collapse in the event a
> >> single hugepage-sized/aligned region fails.
> >>
> >> ENOMEM  Memory allocation failed or VMA not found
> >> EBUSY   Memcg charging failed
> >> EAGAIN  Required resource temporarily unavailable. Trying again
> >>         might succeed.
> >> EINVAL  Other error: No PMD found, subpage doesn't have Present
> >>         bit set, "Special" page not backed by struct page, VMA
> >>         incorrectly sized, address not page-aligned, ...
> >>
> >> Use Cases
> >>
> >> An immediate user of this new functionality is the Go runtime heap allocator
> >> that manages memory in hugepage-sized chunks.
> >> In the past, whether it was a
> >> newly allocated chunk through mmap() or a reused chunk released by
> >> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> >> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> >> respectively. However, both approaches resulted in performance issues; for
> >> both scenarios, there could be entries into direct reclaim and/or compaction,
> >> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> >> madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
> >>
> >> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> >> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> >> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> >> [4] https://github.com/golang/go/issues/63334
> >
> > Thanks for the patch, Lance, and thanks for providing the links above,
> > referring to issues Go has seen.
> >
> > I've reached out to the Go team to try and understand their use case,
> > and how we could help. It's not immediately clear whether a
> > lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> > be.
> >
> > That said, with respect to the implementation, should a need for a
> > lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> > process_madvise(2) be the "v2" of madvise(2), where we can start
> > leveraging the forward-facing flags argument for these different
> > advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> > ("mm/madvise: remove racy mm ownership check") so that
> > process_madvise(2) can always operate on self. IIRC, this was ~ the
> > plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> > sane default, and implement options in flags down the line).
>
> +1, using process_madvise() would likely be the right approach.

Thanks for your suggestion!
I completely agree :)

Lance

>
> --
> Cheers,
>
> David / dhildenb
>
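
For anyone following the thread, here is a minimal C sketch of the range
clamping and errno handling described in the quoted Semantics and Return
Value text, using the existing MADV_COLLAPSE advice. It is an illustration
only, not code from the patch. clamp_to_hugepages() and try_collapse() are
hypothetical helper names, the 2 MiB hugepage size is an assumption
(x86-64 with 4 KiB base pages), and the fallback value 25 for MADV_COLLAPSE
is the one in the asm-generic uapi header, for libc headers that predate
Linux 6.1. MADV_TRY_COLLAPSE is only a proposal, so it is deliberately not
defined here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Fallback for older libc headers; value from the asm-generic uapi header. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB PMD-sized hugepages (x86-64, 4 KiB base pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: clamp [start, start + len) inward to hugepage
 * boundaries by rounding the start up and the end down. Stores the first
 * hugepage-aligned address in *first and returns the clamped length, or
 * returns 0 if the range does not cover one whole hugepage-sized region.
 */
size_t clamp_to_hugepages(uintptr_t start, size_t len, uintptr_t *first)
{
    assert(len <= UINTPTR_MAX - start); /* range must not wrap around */

    uintptr_t end = start + len;
    uintptr_t hstart = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    uintptr_t hend = end & ~(HPAGE_SIZE - 1);

    if (hend <= hstart)
        return 0;
    *first = hstart;
    return (size_t)(hend - hstart);
}

/*
 * Hypothetical helper: attempt a synchronous collapse of [addr, addr + len).
 * Returns 0 on success, otherwise the errno value set by madvise(2), which
 * reflects the most recently attempted hugepage collapse.
 */
int try_collapse(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return errno;
}
```

A caller would clamp first, skip the call when the clamped length is 0, and
treat a nonzero return from try_collapse() as a hint to keep using base
pages. Kernels that do not know the advice value reject it with EINVAL, so
that case needs no special handling beyond the same fallback.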