On Mon, Mar 6, 2023 at 11:19 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
> This is past the deadline, so feel free to ignore. However, ...
>
> James Houghton has been working on the concept of HugeTLB High Granularity
> Mapping (HGM) as discussed here:
> https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@xxxxxxxxxx/
>
> The primary motivation for this work is post-copy live migration of VMs
> backed by hugetlb pages via userfaultfd. A followup use case is more
> gracefully handling memory errors/poison on hugetlb pages.
>
> As can be seen by the size of James's patch set, the required changes for
> HGM are a bit complex and involved. This is also complicated by the need to
> choose a 'mapcount strategy', as the previous scheme used by hugetlb will
> no longer work.
>
> An HGM for hugetlbfs session would present the current approach and
> challenges. While much of the work is confined to hugetlb, there is a bit
> of spillover into other mm areas, specifically page table walking. A
> discussion on ways to move forward with this effort would be appreciated.
> --
> Mike Kravetz

Hi everyone,

If you came to the HGM session at LSF/MM/BPF, thank you! I want to address
some of the feedback I got and restate the importance of HGM, especially as
it relates to handling memory poison.

## Memory poison is a problem

HGM allows us to unmap poison at 4K granularity instead of unmapping the
entire hugetlb page. For applications that use HugeTLB, losing an entire
hugepage to poison can be catastrophic. For example, if a hypervisor is
using 1G pages for guest memory, the VM loses 1G of its physical address
space, which it is very unlikely to survive (even losing 2M will most
likely kill the VM). If we can limit the poisoning to only 4K, the VM will
most likely be able to recover. This improved recoverability applies to
other HugeTLB users as well, like databases. (A short sketch below shows
how this granularity is visible to userspace.)

## Adding a new filesystem has risks, and unification will take years

Most of the feedback I got from the HGM session was to simply avoid adding
new code to HugeTLB, and instead to make a new device or filesystem.
Creating a new device or filesystem could work, but it leaves existing
HugeTLB users with no answer for memory poison. Users would need to switch
to the new device/filesystem if they want better hwpoison handling, and it
will probably take years for the new device/filesystem to support all the
features that HugeTLB supports today (beyond PUD+ mappings, we would need
page table sharing, page struct freeing, and even private mappings/CoW).

If we make a new filesystem and are unable to implement the HugeTLB uapi
exactly with that filesystem, we will be stuck, unable to remove HugeTLB.
We would strongly like to avoid coexisting HugeTLB implementations (similar
to cgroup v1 and cgroup v2) if at all possible.

Instead of making a new filesystem, we could add HugeTLB-like features to
tmpfs, such as support for gigantic page allocations (from bootmem or CMA,
like HugeTLB). This path could mostly unify HugeTLB with tmpfs, but
existing HugeTLB users would still have to wait many years before poison
can be handled more gracefully. (And some users care about things like
hugetlb_cgroup!)

## HGM doesn't hinder future unification

HGM doesn't add any new special cases to mm code; it takes advantage of the
special cases that already exist to support HugeTLB. HGM also isn't adding
a completely novel feature that can't be replicated by THPs: PTE-mapping of
THPs is already supported.
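Here is that sketch. To be clear, this is just an illustration from me, not
code from the series: it relies only on the existing machine-check SIGBUS
delivery, where the kernel reports log2 of the poisoned range in
si_addr_lsb. Today a hugetlb mapping reports the whole hugepage here; the
point of HGM is to let the kernel unmap (and report) just the affected 4K.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustrative only: how a VMM sees the granularity of poison.
 * si_addr_lsb is log2(bytes lost): 30 for a 1G hugetlb page today,
 * 21 for 2M; with HGM it could be 12 (one 4K page).
 */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
        unsigned long lost = 1UL << info->si_addr_lsb;

        /* fprintf isn't async-signal-safe; fine for a sketch. */
        fprintf(stderr, "poison at %p: %lu bytes unusable\n",
                info->si_addr, lost);
        /*
         * A real VMM would retire just this range and keep the guest
         * running; losing 4K is survivable, losing 1G usually is not.
         */
    }
    _exit(1);
}

int main(void)
{
    struct sigaction sa = {
        .sa_sigaction = sigbus_handler,
        .sa_flags = SA_SIGINFO,
    };

    sigaction(SIGBUS, &sa, NULL);
    /* ... map hugetlb-backed guest memory and run the guest ... */
    return 0;
}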
HGM solves a problem that HugeTLB users have right now: unnecessarily large
portions of memory are poisoned. Unless we fix HugeTLB itself, we will have
to spend years effectively rewriting HugeTLB and telling users to switch to
the new system that gets built.

Given all this, I think we should continue to move forward with HGM unless
there is another feasible way to solve poisoning for existing HugeTLB
users. Also, I encourage everyone to read the series itself (it's not all
that complicated!).

- James
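P.S. For anyone who wants to see what the live-migration motivation Mike
mentioned looks like in terms of the uapi, here is a rough sketch from me
(error handling elided; again, not code from the series). Everything below
is the existing userfaultfd minor-fault flow; the only thing HGM changes is
that the length passed to UFFDIO_CONTINUE on a hugetlb VMA could be 4K
instead of the full hugepage size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Must be the hugepage size today; with HGM this could be 4096. */
#define CONTINUE_GRANULARITY (1UL << 30)

static int setup_minor_uffd(void *hugetlb_base, unsigned long len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_MINOR_HUGETLBFS,
    };
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)hugetlb_base, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MINOR,
    };

    ioctl(uffd, UFFDIO_API, &api);
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    return uffd;
}

static void serve_one_fault(int uffd)
{
    struct uffd_msg msg;

    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT)
        return;

    /*
     * Copy the contents from the migration source into the hugetlbfs
     * file through a second mapping (not shown), then install the
     * page table entry for the faulting range:
     */
    struct uffdio_continue cont = {
        .range = {
            .start = msg.arg.pagefault.address &
                     ~(CONTINUE_GRANULARITY - 1),
            .len = CONTINUE_GRANULARITY,
        },
    };
    ioctl(uffd, UFFDIO_CONTINUE, &cont);
}

With today's HugeTLB, that final ioctl cannot be issued until the source
has sent the entire 1G page; with HGM, the faulting vCPU could be unblocked
as soon as a single 4K page arrives.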