Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs

On Mon, Mar 6, 2023 at 11:19 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
> This is past the deadline, so feel free to ignore.  However, ...
>
> James Houghton has been working on the concept of HugeTLB High Granularity
> Mapping (HGM) as discussed here:
> https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@xxxxxxxxxx/
>
> The primary motivation for this work is post-copy live migration of VMs backed
> by hugetlb pages via userfaultfd.  A followup use case is more gracefully
> handling memory errors/poison on hugetlb pages.
>
> As can be seen by the size of James's patch set, the required changes for
> HGM are a bit complex and involved.  This is also complicated by the need
> to choose a 'mapcount strategy', as the previous scheme used by hugetlb
> will no longer work.
>
> A HGM for hugetlbfs session would present the current approach and challenges.
> While much of the work is confined to hugetlb, there is a bit of spillover to
> other mm areas: specifically page table walking.  A discussion on ways to
> move forward with this effort would be appreciated.
> --
> Mike Kravetz

Hi everyone,

If you came to the HGM session at LSF/MM/BPF, thank you! I want to
address some of the feedback I got and restate the importance of HGM,
especially as it relates to handling memory poison.

## Memory poison is a problem

HGM allows us to unmap poisoned memory at 4K granularity instead of
unmapping the entire hugetlb page. For applications that use HugeTLB,
losing the entire hugepage can be catastrophic. For example, if a
hypervisor is using 1G pages for guest memory, the VM will lose 1G of
its physical address space (even losing 2M will most likely kill the
VM). If we can limit the poisoning to only 4K, the VM will most likely
be able to recover. This improved recoverability applies to other
HugeTLB users as well, like databases.
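
To put rough numbers on that difference, here is a back-of-the-envelope
sketch (illustrative arithmetic only, not code from the series; the
helper name lost_bytes is made up for this example). It assumes a
single poisoned 4K base page inside one hugetlb page.

```python
# Illustrative arithmetic only: how much of a mapping is lost when one
# 4K base page inside a hugetlb page is poisoned.
KIB = 1024
BASE_PAGE = 4 * KIB        # 4K base page
HUGE_2M = 2 * KIB * KIB    # 2M hugetlb page
HUGE_1G = KIB * KIB * KIB  # 1G hugetlb page


def lost_bytes(hugepage_size, hgm=False):
    """Bytes unmapped after a single 4K poison event (hypothetical helper).

    Without HGM the whole hugetlb page is unmapped; with HGM only the
    poisoned 4K piece is.
    """
    return BASE_PAGE if hgm else hugepage_size


for name, size in (("2M", HUGE_2M), ("1G", HUGE_1G)):
    without = lost_bytes(size) // BASE_PAGE
    with_hgm = lost_bytes(size, hgm=True) // BASE_PAGE
    print(f"{name}: {without} base pages lost without HGM, {with_hgm} with HGM")
```

For a 1G page that is 262144 base pages lost today versus 1 with
4K-granularity unmapping.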

## Adding a new filesystem has risks, and unification will take years

Most of the feedback I got from the HGM session was to simply avoid
adding new code to HugeTLB, and instead to make a new device or
filesystem. Creating a new device or filesystem could work, but it
leaves existing HugeTLB users with no answer for memory poison. Users
would need to switch to the new device/filesystem if they want better
hwpoison handling, and it will probably take years for the new
device/filesystem to support all the features that HugeTLB supports
today (so beyond PUD+ mappings, we would need page table sharing, page
struct freeing, and even private mappings/CoW).

If we make a new filesystem and are unable to implement the HugeTLB
uapi exactly with that filesystem, we will be stuck, unable to
remove HugeTLB.  We would strongly like to avoid coexisting HugeTLB
implementations (similar to cgroup v1 and cgroup v2) if at all
possible.

Instead of making a new filesystem, we could add HugeTLB-like features
to tmpfs, such as support for gigantic page allocations (from bootmem or
CMA, like HugeTLB). This path would work to mostly unify
HugeTLB with tmpfs, but existing HugeTLB users will still have to wait
for many years before poison can be handled more efficiently. (And
some users care about things like hugetlb_cgroup!)

## HGM doesn’t hinder future unification

HGM doesn’t add any new special cases into mm code; it takes advantage
of the special cases that already exist to support HugeTLB.
HGM also isn’t adding a completely novel feature that can’t be
replicated by THPs: PTE-mapping of THPs is already supported.

HGM solves a problem that HugeTLB users have right now: unnecessarily
large portions of memory are poisoned. Unless we fix HugeTLB itself,
we will have to spend years effectively rewriting HugeTLB and telling
users to switch to the new system that gets built.

Given all this, I think we should continue to move forward with HGM
unless there is another feasible way to solve poisoning for existing
HugeTLB users. Also, I encourage everyone to read the series itself
(it's not all that complicated!).

- James




