Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs

Jason Gunthorpe <jgg@xxxxxxxx> · Tue, 13 Jun 2023 12:17:23 -0300

On Fri, Jun 09, 2023 at 01:20:19PM -0700, James Houghton wrote:

> So, we could:
> 1. Do what HGM does and have the kernel unmap the 4K page in the
> userspace page tables.
> 2. On-the-fly change the VMA for our hugepage to not be HugeTLB
> anymore, and re-map all the good 4K pages.
> 3. Tell userspace that it must change its mapping from HugeTLB to
> something else, and move the good 4K pages into the new mapping.

> (2) feels like more complexity than (1). If a user created a
> MAP_HUGETLB mapping and now it isn't HugeTLB, that feels wrong.
> 
> (3) today isn't possible, but with Jiaqi's improvement to hugetlbfs
> read() it becomes possible. We'll need to have an extra 1G of memory
> while we are doing this copying/recovery, and it isn't transparent at
> all.

It is transparent to the VM, it just has a longer EPT fault response
time if the VM touches that range.

> (3) is additionally painful when considering live migration. We have
> to keep the 4K page unmapped after the migration (to keep it poisoned
> from the guest's perspective), but the page is no longer *actually*
> poisoned on the host. To get the memory we need to back our
> fake-poisoned pages with tmpfs, we would need to free our 1G page.
> Getting that page back later isn't trivial.

Why does this change with #1?

As David says you can't transparently "fix" the page, so when you
migrate a VM with unavailable pages it must migrate those unavailable
pages too, regardless if the kernel made them unavailable or
userspace did.

So, regardless, you end up with a VM that has holes in its address
map.

I guess if the hole is created from a PTE map of a 1G hugetlbfs it is
easier to "heal" back to a full 1G map, but this healing could also be
done by copying.

It seems to me the main value of the kernel-side approach is that it
eliminates the copies and makes the time the 1G page would be
unavailable to the guest shorter.

> So (1) still seems like the most natural solution, so the question
> becomes: how exactly do we implement 4K unmapping? And that brings us
> back to the main question about how HGM should be implemented in
> general.

IMHO if you can do it in userspace with a copy you can solve your
urgent customer need and then have more time to do the big kernel
rework required to optimize it with kernel support.

Jason