On Fri, Jun 09, 2023 at 01:20:19PM -0700, James Houghton wrote:

> So, we could:
> 1. Do what HGM does and have the kernel unmap the 4K page in the
> userspace page tables.
> 2. On-the-fly change the VMA for our hugepage to not be HugeTLB
> anymore, and re-map all the good 4K pages.
> 3. Tell userspace that it must change its mapping from HugeTLB to
> something else, and move the good 4K pages into the new mapping.

> (2) feels like more complexity than (1). If a user created a
> MAP_HUGETLB mapping and now it isn't HugeTLB, that feels wrong.
>
> (3) today isn't possible, but with Jiaqi's improvement to hugetlbfs
> read() it becomes possible. We'll need to have an extra 1G of memory
> while we are doing this copying/recovery, and it isn't transparent at
> all.

It is transparent to the VM, it just has a longer EPT fault response
time if the VM touches that range.

> (3) is additionally painful when considering live migration. We have
> to keep the 4K page unmapped after the migration (to keep it poisoned
> from the guest's perspective), but the page is no longer *actually*
> poisoned on the host. To get the memory we need to back our
> fake-poisoned pages with tmpfs, we would need to free our 1G page.
> Getting that page back later isn't trivial.

Why does this change with #1? As David says you can't transparently
"fix" the page, so when you migrate a VM with unavailable pages it
must migrate those unavailable pages too, regardless if the kernel
made them unavailable or userspace did.

So, regardless, you end up with a VM that has holes in its address
map. I guess if the hole is created from a PTE map of a 1G hugetlbfs
it is easier to "heal" back to a full 1G map, but this healing could
also be done by copying.

It seems to me the main value of the kernel-side approach is that it
eliminates the copies and makes the time the 1G page would be
unavailable to the guest shorter.
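For illustration only, the userspace copy in option (3) might look
roughly like the sketch below. It is not taken from any posted patch.
It assumes that a read() of a poisoned 4K subpage through hugetlbfs
fails with EIO while the healthy subpages stay readable. The names
copy_good_subpages, SUBPAGE_SIZE and bad[] are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* for pread() */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SUBPAGE_SIZE 4096UL /* base page size; 4K as in the thread */

/*
 * Copy `len` bytes from file `fd` into `dst`, one 4K subpage at a time.
 *
 * ASSUMPTION: a poisoned subpage shows up as pread() failing with EIO.
 * Such a subpage is zero-filled in `dst` and flagged in `bad[]`, so the
 * caller can keep it unavailable to the guest.
 *
 * Returns the number of bad subpages, or -1 on any other failure
 * (including hitting end-of-file before `len` bytes were read).
 */
long copy_good_subpages(int fd, unsigned char *dst, size_t len,
                        unsigned char *bad)
{
    long nbad = 0;

    assert(len % SUBPAGE_SIZE == 0);

    for (size_t off = 0; off < len; off += SUBPAGE_SIZE) {
        size_t idx = off / SUBPAGE_SIZE;
        size_t done = 0;

        bad[idx] = 0;
        while (done < SUBPAGE_SIZE) {
            ssize_t n = pread(fd, dst + off + done, SUBPAGE_SIZE - done,
                              (off_t)(off + done));
            if (n > 0) {
                done += (size_t)n;          /* partial reads are fine */
            } else if (n < 0 && errno == EINTR) {
                continue;                   /* interrupted, retry */
            } else if (n < 0 && errno == EIO) {
                /* Treated as poisoned: leave a hole instead of data. */
                memset(dst + off, 0, SUBPAGE_SIZE);
                bad[idx] = 1;
                nbad++;
                break;
            } else {
                return -1;                  /* EOF or unexpected error */
            }
        }
    }
    return nbad;
}
```

A caller would pass an fd for the hugetlbfs file backing the 1G page, a
destination buffer in the new (non-HugeTLB) mapping, and a bad[] array
with one entry per 4K subpage. The entries set to 1 are the holes that
have to stay unavailable to the guest, including across migration.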

> So (1) still seems like the most natural solution, so the question
> becomes: how exactly do we implement 4K unmapping? And that brings us
> back to the main question about how HGM should be implemented in
> general.

IMHO if you can do it in userspace with a copy you can solve your
urgent customer need and then have more time to do the big kernel
rework required to optimize it with kernel support.

Jason