Once the last piece is unmapped (or simpler: once the complete subtree of
page tables is gone), we decrement refcount+mapcount. Might require some
brain power to do this tracking, but I wouldn't call it impossible right
from the start.
Would such a design violate other design aspects that are important?
This is actually how mapcount was treated in HGM RFC v1 (though not
refcount); it is doable for both [2].
One caveat here: if a page is unmapped in small pieces, it is
difficult to know if the page is legitimately completely unmapped (we
would have to check all the PTEs in the page table). In RFC v1, I
sidestepped this caveat by saying that "page_mapcount() is incremented
if the hstate-level PTE is present". A single unmap on the whole
hugepage will clear the hstate-level PTE, thus decrementing the
mapcount.
On a related note, there still exists an (albeit minor) API difference
vs. THPs: a piece of a page that is legitimately unmapped can still
have a positive page_mapcount().
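
To make the RFC v1 rule and the resulting difference concrete, here is a toy userspace model. It is only a sketch and not code from the RFC: the struct, the helper names and TOY_NR_PIECES are all made up for illustration. The mapcount follows only the presence of the hstate-level entry, so unmapping the small pieces one by one leaves it positive.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: number of small pieces one toy hugepage is split into. */
#define TOY_NR_PIECES 8

/* Toy stand-in for an hstate-level PTE plus the small PTEs below it. */
struct toy_hstate_pte {
	bool present;				/* hstate-level entry present? */
	bool piece_mapped[TOY_NR_PIECES];	/* lower-level PTEs */
};

/* Map one small piece; mapcount goes up only when the top entry appears. */
static void toy_map_piece(struct toy_hstate_pte *pte, int *mapcount, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_PIECES);
	if (!pte->present) {
		pte->present = true;
		(*mapcount)++;
	}
	pte->piece_mapped[idx] = true;
}

/*
 * Unmap one small piece. The hstate-level entry stays present, so the
 * mapcount is deliberately left alone: we never scan the other PTEs.
 */
static void toy_unmap_piece(struct toy_hstate_pte *pte, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_PIECES);
	pte->piece_mapped[idx] = false;
}

/* Unmap the whole hugepage: clear the top entry and drop the mapcount. */
static void toy_unmap_whole(struct toy_hstate_pte *pte, int *mapcount)
{
	if (!pte->present)
		return;
	for (int i = 0; i < TOY_NR_PIECES; i++)
		pte->piece_mapped[i] = false;
	pte->present = false;
	(*mapcount)--;
}
```

Mapping piece 0 and then unmapping it with toy_unmap_piece() leaves the mapcount at 1 even though nothing is mapped anymore; only toy_unmap_whole() brings it back to 0.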
Given that this approach allows us to retain the hugetlb vmemmap
optimization (and it wouldn't require a horrible amount of
complexity), I prefer this approach over the THP-like approach.
If we can store (directly/indirectly) metadata in the highest pgtable
that HGM-maps a hugetlb page, I guess the following would be reasonable:
* hugetlb page pointer
* mapped size
Whenever mapping/unmapping sub-parts, we'd have to update that information.
Once "mapped size" drops to 0, we know that the hugetlb page was
completely unmapped and we can drop the refcount+mapcount, clear
metadata (including hugetlb page pointer) [+ remove the page tables?].
Similarly, once "mapped size" corresponds to the hugetlb size, we can
immediately spot that everything is mapped.
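
A rough userspace sketch of that bookkeeping, purely for illustration. Every name here (struct toy_hugepage, struct hgm_meta, the hgm_* helpers) is hypothetical and does not exist in the kernel. It also assumes the caller never maps the same piece twice, since it tracks only a byte count and not which ranges are mapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hugetlb page; not a real kernel type. */
struct toy_hugepage {
	int refcount;
	int mapcount;
	size_t size;			/* hugepage size in bytes */
};

/* Hypothetical metadata stored with the highest HGM page table. */
struct hgm_meta {
	struct toy_hugepage *page;	/* NULL while nothing is mapped */
	size_t mapped_size;		/* bytes currently mapped */
};

/* Map 'len' bytes; the first piece takes refcount+mapcount. */
static void hgm_map_piece(struct hgm_meta *meta, struct toy_hugepage *page,
			  size_t len)
{
	assert(len > 0);
	if (meta->mapped_size == 0) {
		assert(meta->page == NULL);
		meta->page = page;
		page->refcount++;
		page->mapcount++;
	}
	assert(meta->page == page);
	assert(len <= page->size - meta->mapped_size);
	meta->mapped_size += len;
}

/* Unmap 'len' bytes; the last piece drops refcount+mapcount and metadata. */
static void hgm_unmap_piece(struct hgm_meta *meta, size_t len)
{
	assert(meta->page != NULL && len > 0 && len <= meta->mapped_size);
	meta->mapped_size -= len;
	if (meta->mapped_size == 0) {
		meta->page->refcount--;
		meta->page->mapcount--;
		meta->page = NULL;	/* page tables could be freed here */
	}
}

/* "mapped size" equal to the hugetlb size means everything is mapped. */
static bool hgm_fully_mapped(const struct hgm_meta *meta)
{
	return meta->page != NULL && meta->mapped_size == meta->page->size;
}
```

With this, the refcount+mapcount change exactly twice per mapping, on the first piece mapped and on the last piece unmapped, and no PTE scan is needed to tell when the page is gone.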
Again, just a high-level idea.
--
Thanks,
David / dhildenb