On 01/19/23 08:57, James Houghton wrote:
> > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > page table sub-tree.
> > > > >
> > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > something else. That's very different to THP in private mappings ...
> > > > >
> > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > Other pieces in the same subtree don't do that.
> > > > >
> > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > from the start.
> > > > >
> > > > > Would such a design violate other design aspects that are important?
> > >
> > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > refcount); it is doable for both [2].
> >
> > My apologies for being late to the party :)
> >
> > When Peter first brought up the issue with ref/map_count overflows I was
> > thinking that we should use a scheme like David describes above. As
> > James points out, this was the approach taken in the first RFC.
> >
> > > One caveat here: if a page is unmapped in small pieces, it is
> > > difficult to know if the page is legitimately completely unmapped (we
> > > would have to check all the PTEs in the page table).
> >
> > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > We must be if you are concerned with it. What API would cause this?
> > I just do not remember this discussion.
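[As an aside for readers following along: the "one reference per subtree" idea quoted at the top can be modeled in a few lines. The sketch below is an illustrative userspace toy, not kernel code; every name in it (struct subtree, map_piece, and so on) is invented for the example.]

```c
/*
 * Toy model of subtree-level ref/map counting: the first 4K piece
 * mapped under a hugetlb page's page-table subtree takes a single
 * refcount+mapcount, later pieces do not, and the counts drop only
 * when the last piece is unmapped.  Userspace sketch, invented names.
 */
#include <assert.h>
#include <stdbool.h>

#define PIECES_PER_HPAGE 512            /* 2M huge page as 512 x 4K pieces */

struct subtree {
	bool mapped[PIECES_PER_HPAGE];  /* which 4K pieces are mapped */
	int nr_mapped;                  /* how many pieces are mapped */
	int refcount;                   /* this subtree's contribution */
	int mapcount;
};

static void map_piece(struct subtree *st, int idx)
{
	if (st->mapped[idx])
		return;
	if (st->nr_mapped == 0) {
		/* first piece in the subtree: take the one reference */
		st->refcount++;
		st->mapcount++;
	}
	st->mapped[idx] = true;
	st->nr_mapped++;
}

static void unmap_piece(struct subtree *st, int idx)
{
	if (!st->mapped[idx])
		return;
	st->mapped[idx] = false;
	st->nr_mapped--;
	if (st->nr_mapped == 0) {
		/* last piece gone: drop the one reference */
		st->refcount--;
		st->mapcount--;
	}
}
```

[Note that unmap_piece() only knows the subtree has become empty because of the nr_mapped bookkeeping. That is exactly the caveat raised above: without some such metadata, the kernel would have to scan all the PTEs in the subtree to tell whether the page is legitimately completely unmapped.]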
> There was some discussion about allowing MADV_DONTNEED on
> less-than-hugepage pieces [3] (it actually motivated the switch from
> UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> in this series, but it could be implemented in the future.

OK, so we do not actually create HGM mappings until a uffd operation is
done at a less than huge page size granularity. MADV_SPLIT just says
that HGM mappings are 'possible' for this vma. Hopefully, my
understanding is correct. I was concerned about things like the page
fault path, but in that case we have already 'entered HGM mode' via a
uffd operation.

Both David and Peter have asked whether eliminating intermediate mapping
levels would be a simplification. I trust your response that it would
not help much in the current design/implementation. But, it did get me
thinking about something else.

Perhaps we have discussed this before, and perhaps it does not meet all
user needs, but one way to possibly simplify this is:

- 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate)
  granularity.
- MADV_SPLIT implicitly unmaps everything within the range.
- MADV_SPLIT says all mappings for this vma will now be done at a base
  (4K) page size granularity. vma would be marked some way.
- I think this eliminates the need for hugetlb_pte's as we KNOW the
  mapping size.
- We still use huge pages to back 4K mappings, and we still have to
  deal with the ref/map_count issues.
- Code touching hugetlb page tables would KNOW the mapping size up
  front.

Again, apologies if we talked about and previously dismissed this type
of approach.

> > When I was thinking about this I was a bit concerned about having enough
> > information to know exactly when to inc or dec counts. I was actually
> > worried about knowing to do the increment. I don't recall how it was
> > done in the first RFC, but from a high level it would need to be done
> > when the first hstate level PTE is allocated/added to the page table.
> > Right?
> > My concern was with all the places where we could 'error out'
> > after allocating the PTE, but before initializing it. I was just thinking
> > that we might need to scan the page table or keep metadata for better
> > or easier accounting.

> The only two places where we can *create* a high-granularity page
> table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
> copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
> where we error out. To correctly handle these cases, we basically have
> to do the pagecache lookup before touching the page table.
>
> 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the
> PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do
> so immediately. We can easily pass the page into
> hugetlb_mcopy_atomic_pte() (via 'pagep').
>
> 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
> lookup before we do the page table walk. I'm not sure how to support
> non-shared HGM mappings with this scheme (in this series, we also
> don't support non-shared; we return -EINVAL).
>
> NB: The only case where high-granularity mappings for !VM_MAYSHARE
> VMAs would come up is as a result of hwpoison.
>
> So we can avoid keeping additional metadata for what this series is
> trying to accomplish, but if the above isn't acceptable, then I/we can
> try to come up with a scheme that would be acceptable.

Ok, I was thinking we had to deal with other code paths such as page
fault. But, now I understand that is not the case with this design.

> There is also the possibility that the scheme implemented in this
> version of the series is acceptable (i.e., the page_mapcount() API
> difference, which results in slightly modified page migration behavior
> and smaps output, is ok... assuming we have the refcount overflow
> check).

> > I think Peter mentioned it elsewhere, we should come up with a workable
> > scheme for HGM ref/map counting. This can be done somewhat independently.
> FWIW, what makes the most sense to me right now is to implement the
> THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> optimization. We can later come up with a scheme that lets us retain
> compatibility. (Is that what you mean by "this can be done somewhat
> independently", Mike?)

Sort of, I was only saying that getting the ref/map counting right
seems like a task that can be independently worked. Using the THP-like
scheme is good.
-- 
Mike Kravetz
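[For context, a rough model of the "THP-like scheme" discussed above: one mapcount for whole-hugepage mappings, plus one per 4K subpage for high-granularity mappings, with the total being what a page_mapcount()-style API would report. The sketch below is a simplified userspace toy with invented names, not actual kernel code.]

```c
/*
 * Toy model of THP-style mapcounting for a hugetlb page: whole-page
 * (hstate-level) mappings bump a compound mapcount, high-granularity
 * 4K mappings bump per-subpage mapcounts, and the reported total is
 * the sum of both.  Userspace sketch, invented names.
 */
#include <assert.h>

#define NR_SUBPAGES 512                 /* 2M huge page as 512 x 4K subpages */

struct compound_page {
	int compound_mapcount;              /* whole-hugepage mappings */
	int subpage_mapcount[NR_SUBPAGES];  /* per-4K HGM mappings */
};

static void map_huge(struct compound_page *p)
{
	p->compound_mapcount++;
}

static void map_subpage(struct compound_page *p, int idx)
{
	p->subpage_mapcount[idx]++;
}

static int total_mapcount(struct compound_page *p)
{
	int i, total = p->compound_mapcount;

	for (i = 0; i < NR_SUBPAGES; i++)
		total += p->subpage_mapcount[i];
	return total;
}
```

[With this style of accounting, mapping every 4K piece of a 1G hugetlb page contributes on the order of 262,144 count increments per mapping, which is why the refcount overflow check discussed in this thread matters.]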