On 01/18/23 08:39, James Houghton wrote:
> > > > To deal with this, the best solution we've been able to come up with
> > > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > > > page fault case, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > > > want to kill a random process).
> > >
> > > You'd have to also make sure that fork() won't do the same. At least
> > > with uffd-wp, Peter also added page table copying during fork() for
> > > MAP_SHARED mappings, which would have to be handled.
>
> Indeed, thank you! copy_hugetlb_page_range() (and therefore fork() in
> some cases) would need this check too.
> >
> > If we want such a check to make a real difference, IIUC we may want to
> > consider having similar checks in:
> >
> > page_ref_add
> > page_ref_inc
> > page_ref_inc_return
> > page_ref_add_unless
> >
> > But it's unfortunate that almost all the callers of these functions
> > (especially the first two) do not have a retval at all yet. Considering
> > the low likelihood of an overflow so far, maybe that can also be left
> > for later (and I think checking for negative values, as try_get_page()
> > does, would suffice too).
>
> I think as long as we annotate the few cases that userspace can
> exploit to overflow refcount, we should be ok. I think this was the
> same idea with try_get_page(): it is supposed to be used in places
> that userspace could reasonably exploit to overflow refcount.
> >
> > >
> > > Of course, one can just disallow fork() with any HGM right from the
> > > start and keep it all simpler to not open up a can of worms there.
> > >
> > > Is it reasonable to have more than one (or a handful of) VMAs mapping
> > > a huge page via HGM? Restricting it to a single one would make
> > > handling much easier.
> > >
> > > If there is ever demand for more HGM mappings, that whole problem
> > > (and complexity) could be dealt with later. ... but I assume it will
> > > already be a requirement for VMs (e.g., under QEMU) that share memory
> > > with other processes (virtiofsd and friends?)
> >
> > Yes, any form of multi-proc QEMU will need that for supporting HGM
> > postcopy.
>
> +1. Disallowing fork() entirely is quite a restrictive limitation.
>
> [snip]
>
> > > > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > > > THP-like way is probably what most people would want, though it
> > > > would be a real shame to lose the vmemmap optimization.
> > >
> > > Heh, not me :) Having a single mapcount is certainly much cleaner.
> > > ... and if we're dealing with refcount overflows already, mapcount
> > > overflows are not an issue.
> > >
> > >
> > > I wonder if the following crazy idea has already been discussed:
> > > treat the whole mapping as a single large logical mapping. One
> > > reference and one mapping, no matter how the individual parts are
> > > mapped into the assigned page table sub-tree.
> > >
> > > Because for hugetlb with MAP_SHARED, we know that the complete
> > > assigned sub-tree of page tables can only map the given hugetlb page,
> > > no fragments of something else. That's very different to THP in
> > > private mappings ...
> > >
> > > So as soon as the first piece gets mapped, we increment
> > > refcount+mapcount. Other pieces in the same subtree don't do that.
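(Trying to make that concrete -- a rough, untested sketch. "struct
hgm_subtree" and the helper are invented names for illustration, not
existing code, and the caller is assumed to hold whatever lock
serializes the subtree, e.g. the page table lock.)

/*
 * Hypothetical per-subtree bookkeeping hanging off the hstate-level
 * entry: count the mapped pieces, but take the page reference and the
 * mapcount only for the first one.  Unmap would do the reverse.
 */
struct hgm_subtree {
	unsigned int nr_mapped_pieces;
};

static void hgm_account_map_piece(struct page *hpage,
				  struct hgm_subtree *st)
{
	if (st->nr_mapped_pieces++ == 0) {
		/* First piece in this subtree: one ref + one map. */
		get_page(hpage);
		/*
		 * Hand-waving here: real code would go through the rmap
		 * helpers rather than touch _mapcount directly.
		 */
		atomic_inc(&hpage->_mapcount);
	}
}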
> > >
> > > Once the last piece is unmapped (or simpler: once the complete
> > > subtree of page tables is gone), we decrement refcount+mapcount.
> > > Might require some brain power to do this tracking, but I wouldn't
> > > call it impossible right from the start.
> > >
> > > Would such a design violate other design aspects that are important?
>
> This is actually how mapcount was treated in HGM RFC v1 (though not
> refcount); it is doable for both [2].

My apologies for being late to the party :)

When Peter first brought up the issue with ref/map_count overflows I was
thinking that we should use a scheme like David describes above. As
James points out, this was the approach taken in the first RFC.

> One caveat here: if a page is unmapped in small pieces, it is
> difficult to know if the page is legitimately completely unmapped (we
> would have to check all the PTEs in the page table).

Are we allowing unmapping of small (non-huge-page-sized) areas with HGM?
We must be if you are concerned about it. What API would cause this? I
just do not remember this discussion.

> would have to check all the PTEs in the page table). In RFC v1, I
> sidestepped this caveat by saying that "page_mapcount() is incremented
> if the hstate-level PTE is present". A single unmap on the whole
> hugepage will clear the hstate-level PTE, thus decrementing the
> mapcount.
>
> On a related note, there still exists an (albeit minor) API difference
> vs. THPs: a piece of a page that is legitimately unmapped can still
> have a positive page_mapcount().
>
> Given that this approach allows us to retain the hugetlb vmemmap
> optimization (and it wouldn't require a horrible amount of
> complexity), I prefer this approach over the THP-like approach.

Me too.

> >
> > The question is how to maintain the above information.
> > It needs to be per-map (so one page mapped multiple times can be
> > accounted differently), and per-page (so one mapping/vma can contain
> > multiple pages). So far I think that's exactly the pgtable. If we can
> > squeeze information into the pgtable it'll work out, but definitely
> > not trivial. Or we can maintain separate allocations for such
> > information, but that can be extra overhead too.
>
> I don't think we necessarily need to check the page table if we allow
> for the limitations stated above.

When I was thinking about this I was a bit concerned about having
enough information to know exactly when to inc or dec counts. I was
actually worried about knowing when to do the increment. I don't recall
how it was done in the first RFC, but from a high level it would need
to be done when the first hstate-level PTE is allocated/added to the
page table. Right? My concern was with all the places where we could
'error out' after allocating the PTE, but before initializing it. I was
just thinking that we might need to scan the page table or keep
metadata for better or easier accounting.

I think Peter mentioned it elsewhere: we should come up with a workable
scheme for HGM ref/map counting. This can be done somewhat
independently.
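(For the refcount side of such a scheme, I assume the guard James
proposed at the top of the thread would look something like this --
untested sketch, name invented:)

/*
 * Refuse to take another reference once the refcount is halfway to
 * overflow, so that UFFDIO_CONTINUE can fail with -ENOMEM and the
 * fault path with VM_FAULT_SIGBUS instead of wrapping the count.
 * The unlocked check should be tolerable given the huge headroom.
 */
static inline bool hugetlb_hgm_try_get_page(struct page *page)
{
	if (unlikely(page_ref_count(page) > INT_MAX / 2))
		return false;
	get_page(page);
	return true;
}

-- 
Mike Kravetz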