On 01/19/23 11:42, James Houghton wrote: > On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote: > > On 01/19/23 08:57, James Houghton wrote: > > > > OK, so we do not actually create HGM mappings until a uffd operation is > > done at a less than huge page size granularity. MADV_SPLIT just says > > that HGM mappings are 'possible' for this vma. Hopefully, my understanding > > is correct. > > Right, that's the current meaning of MADV_SPLIT for hugetlb. > > > I was concerned about things like the page fault path, but in that case > > we have already 'entered HGM mode' via a uffd operation. > > > > Both David and Peter have asked whether eliminating intermediate mapping > > levels would be a simplification. I trust your response that it would > > not help much in the current design/implementation. But, it did get me > > thinking about something else. > > > > Perhaps we have discussed this before, and perhaps it does not meet all > > user needs, but one way possibly simplify this is: > > > > - 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate) > > granularity. > > - MADV_SPLIT implicitly unmaps everything with in the range. > > - MADV_SPLIT says all mappings for this vma will now be done at a base > > (4K) page size granularity. vma would be marked some way. > > - I think this eliminates the need for hugetlb_pte's as we KNOW the > > mapping size. > > - We still use huge pages to back 4K mappings, and we still have to deal > > with the ref/map_count issues. > > - Code touching hugetlb page tables would KNOW the mapping size up front. > > > > Again, apologies if we talked about and previously dismissed this type > > of approach. > > I think Peter was the one who originally suggested an approach like > this, and it meets my needs. However, I still think the way that > things are currently implemented is the right way to go. > > Assuming we want decent performance, we can't get away with the same > strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE > should be based on the PMD above the PTE, so we need to either pass > around the PMD above our PTE, or we need to pass around the PTL. This > is something that hugetlb_pte does for us, so, in some sense, even > going with this simpler approach, we still need a hugetlb_pte-like > construct. Agree there is this performance hit. However, the 'simplest' approach would be to just use the page table lock as is done by default for 4K PTEs. I do not know much about the (primary) live migration use case. My guess is that page table lock contention may be an issue? In this use case, HGM is only enabled for the duration the live migration operation, then a MADV_COLLAPSE is performed. If contention is likely to be an issue during this time, then yes we would need to pass around with something like hugetlb_pte. > Although most of the other complexity that HGM introduces would have > to be introduced either way (like having to deal with putting > compound_head()/page_folio() in more places and doing some > per-architecture updates), there are some complexities that the > simpler approach avoids: > > - We avoid problems related to compound PTEs (the problem being: two > threads racing to populate a contiguous and non-contiguous PTE that > take up the same space could lead to user-detectable incorrect > behavior. This isn't hard to fix; it will be when I send the arm64 > patches up.) > > - We don't need to check if PTEs get split from under us in PT walks. > (In a lot of cases, the appropriate action is just to treat the PTE as > if it were pte_none().) In the same vein, we don't need > hugetlb_pte_present_leaf() at all, because PTEs we find will always be > leaves. > > - We don't have to deal with sorting hstates or implementing > for_each_hgm_shift()/hugetlb_alloc_largest_pte(). > > None of these complexities are particularly major in my opinion. Perhaps not. I was just thinking about the overall complexity of the hugetlb code after HGM. Currently, it is 'relatively simple' with fixed huge page sizes. IMO, much simpler than THP with two possible mapping sizes. With HGM and intermediate mapping sizes, it seems things could get more complicated than THP. Perhaps it is just me. I am just too familiar with the current code and a bit anxious about added complexity. But, I felt the same about vmemmap optimizations. :) > This might seem kind of contrived, but let's say you have a VM with 1T > of memory, and you find 100 memory errors all in different 1G pages > over the life of this VM (years, potentially). Having 10% of your > memory be 4K-mapped is definitely worse than having 10% be 2M-mapped > (lost performance and increased memory overhead). There might be other > cases in the future where being able to have intermediate mapping > sizes could be helpful. That may be a bit contrived. We know memory error handling is a future use case, but I believe there is work outside of HGM than needs to be done to handle such situations. For example, HGM will allow the 1G mapping to isolate the 4K page with error. This prevents errors if you fault almost anywhere within the 1G page. But, there still remains the possibility of accessing that 4K page page with error. IMO, it will require user space/application intervention to address this as only the application knows about the potentially lost data. This is still something that needs to be designed. It would then makes sense for the application to also determine how it wants to proceed WRT mapping the 1G area. Perhaps they will want (and there will exist a mechanism) to migrate the data to a new 1G page without error. > > > > I think Peter mentioned it elsewhere, we should come up with a workable > > > > scheme for HGM ref/map counting. This can be done somewhat independently. > > > > > > FWIW, what makes the most sense to me right now is to implement the > > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap > > > optimization. We can later come up with a scheme that lets us retain > > > compatibility. (Is that what you mean by "this can be done somewhat > > > independently", Mike?) > > > > Sort of, I was only saying that getting the ref/map counting right seems > > like a task than can be independently worked. Using the THP-like scheme > > is good. > > Ok! So if you're ok with the intermediate mapping sizes, it sounds > like I should go ahead and implement the THP-like scheme. Yes, I am OK with it. Just expressed a bit of concern above. -- Mike Kravetz