On 12/21/22 19:02, James Houghton wrote: > On Wed, Dec 21, 2022 at 5:32 PM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote: > > > > On 12/21/22 17:10, Peter Xu wrote: > > > On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote: > > > > On 12/21/22 15:21, James Houghton wrote: > > > > > Thanks for bringing this up, Peter. I think the main reason was: > > > > > having separate UFFD_FEATUREs clearly indicates to userspace what is > > > > > and is not supported. > > > > > > > > IIRC, I think we wanted to initially limit the usage to the very > > > > specific use case (live migration). The idea is that we could then > > > > expand usage as more use cases came to light. > > > > > > > > Another good thing is that userfaultfd has versioning built into the > > > > API. Thus a user can determine if HGM is enabled in their running > > > > kernel. > > > > > > I don't worry much on this one, afaiu if we have any way to enable hgm then > > > the user can just try enabling it on a test vma, just like when an app > > > wants to detect whether a new madvise() is present on the current host OS. > > That would be enough to test if HGM was merely present, but if > specific features like 4K UFFDIO_CONTINUEs or 4K UFFDIO_WRITEPROTECTs > were available. You could always check these by making a HugeTLB VMA > and setting it up correctly for userfaultfd/etc., but that's a little > messy. > > > > > > > Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm > > > would work too. > > I'm not opposed to this. > > > > > > > > > > > > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller > > > > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't > > > > > allowed as of this patch series, but it could be allowed in the > > > > > future. To add support in the same way as this series, we would add > > > > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that > > > > > having to add another feature isn't great; is this what you're > > > > > concerned about? > > > > > > > > > > Considering MADV_ENABLE_HUGETLB... > > > > > 1. If a user provides this, then the contract becomes: "the kernel may > > > > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at > > > > > high-granularities, provided the support exists", but it becomes > > > > > unclear to userspace to know what's supported and what isn't. > > > > > 2. We would then need to keep track if a user explicitly enabled it, > > > > > or if it got enabled automatically in response to memory poison, for > > > > > example. Not a big problem, just a complication. (Otherwise, if HGM > > > > > got enabled for poison, suddenly userspace would be allowed to do > > > > > things it wasn't allowed to do before.) > > > > > > We could alternatively have two flags for each vma: (a) hgm_advised and (b) > > > hgm_enabled. (a) always sets (b) but not vice versa. We can limit poison > > > to set (b) only. For this patchset, it can be all about (a). > > My thoughts exactly. :) > > > > > > > > > 3. This API makes sense for enabling HGM for something outside of > > > > > userfaultfd, like MADV_DONTNEED. > > > > > > > > I think #3 is key here. Once we start applying HGM to things outside > > > > userfaultfd, then more thought will be required on APIs. The API is > > > > somewhat limited by design until the basic functionality is in place. > > > > > > Mike, could you elaborate what's the major concern of having hgm used > > > outside uffd and live migration use cases? > > > > > > I feel like I miss something here. I can understand we want to limit the > > > usage only when the user specifies using hgm because we want to keep the > > > old behavior intact. However if we want another way to enable hgm it'll > > > still need one knob anyway even outside uffd, and I thought that'll service > > > the same purpose, or maybe not? > > > > I am not opposed to using hgm outside the use cases targeted by this series. > > > > It seems that when we were previously discussing the API we spent a bunch of > > time going around in circles trying to get the API correct. That is expected > > as it is more difficult to take all users/uses/abuses of the API into account. > > > > Since the initial use case was fairly limited, it seemed like a good idea to > > limit the API to userfaultfd. In this way we could focus on the underlying > > code/implementation and then expand as needed. Of course, with an eye on > > anything that may be a limiting factor in the future. > > > > I was not aware of the uffd-wp use case, and am more than happy to discuss > > expanding the API. > > So considering two API choices: > > 1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for > UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for > UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start > allowing high-granularity choices (not sure if this is bad; we started > allowing it for HugeTLB recently with no other API change, AFAIA). I don't think we can just start allowing HGM for MADV_DONTNEED without some type of user interaction/request. Otherwise, a user that passes in non-hugetlb page size requests may get unexpected results. And, one of the threads about MADV_DONTNEED points out a valid use cases where the caller may not know the mapping is hugetlb or not and is likely to pass in non-hugetlb page size requests. > 2. MADV_ENABLE_HGM or something similar. The changes to > UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically, > provided they are implemented. > > I don't mind one way or the other. Peter, I assume you prefer #2. > Mike, what about you? If we decide on something other than #1, I'll > make the change before sending v1 out. Since I do not believe 1) is an option, MADV_ENABLE_HGM might be the way to go. Any thoughts about MADV_ENABLE_HGM? I'm thinking: - Make it have same restrictions as other madvise hugetlb calls, . addr must be huge page aligned . length is rounded down to a multiple of huge page size - We split the vma as required - Flags carrying HGM state reside in the hugetlb_shared_vma_data struct -- Mike Kravetz