On Fri, Jul 15, 2022 at 10:20 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> > On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > > The changes here are very similar to the changes made to
> > > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > > do accounting slightly differently because we are mapping only a piece
> > > > of a page.
> > > >
> > > > Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
> > > > ---
> > > >  fs/userfaultfd.c        |  3 +++
> > > >  include/linux/hugetlb.h |  6 +++--
> > > >  mm/hugetlb.c            | 54 +++++++++++++++++++++-----------------
> > > >  mm/userfaultfd.c        | 57 +++++++++++++++++++++++++++++++----------
> > > >  4 files changed, 82 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index e943370107d0..77c1b8a7d0b9 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > > >  	if (!ptep)
> > > >  		goto out;
> > > >
> > > > +	if (hugetlb_hgm_enabled(vma))
> > > > +		goto out;
> > > > +
> > >
> > > This is weird.  It means we'll never wait for sub-page mapping enabled
> > > vmas.  Why?
> >
> > `ret` is true in this case, so we're actually *always* waiting.
>
> Aha!  Then I think that's another problem, sorry. :)  See below.
>
> > > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> > >
> > > I'd expect in the in-house postcopy tests you should see vcpu threads
> > > spinning on the page faults until it's serviced.
> > >
> > > IMO we still need to properly wait when the pgtable doesn't have the
> > > faulted address covered.
> > > For sub-page mapping it'll probably need to walk
> > > into sub-page levels.
> >
> > Ok, SGTM.  I'll do that for the next version.  I'm not sure of the
> > consequences of returning `true` here when we should be returning
> > `false`.
>
> We've put ourselves onto the wait queue, if another concurrent
> UFFDIO_CONTINUE happened and pte is already installed, I think this thread
> could be waiting forever on the next schedule().
>
> The solution should be the same - walking the sub-page pgtable would work,
> afaict.
>
> [...]
>
> > > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > >  	 * registered, we firstly wr-protect a none pte which has no page cache
> > > >  	 * page backing it, then access the page.
> > > >  	 */
> > > > -	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > > +	if (!hugetlb_pte_none_mostly(dst_hpte))
> > > >  		goto out_release_unlock;
> > > >
> > > > -	if (vm_shared) {
> > > > -		page_dup_file_rmap(page, true);
> > > > -	} else {
> > > > -		ClearHPageRestoreReserve(page);
> > > > -		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > > +	if (new_mapping) {
> > >
> > > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > > that was going to be mapped.
> > >
> > > Is it a must we get this only from the caller?  Can we know we're doing
> > > sub-page mapping already here and make a decision with e.g. dst_hpte?
> > >
> > > It looks weird to me to pass this explicitly from the caller, especially
> > > that's when we don't really have the pgtable lock so I'm wondering about
> > > possible race conditions too on having stale new_mapping values.
> >
> > The only way to know what the correct value for `new_mapping` should
> > be is to know if we had to change the hstate-level P*D to non-none to
> > service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> > to do that check in `hugetlb_mcopy_atomic_pte`.
> > Right now there is no
>
> Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or
> something alike)?

This works in the hugetlb_fault case, because in the hugetlb_fault
case, we install the largest PTE possible. If we are mapping a page
for the first time, we will use an hstate-sized PTE. But for
UFFDIO_CONTINUE, we may be installing a 4K PTE as the first PTE for
the whole hpage.

> > race, because we synchronize on the per-hpage mutex.
>
> Yeah not familiar with that mutex enough to tell, as long as that mutex
> guarantees no pgtable update (hmm, then why we need the pgtable lock
> here???) then it looks fine.

Let me take a closer look at this. I'll have a more detailed
explanation for the next version of the RFC.

> [...]
>
> > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > >  	copied = 0;
> > > >  	page = NULL;
> > > >  	vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > +	if (use_hgm)
> > > > +		vma_altpagesize = PAGE_SIZE;
> > >
> > > Do we need to check the "len" to know whether we should use sub-page
> > > mapping or original hpage size?  E.g. any old UFFDIO_CONTINUE code will
> > > still want the old behavior I think.
> >
> > I think that's a fair point; however, if we enable HGM and the address
> > and len happen to be hstate-aligned
>
> The address can, but len (note! not "end" here) cannot?

They both (dst_start and len) need to be hpage-aligned, otherwise we
won't be able to install hstate-sized PTEs. Like if we're installing
4K at the beginning of a 1G hpage, we can't install a PUD, because we
only want to install that 4K.

> > , we basically do the same thing as
> > if HGM wasn't enabled. It could be a minor performance optimization to
> > do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> > the page tables are set up, the end result would be the same.
>
> Thanks,

Thanks!

> --
> Peter Xu