On Thu, Mar 09, 2023 at 10:05:12AM -0800, James Houghton wrote: > On Wed, Mar 8, 2023 at 2:10 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote: > > > HugeTLB pages may soon support being mapped with PTEs. To allow for this > > > case, merge HugeTLB's mapcount scheme with THP's. > > > > > > The first patch of this series comes from the HugeTLB high-granularity > > > mapping series[1], though with some updates, as the original version > > > was buggy[2] and incomplete. > > > > > > I am sending this change as part of this smaller series in hopes that it > > > can be more thoroughly scrutinized. > > > > > > I haven't run any THP performance tests with this series applied. > > > HugeTLB pages don't currently support being mapped with > > > `compound=false`, but this mapcount scheme will make collapsing > > > compound=false mappings in HugeTLB pages quite slow. This can be > > > optimized with future patches (likely by taking advantage of HugeTLB's > > > alignment guarantees). > > > > > > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid > > > the use of each subpage's mapcount. If this series is applied, Matthew's > > > new scheme will automatically apply to HugeTLB pages. > > > > Is this the plan? > > > > I may have not followed closely on the latest development of Matthew's > > idea. The thing is if the design requires ptes being installed / removed > > at the same time for the whole folio, then it may not work directly for HGM > > if HGM wants to support at least postcopy, iiuc, because if we install the > > whole folio ptes at the same time it seems to beat the whole purpose of > > having HGM.. > > My understanding is that it doesn't *require* all the PTEs in a folio > to be mapped at the same time. I don't see how it possibly could, > given that UFFDIO_CONTINUE exists (which can already create PTE-mapped > THPs today). It would be faster to populate all the PTEs at the same > time (you would only need to traverse the page table once for the > entire group to see if you should be incrementing mapcount). > > Though, with respect to unmapping, if PTEs aren't all unmapped at the > same time, then you could end up with a case where mapcount is still > incremented but nothing is really mapped. I'm not really sure what > should be done there, but this problem applies to PTE-mapped THPs the > same way that it applies to HGMed HugeTLB pages. > > > The patch (especially patch 1) looks good. So it's a pure question just to > > make sure we're on the same page. IIUC your other mapcount proposal may > > work, but it still needs to be able to take care of ptes in less-than-folio > > sizes whatever it'll look like at last. > > By my "other mapcount proposal", I assume you mean the "using the > PAGE_SPECIAL bit to track if mapcount has been incremented or not". It > really only serves as an optimization for Matthew's scheme (see below > [2] for some more thoughts), and it doesn't have to only apply to > HugeTLB. > > I originally thought[1] that Matthew's scheme would be really painful > for postcopy for HGM without this optimization, but it's actually not > so bad. Let's assume the worst case, that we're UFFDIO_CONTINUEing > from the end to the beginning, like in [1]: > > First CONTINUE: pvmw finds an empty PUD, so quickly returns false. > Second CONTINUE: pvmw finds 511 empty PMDs, then finds 511 empty PTEs, > then finds a present PTE (from the first CONTINUE). > Third CONTINUE: pvmw finds 511 empty PMDs, then finds 510 empty PTEs. > ... > 514th CONTINUE: pvmw finds 510 empty PMDs, then finds 511 empty PTEs. > > So it'll be slow, but it won't have to check 262k empty PTEs per > CONTINUE (though you could make this possible with MADV_DONTNEED). > Even with an HGM implementation that only allows PTE-mapping of > HugeTLB pages, it should still behave just like this, too. > > > A trivial comment on patch 2 since we're at it: does "a future plan on some > > arch to support 512GB huge page" justify itself? It would be better > > justified, IMHO, when that support is added (and decided to use HGM)? > > That's fine with me. I'm happy to drop that patch. > > > What I feel like is missing (rather than patch 2 itself) is some guard to > > make sure thp mapcountings will not be abused with new hugetlb sizes > > coming. > > > > How about another BUG_ON() squashed into patch 1 (probably somewhere in > > page_add_file|anon_rmap()) to make sure folio_size() is always smaller than > > COMPOUND_MAPPED / 2)? > > Sure, I can add that. > > Thanks, Peter! > > - James > > [1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@xxxxxxxxxxxxxx/ > > [2]: Some details on what the optimization might look like: > > So an excerpt of Matthew's scheme would look something like this: > > /* if we're mapping < folio_nr_pages(folio) worth of PTEs. */ > if (!folio_has_ptes(folio, vma)) > atomic_inc(folio->_mapcount); > > where folio_has_ptes() is defined like: > > if (!page_vma_mapped_walk(...)) > return false; > page_vma_mapped_walk_done(...); > return true; > > You might be able to optimize folio_has_ptes() with a block like this > at the beginning: > > if (folio_is_naturally_aligned(folio, vma)) { > /* optimization for naturally-aligned folios. */ > if (folio_test_hugetlb(folio)) { > /* check hstate-level PTE, and do a similar check as below. */ > } > /* for naturally-aligned THPs: */ > pmdp = mm_find_pmd(...); /* or just pass it in. */ > pmd = READ_ONCE(*pmdp); > BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd)); > if (pmd_special(pmd)) > return true; > /* we already hold the PTL for the PTE. */ > ptl = pmd_lock(mm, pmdp); > /* test and set pmd_special */ > pmd_unlock(ptl) > return if_we_set_pmd_special; > } > > (pmd_special() doesn't currently exist.) If HugeTLB walking code can > be merged with generic mm, then HugeTLB wouldn't have a special case > at all here. I see what you mean now, thanks. That looks fine. I just suspect the pte_special trick will still be needed if this will start to apply to HGM, as it seems to not suite perfectly with a large folio size, still. The MADV_DONTNEED worst case of having it loop over ~folio_size() times of none pte is still possible. -- Peter Xu