On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> James,
>
> On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > Here is the result: [1] (sorry it took a little while heh). The
>
> Thanks. From what I can tell, that number shows that it'll be great we
> start with your rfcv1 mapcount approach, which mimics what's proposed by
> Matthew for generic folio.

Do you think the RFC v1 way is better than doing the THP-like way
*with the additional MMU notifier*?

>
> > > implementation of the "RFC v1" way is pretty horrible[2] (and this
>
> Any more information on why it's horrible? :)

I figured the code would speak for itself, heh. It's quite
complicated. I really didn't like:
1. The 'inc' business in copy_hugetlb_page_range.
2. How/where I call put_page()/folio_put() to keep the refcount and
mapcount synced up.
3. Having to check the page cache in UFFDIO_CONTINUE.

>
> A quick comment is I'm wondering whether that "whether we should boost the
> mapcount" value can be hidden in hugetlb_pte* so you don't need to pass
> over a lot of bool* deep into the hgm walk routines.

Oh yeah, that's a great idea. (Rough sketch of what I mean at the
bottom of this mail.)

>
> > > implementation probably has bugs anyway; it doesn't account for the
> > > folio_referenced() problem).
>
> I thought we reached a consensus on the resolution, by a proposal to remove
> folio_referenced_arg.mapcount. Is it not working for some reason?

I think that works, I just didn't bother here. I just wanted to show
you approximately what it would look like to implement the RFC v1
approach.

>
> >
> > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > I haven't figured out how we can apply Matthew's approach to HGM
> > > right now, but there probably is a way. (If we left the mapcount
> > > increment bits in the same place, we couldn't just check the
> > > hstate-level PTE; it would have already been made present.)
>
> I'm just worried that (1) this may add yet another dependency to your work
> which is still during discussion phase, and (2) whether the folio approach
> is easily applicable here, e.g., we may not want to populate all the ptes
> for hugetlb HGMs by default.

That's true. I definitely don't want to wait for this either. It seems
like Matthew's approach won't work very well for us -- when doing a
lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
the PTEs to see if any of them are mapped would get really slow.

>
> >
> > > We could:
> > > - use the THP-like way and tolerate ~1 second collapses
> >
> > Another thought here. We don't necessarily *need* to collapse the page
> > table mappings in between mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > we aren't punching any holes, and we aren't changing permission bits.
> > If we had an MMU notifier that simply informed KVM that we collapsed
> > the page tables *after* we finished collapsing, then it would be ok
> > for hugetlb_collapse() to be slow.
>
> That's a great point! It'll definitely apply to either approach.
> >
> > If this MMU notifier is something that makes sense, it probably
> > applies to MADV_COLLAPSE for THPs as well.
>
> THPs are definitely different, mmu notifiers should be required there,
> afaict. Isn't that what the current code does?
>
> See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.

Oh, yes, of course, MADV_COLLAPSE can actually move things around and
properly make THPs. Thanks.
But it would apply if we were only collapsing PTE-mapped THPs, I think?

- James
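
P.S. To make the hugetlb_pte idea above concrete, here is roughly what
I'm imagining -- the 'boost_mapcount' field and the helper are names I
just made up, and the existing fields are from memory, so don't hold
me to the details:

struct hugetlb_pte {
	pte_t *ptep;
	unsigned int shift;
	enum hugetlb_level level;
	spinlock_t *ptl;
	/*
	 * Set by the walk when it had to allocate a new PTE, i.e.,
	 * when the caller is responsible for boosting the mapcount.
	 */
	bool boost_mapcount;
};

static inline bool hugetlb_pte_test_clear_boost(struct hugetlb_pte *hpte)
{
	bool boost = hpte->boost_mapcount;

	hpte->boost_mapcount = false;
	return boost;
}

Whichever walk routine does the allocation would set the flag, and the
caller (e.g., the UFFDIO_CONTINUE path) would call
hugetlb_pte_test_clear_boost() when deciding whether to take the extra
mapcount, instead of having a bool* threaded through every HGM walk
routine.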
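
P.P.S. And for the post-collapse notifier: since hugetlb_collapse()
doesn't change any pfns or permission bits, I'm thinking of a one-shot
callback that fires after the page tables have been collapsed, instead
of wrapping the whole collapse in an
invalidate_range_start()/invalidate_range_end() pair. The callback
name is invented, just to illustrate:

	/*
	 * Hypothetical addition to struct mmu_notifier_ops:
	 *
	 * Called after the primary MMU has collapsed its page tables
	 * for [start, end). No pfns or permission bits changed, so
	 * the secondary MMU's mappings are still valid; this only
	 * tells it that it may now remap the range at a larger
	 * granularity.
	 */
	void (*collapse)(struct mmu_notifier *subscription,
			 struct mm_struct *mm,
			 unsigned long start, unsigned long end);

Then hugetlb_collapse() could do its (possibly slow) walk without
holding up the secondary MMU, call the notifier once at the end, and
KVM could tear down and rebuild its mappings for the range whenever it
gets around to it.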