On Tue, Jan 24, 2023 at 10:13 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
>
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
>
> Examples
> --------
>
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1. After taking a page fault on it in the other page table, its
> mapcount increases to 2.
>
> For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it still has a reference from only one page table (a different
> page table, but still just one).
>
> Implementation sketch
> ---------------------
>
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table. We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each. We should have a:
>
>         do_set_pte_range(vmf, folio, addr, first_page, n);
>
> and then change the API of page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page. That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
>
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn. We'll want a function like
> page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
>
> Concerns
> --------
>
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that). But does this ever happen? I think
> we always unmap the entire folio, like in try_to_unmap_one().

Off the top of my head, MADV_DONTNEED can unmap a folio partially while
leaving it unsplit until some later point (for example, under memory
pressure). munmap() should be able to unmap a folio partially as well.

>
> I haven't got my head around SetPageAnonExclusive() yet. I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.
>
> Notes
> -----
>
> [1] Ignoring the bias by -1 to let us detect transitions that we care
> about more efficiently; I'm talking about the value returned from
> page_mapcount(), not the value stored in page->_mapcount.
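
Also, to check that I'm reading the implementation sketch right, the
fault side would look something like this. Purely illustrative:
do_set_pte_range() is the helper named above, and the (folio, first, n)
form of page_add_file_rmap() is the proposed API change, so none of
these signatures exist in the tree today and the exact arguments are my
guess:

/*
 * Sketch only: map 'n' consecutive pages of 'folio', starting at
 * 'first_page', into one page table under one VMA.  The rmap call is
 * made once for the whole batch, so the folio mapcount goes up by
 * exactly 1 for this (vma, page table) tuple no matter how many PTEs
 * we install.
 */
static void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
                             unsigned long addr, struct page *first_page,
                             unsigned int n)
{
        struct vm_area_struct *vma = vmf->vma;
        struct page *page = first_page;
        pte_t *pte = vmf->pte;
        unsigned int i;

        /* Proposed (folio, first, n) form -- one call per (vma, pt) tuple. */
        page_add_file_rmap(folio, first_page, n, vma);

        for (i = 0; i < n; i++, page++, pte++, addr += PAGE_SIZE) {
                pte_t entry = mk_pte(page, vma->vm_page_prot);

                if (vmf->flags & FAULT_FLAG_WRITE)
                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);

                set_pte_at(vma->vm_mm, addr, pte, entry);
                update_mmu_cache(vma, addr, pte);
        }
}

filemap_map_pages() would then make one such call per folio per page
table instead of calling do_set_pte() once per page.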
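
And on the unmap side, inside try_to_unmap_one(), I imagine the loop
becoming roughly the fragment below.
page_vma_mapped_walk_skip_to_end_of_ptable() is the name floated above
and doesn't exist yet, and the batched remove-rmap call in the comment
is only my guess at the counterpart of the add side:

        while (page_vma_mapped_walk(&pvmw)) {
                /*
                 * Zap every PTE this folio has in the current page
                 * table (today we'd be called back once per pfn), then
                 * drop the folio mapcount exactly once for this
                 * (vma, page table) tuple, mirroring the single
                 * increment made at fault time, e.g. something like:
                 *
                 *      page_remove_rmap(folio, first_page, nr, vma);
                 *
                 * where first_page/nr describe the pages of the folio
                 * that sit in this page table.
                 */

                /*
                 * Then persuade the walk not to hand us the remaining
                 * pfns of this folio one by one; jump ahead to the
                 * next page table (or the end of the folio).
                 */
                page_vma_mapped_walk_skip_to_end_of_ptable(&pvmw);
        }

If partial unmaps like the MADV_DONTNEED case above do have to be
supported, this loop is also where the "did the whole (vma, pt) tuple
really go away?" question would have to be answered.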