Re: Folio mapcount

On Tue, Feb 07, 2023 at 05:39:07PM -0500, Peter Xu wrote:
> On Mon, Feb 06, 2023 at 08:34:31PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > > Once we get to the part of the folio journey where we have 
> > > one-pointer-per-page, we can't afford to maintain per-page state.
> > > Currently we maintain a per-page mapcount, and that will have to go. 
> > > We can maintain extra state for a multi-page folio, but it has to be a
> > > constant amount of extra state no matter how many pages are in the folio.
> > > 
> > > My proposal is that we maintain a single mapcount per folio, and its
> > > definition is the number of (vma, page table) tuples which have a
> > > reference to any pages in this folio.
> > 
> > I've been thinking about this a lot more, and I have changed my
> > mind.  It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?"  That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
> 
> I'm curious whether it is still fine in rare cases - IMHO the question is
> how badly it goes wrong when the mapcount should be exactly 1 (the folio is
> privately owned by a single vma) but we report 2.
> 
> In this MADV_COLD/MADV_PAGEOUT case we'll skip making some pages cold or
> paging them out even though we could, but is that a deal breaker (assuming
> the benefit of the change can be shown to be worthwhile)?  Especially since
> this only happens when a folio is mapped unaligned.
> 
> Are unaligned mappings of a folio common?  Are there any other use cases
> that can go worse than this one?

For file pages, I think it can go wrong rather more often than we might
like.  I think for anon memory, we'll tend to allocate it to be aligned,
and then it takes some weirdness like mremap() to make it unaligned.

But I'm just waving my hands wildly.  I don't really know.
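
To be concrete about what we'd be giving up in that case: the test in
madvise_cold_or_pageout_pte_range() today is per-page, and (heavily
simplified, from memory rather than the exact code) amounts to:

	/*
	 * "Does anybody else have this mapped?"  If this became a
	 * folio-level test with the (vma, page table) definition, a folio
	 * straddling two page tables in one VMA would report 2 and be
	 * skipped here even though nobody else maps it.
	 */
	if (page_mapcount(page) != 1)
		continue;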

> (E.g., IIUC superfluous but occasional CoW seems fine)
> 
> OTOH...
> 
> > 
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out.  Not fun times.
> > 
> > (*) If we support folios larger than PMD size, then the answer is more
> > complex.
> > 
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
> > 
> > That means that our future folio_add_file_rmap_range() looks a bit
> > like this:
> > 
> > {
> > 	bool add_mapcount = true;
> > 
> > 	if (nr < folio_nr_pages(folio))
> > 		add_mapcount = !folio_has_ptes(folio, vma);
> > 	if (add_mapcount)
> > 		atomic_inc(&folio->_mapcount);
> > 
> > 	__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> > 	if (nr == HPAGE_PMD_NR)
> > 		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> > 			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
> > 
> > 	mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> > }
> > 
> > bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> > {
> > 	unsigned long address = vma_address(&folio->page, vma);
> > 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> > 
> > 	if (!page_vma_mapped_walk(&pvmw))
> > 		return false;
> > 	page_vma_mapped_walk_done(&pvmw);
> > 	return true;
> > }
> > 
> > ... some details to be fixed here; particularly this will currently
> > deadlock on the PTL, so we'd need not only to exclude the current
> > PMD from being examined, but also avoid a deadly embrace between
> > two threads (do we currently have a locking order defined for
> > page table locks at the same height of the tree?)
> 
> ... it starts to sound scary if it needs to take >1 pgtable locks.

I've been thinking about this one, and I wonder if we can do it
without taking any pgtable locks.  The locking environment we're in
is the page fault handler, so we have the mmap_lock for read (for now
anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
those entries can't disappear under us.  They also can't appear under
us.  We hold the PTL on one PMD, but not necessarily on any other PMD
we examine.

I appreciate that PTEs can _change_ under us if we do not hold the PTL,
but by virtue of holding the folio lock, they can't change from or to
our PFNs.  I also think the PMD table cannot disappear under us
since we're holding the mmap_lock for read, and anyone removing page
tables has to take the mmap_lock for write.

Am I missing anything important?
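
If not, the check could be something like the below.  Very much a sketch
(folio_mapped_in_vma_lockless() is a name I just made up, and it punts on
PMD mappings, walks from the top of the tree for every address, and skips
the usual edge cases), but it shows the shape of what I'm imagining:

/*
 * Hypothetical sketch, not real code: is any page of @folio mapped by a
 * PTE in @vma?  Assumes the caller holds the folio lock (so mappings of
 * _this_ folio can neither appear nor disappear) and mmap_lock for read
 * (so the page tables themselves cannot be freed).  Other PTEs may still
 * change under us; we only care whether one points into our folio.
 */
static bool folio_mapped_in_vma_lockless(struct folio *folio,
		struct vm_area_struct *vma)
{
	unsigned long first_pfn = folio_pfn(folio);
	unsigned long last_pfn = first_pfn + folio_nr_pages(folio) - 1;
	unsigned long addr = vma_address(&folio->page, vma);
	unsigned long end;

	if (addr == -EFAULT)
		return false;
	end = min(addr + folio_size(folio), vma->vm_end);

	for (; addr < end; addr += PAGE_SIZE) {
		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
		p4d_t *p4d;
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;
		pte_t entry;

		if (pgd_none_or_clear_bad(pgd))
			continue;
		p4d = p4d_offset(pgd, addr);
		if (p4d_none_or_clear_bad(p4d))
			continue;
		pud = pud_offset(p4d, addr);
		if (pud_none_or_clear_bad(pud))
			continue;
		pmd = pmd_offset(pud, addr);
		/* Punt on empty and PMD-mapped cases in this sketch */
		if (pmd_none(*pmd) || pmd_trans_huge(*pmd))
			continue;
		pte = pte_offset_map(pmd, addr);
		entry = ptep_get(pte);
		pte_unmap(pte);
		if (pte_present(entry) && pte_pfn(entry) >= first_pfn &&
		    pte_pfn(entry) <= last_pfn)
			return true;
	}

	return false;
}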



