Re: [RFC PATCH] mm: populate multiple PTEs if file page is large folio

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Tue, 17 Jan 2023 14:46:43 +0000

On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
> On 17.01.23 10:19, Yin, Fengwei wrote:
> > 
> > 
> > On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
> > > On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
> > > > The page fault number can be reduced by batched PTEs population.
> > > > The batch size of PTEs population is not allowed to cross:
> > > >    - page table boundaries
> > > >    - vma range
> > > >    - large folio size
> > > >    - fault_around_bytes
> > > 
> > > I find this patch very interesting.  But is it really worth it?  Most
> > > file-backed page faults are resolved through the ->map_pages() path
> > > which is almost always filemap_map_pages(), which does something
> > > fairly similar to this already.  Do you have any performance numbers?
> > > 
> > I tried the will-it-scale page_fault3:
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> > with 96 processes on a test box with 48C/86T.
> > 
> > The test result got about 3.75X better with 4.1X less page fault number
> > with this patch.
> > 
> > But It's a micro benchmark which shows extreme friendly case to this patch.
> > 
> > I didn't see observed performance gain with other workloads. I suppose
> > shared file write operations may not be common operations? Thanks.
> 
> One question I have after reading "which does something fairly similar to
> this already", if both paths could be unified.

I've been thinking about this already; not so much in terms of "unifying
these two implementations" but rather "What's the right API for mapping
the parts of a folio that fit into userspace in response to a fault".
I haven't got quite as far as drafting code, but I'm thinking there should
be an API where we pass in the vmf, a folio and some other information,
and that function takes care of mapping however many pages from this
folio that it can.

And the reason to split it up like that is to batch as many page
operations into this folio as possible.  eg filemap_map_pages() was
written "to get it working", not "to be efficient", so it does stupid
things like call folio_ref_inc() every time it maps a page instead of
counting how many pages it maps and calling folio_ref_add() at the end.
Similar optimisations should be done for mapcount, which implies some
kind of batched equivalent of page_add_*_rmap().