Re: [PATCH v2 02/18] mm/filemap: Remove dynamically allocated array from filemap_read

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Thu, 5 Nov 2020 04:52:14 +0000

On Wed, Nov 04, 2020 at 04:30:05PM -0500, Kent Overstreet wrote:
> On Wed, Nov 04, 2020 at 08:42:03PM +0000, Matthew Wilcox (Oracle) wrote:
> > Increasing the batch size runs into diminishing returns.  It's probably
> > better to make, eg, three calls to filemap_get_pages() than it is to
> > call into kmalloc().
> 
> I have to disagree. Working with PAGEVEC_SIZE pages is eventually going to be
> like working with 4k pages today, and have you actually read the slub code for
> the kmalloc fast path? It's _really_ fast, there's no atomic operations and it
> doesn't even have to disable preemption - which is why you never see it showing
> up in profiles ever since we switched to slub.

I've been puzzling over this, and trying to run some benchmarks to figure
it out.  My test VM is too noisy though; the error bars are too large to
get solid data.

There are three reasons why I think we hit diminishing returns:

1. Cost of going into the slab allocator (one alloc, one free).
Maybe that's not as high as I think it is.

2. Let's say the per-page overhead of walking i_pages is 10% of the
CPU time for a 128kB I/O with a batch size of 1.  Increasing the batch
size to 15 means we walk the array 3 times instead of 32 times, or 0.7%
of the CPU time -- total reduction in CPU time of 9.3%.  Increasing the
batch size to 32 means we only walk the array once, which cuts it down
from 10% to 0.3% -- reduction in CPU time of 9.7%.

If we are doing 2MB I/Os (and most applications I've looked at recently
only do 128kB), and the 10% remains constant, then the batch-size-15
case walks the tree 35 times instead of 512 times -- 0.6%, whereas the
batch-size-512 case walks the tree once -- 0.02%.  But that only looks
like an overall savings of 9.98% versus 9.4%.  And is an extra 0.6%
saving worth it?
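
A rough sketch of the arithmetic in point 2, for anyone who wants to plug
in other sizes.  It assumes 4kB pages and that the walking cost scales
linearly with the number of walks, starting from the 10% figure at a batch
size of 1.  Both are assumptions, and the function names are made up for
illustration; this is a back-of-the-envelope model, not a measurement.

```python
PAGE_SIZE = 4096  # assumption: 4kB pages


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def walks(io_bytes, batch_size):
    """Number of i_pages walks needed to cover one I/O of io_bytes."""
    pages = ceil_div(io_bytes, PAGE_SIZE)
    return ceil_div(pages, batch_size)


def walk_cost_percent(io_bytes, batch_size, baseline_percent=10.0):
    """CPU share spent walking, scaled linearly from the batch-size-1 baseline."""
    pages = ceil_div(io_bytes, PAGE_SIZE)
    return baseline_percent * walks(io_bytes, batch_size) / pages


if __name__ == "__main__":
    for io_bytes in (128 * 1024, 2 * 1024 * 1024):
        pages = ceil_div(io_bytes, PAGE_SIZE)
        # Compare one page at a time, a pagevec-sized batch, and one big batch.
        for batch in (1, 15, pages):
            cost = walk_cost_percent(io_bytes, batch)
            print(f"{io_bytes // 1024:5d} kB  batch {batch:3d}: "
                  f"{walks(io_bytes, batch):3d} walks, {cost:.2f}% walking, "
                  f"{10.0 - cost:.2f}% saved")
```

Since the number of walks is rounded up, the percentages it prints can
differ a little from the rounded figures quoted in the text.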

3. By the time we're doing such large I/Os, we're surely dominated by
memcpy() and not walking the tree.  Even if the file you're working on
is a terabyte in size, the radix tree is only 5 layers deep.  So that's
five pointer dereferences to find the struct page, and they should stay
in cache (maybe they'd fall out to L2, but surely not as far as L3).
And generally radix tree cachelines stay clean so there shouldn't be any
contention on them from other CPUs unless they're dirtying the pages or
writing them back.
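
The "5 layers" figure in point 3 can be checked with a few lines.  This is
a sketch that assumes 4kB pages and 64 slots per tree node (6 index bits
per level), which I believe is the default node size; the names below are
illustrative only, and a one-page file is counted as a single level for
simplicity.

```python
PAGE_SHIFT = 12   # assumption: 4kB pages
CHUNK_SHIFT = 6   # assumption: 64 slots per node, 6 index bits per level


def tree_depth(file_bytes):
    """Levels needed to index every page of a file of file_bytes bytes."""
    pages = -(-file_bytes // (1 << PAGE_SHIFT))   # round up to whole pages
    max_index = max(pages - 1, 0)                 # highest page index in use
    index_bits = max_index.bit_length()           # bits needed for that index
    # Each level resolves CHUNK_SHIFT bits; always count at least one level.
    return max(1, -(-index_bits // CHUNK_SHIFT))


if __name__ == "__main__":
    for label, size in (("128kB", 128 * 1024),
                        ("16GB", 1 << 34),
                        ("1TB", 1 << 40)):
        print(f"{label:>6}: {tree_depth(size)} level(s)")
```

A 1TB file has 2^28 pages, so the highest index needs 28 bits, and 28 bits
at 6 bits per level rounds up to five levels.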