On Wed, Nov 04, 2020 at 04:30:05PM -0500, Kent Overstreet wrote:
> On Wed, Nov 04, 2020 at 08:42:03PM +0000, Matthew Wilcox (Oracle) wrote:
> > Increasing the batch size runs into diminishing returns.  It's probably
> > better to make, eg, three calls to filemap_get_pages() than it is to
> > call into kmalloc().
>
> I have to disagree. Working with PAGEVEC_SIZE pages is eventually going to be
> like working with 4k pages today, and have you actually read the slub code for
> the kmalloc fast path? It's _really_ fast, there's no atomic operations and it
> doesn't even have to disable preemption - which is why you never see it showing
> up in profiles ever since we switched to slub.

I've been puzzling over this, and trying to run some benchmarks to figure
it out.  My test VM is too noisy though; the error bars are too large to
get solid data.

There are three reasons why I think we hit diminishing returns:

1. Cost of going into the slab allocator (one alloc, one free).
   Maybe that's not as high as I think it is.

2. Let's say the per-page overhead of walking i_pages is 10% of the CPU
   time for a 128kB I/O with a batch size of 1.  Increasing the batch size
   to 15 means we walk the array 3 times instead of 32 times, or 0.7%
   of the CPU time -- total reduction in CPU time of 9.3%.  Increasing the
   batch size to 32 means we only walk the array once, which cuts it down
   from 10% to 0.3% -- reduction in CPU time of 9.7%.

   If we are doing 2MB I/Os (and most applications I've looked at recently
   only do 128kB), and the 10% remains constant, then the batch-size-15
   case walks the tree 17 times instead of 512 times -- 0.6%, whereas the
   batch-size-512 case walks the tree once -- 0.02%.  But that only looks
   like an overall savings of 9.98% versus 9.4%.  And is an extra 0.6%
   saving worth it?

3. By the time we're doing such large I/Os, we're surely dominated by
   memcpy() and not walking the tree.  Even if the file you're working on
   is a terabyte in size, the radix tree is only 5 layers deep.  So that's
   five pointer dereferences to find the struct page, and they should
   stay in cache (maybe they'd fall out to L2, but surely not as far
   as L3).  And generally radix tree cachelines stay clean so there
   shouldn't be any contention on them from other CPUs unless they're
   dirtying the pages or writing them back.
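
For anyone who wants to play with the numbers in points 2 and 3, here is a
rough back-of-the-envelope sketch in Python.  It is only a model, not a
measurement: it assumes 4kB pages, assumes the walk overhead scales linearly
with the number of walks (rounded up to whole batches), and assumes 64 slots
(6 index bits) per tree node.  The function names are made up for
illustration.  Under that model the results come out close to, but not
identical with, the rough figures above (for instance 35 walks rather than 17
for a 2MB I/O at batch size 15); the diminishing-returns shape is the same.

```python
PAGE_SIZE = 4096        # assumption: 4kB pages
BITS_PER_LEVEL = 6      # assumption: 64 slots per tree node


def walk_overhead(io_bytes, batch, base_pct=10.0):
    """Return (walks, overhead_pct) for one I/O.

    base_pct is the hypothetical walk overhead at batch size 1; the
    overhead is assumed to scale with the number of walks needed.
    """
    pages = io_bytes // PAGE_SIZE
    walks = (pages + batch - 1) // batch    # round up to whole batches
    return walks, base_pct * walks / pages


def tree_depth(file_bytes):
    """Levels needed to index every page of a file of this size."""
    pages = (file_bytes + PAGE_SIZE - 1) // PAGE_SIZE
    index_bits = max(1, (pages - 1).bit_length())
    return (index_bits + BITS_PER_LEVEL - 1) // BITS_PER_LEVEL


if __name__ == "__main__":
    for io_bytes, label in ((128 * 1024, "128kB"), (2 * 1024 * 1024, "2MB")):
        pages = io_bytes // PAGE_SIZE
        for batch in (1, 15, pages):
            walks, pct = walk_overhead(io_bytes, batch)
            print(f"{label} I/O, batch {batch:3d}: "
                  f"{walks:3d} walks, {pct:.2f}% overhead")
    print("levels for a 1TB file:", tree_depth(1 << 40))
```

Changing base_pct or the batch sizes in the loop shows how quickly the
remaining overhead shrinks once the batch size is in the low tens.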