On Mon, Mar 11, 2024 at 01:12:45PM +1100, Dave Chinner wrote: > > Batch size Cost of allocating 100 thousand million > > 1 500 (5 * 100) 5000 5M > > 2 300 (6 * 50) 3000 3M > > 4 200 (8 * 25) 2000 2M > > 8 156 (12 * 13) 1500 1.5M > > 16 140 (20 * 7) 1260 1.25M > > 32 144 (36 * 4) 1152 1.13M > > 64 136 (68 * 2) 1088 1.06M > > 128 132 (132 * 1) 1056 1.03M > > Isn't this just repeating the fundamental observation that SLUB is > based on? i.e. it can use high-order pages so that it can > pre-allocate optimally sized batches of objects regardless of their > size? i.e. it tries to size the backing page order to allocate in > chunks of 30-40 objects at a time? What SLUB is currently doing is inefficient. One of the conversations I had (off-list) about appropriate batch size is in relation to SLUB-ng where one of the participants claimed that the batch size of 32 was obviously too small (it wasn't; the performance problem was due to a bug). What you're thinking about is the cost of allocating from the page allocator (which is now much cheaper than it used to be, but should be cheaper than it currently is). But there is another inefficiency to consider, which is that the slab allocator has a per-slab lock, and while you can efficiently remove and add a number of objects to a single slab, you might only have one or two free objects per slab. To work around this some of the more performance sensitive parts of the kernel have implemented their own allocator in front of slab. This is clearly a bad thing for all of us, and hence Vlastimil has been working on a better approach. https://lore.kernel.org/linux-mm/20231129-slub-percpu-caches-v3-0-6bcf536772bc@xxxxxxx/ > Except for SLUB we're actually allocating in the hundreds of > millions to billions of objects on machines with TBs of RAM. IOWs we > really want to be much further down the curve than 8 - batches of at > least 32-64 have significantly lower cost and that matters when > scaling to (and beyond) hundreds of millions of objects.... But that doesn't necessarily mean that you want a larger batch size. Because you're not just allocating, you're also freeing and over a large enough timescale the number of objects allocated and freed is approximately equal. In the SLUB case, your batch size needs to be large enough to absorb most of the allcation-vs-free bias jitter; that is if you know they always alternate AFAFAFAFAF a batch size of 2 would be fine. If you know you get four allocations followed by four frees, having a batch size of 5 woud be fine. We'd never go to the parent allocator if we got a AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF pattern. > > This is a simple model for only one situation. If we have a locking > > contention breakdown, the overhead cost might be much higher than 4 units, > > and that would lead us to a larger batch size. > > > > Another consideration is how much of each object we have to touch. > > put_pages_list() is frequently called with batches of 500 pages. In order > > to free a folio, we have to manipulate its contents, so touching at least > > one cacheline per object. > > Right, that's simply the cost of the batch cache footprint issue > rather than a "fixed cost mitigation" described for allocation. No, it's not, it's an illustration that too large a batch size can actively harm you. > So I'm not sure what you're trying to say here? We've known about > these batch optimisation considerations for a long, long time and > that batch size optimisation is always algorithm and access pattern > dependent, so.... ??? People forget these "things we've always known". I went looking and couldn't find a good writeup of this, so did my own. In addition to the percpu slub batch size, various people have opined that the folio_batch size (15 objects) is too small for doing things like writeback and readahead. They're going to have to bring data to convince me. > > And we make multiple passes over the batch, > > first decrementing the refcount, removing it from the lru list; second > > uncharging the folios from the memcg (writes to folio->memcg_data); > > third calling free_pages_prepare which, eg, sets ->mapping to NULL; > > fourth putting the folio on the pcp list (writing to the list_head). > > Sounds like "batch cache footprint" would be reduced by inverting > that algorithm and doing all the work on a single object in a single > pass, rahter than doing it in multiple passes. That way the cache > footprint of the batch is determined entirely by the size of the > data structures accessed to process each object in the batch. Well, now you're just opining without having studied the problem, and I have, so I can say confidently that you're wrong. You could read the code if you like.