On Mon, Jun 26, 2023 at 07:47:42PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> > Upon closer investigation, it was found that in current codebase, lookup_bh_lru
> > is slower than __find_get_block_slow:
> >
> > 114 ns per __find_get_block
> > 68 ns per __find_get_block_slow
> >
> > So remove the per-CPU buffer_head caching.
>
> LOL. That's amazing. I can't even see why it's so expensive. The
> local_irq_disable(), perhaps? Your test case is the best possible
> one for lookup_bh_lru() where you're not even doing the copy.

I think it's even simpler than that.

i.e. the lookaside cache is being missed, so it's a pure cost and
the code is always having to call __find_get_block_slow() anyway.
Peeking at 16 buffers to not find a match is just as expensive as
walking 3-4 tree levels in an Xarray to find the buffer in the first
place....

IMO, this is an example of how lookaside caches are only a benefit
if the working set of items largely fits in the lookaside cache and
the cache lookup itself is much, much slower than a lookaside cache
miss.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
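
The lookaside-cache argument above can be put as a rough cost model. This
is only a sketch, not kernel code: the function names are hypothetical,
and the inputs used in the note below are assumptions (68 ns is the
__find_get_block_slow figure quoted above, 46 ns for a failed scan is
114 - 68 under the assumption that every lookup missed, and the 20 ns
hit cost is made up for illustration).

```c
/*
 * Toy cost model for a lookaside cache sitting in front of a slower
 * lookup. Hypothetical helpers for illustration only; nothing here is
 * taken from fs/buffer.c.
 */

/*
 * Expected cost per lookup in ns.
 *   hit_rate: fraction of lookups satisfied by the lookaside cache (0..1)
 *   hit_ns:   cost of a lookaside hit
 *   miss_ns:  cost of scanning the lookaside cache and finding nothing
 *   slow_ns:  cost of the underlying lookup that runs after a miss
 */
double expected_cost_ns(double hit_rate, double hit_ns,
                        double miss_ns, double slow_ns)
{
    return hit_rate * hit_ns + (1.0 - hit_rate) * (miss_ns + slow_ns);
}

/*
 * Hit rate above which the lookaside cache beats calling the slow
 * lookup directly. Derived from
 *   h * hit + (1 - h) * (miss + slow) < slow
 * which rearranges to
 *   h > miss / (slow + miss - hit).
 * Returns 2.0 (an impossible hit rate) when a hit is no cheaper than
 * the slow lookup, i.e. the lookaside cache can never win.
 */
double break_even_hit_rate(double hit_ns, double miss_ns, double slow_ns)
{
    if (slow_ns <= hit_ns)
        return 2.0;
    return miss_ns / (slow_ns + miss_ns - hit_ns);
}
```

With those assumed inputs, a hit rate of zero costs 46 + 68 = 114 ns per
lookup, and the lookaside cache only pays for itself once the hit rate
goes above 46 / 94, i.e. a bit under 49%. The cheaper the slow path gets
relative to a failed scan, the higher that threshold climbs.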