On Mon, Jun 26, 2023 at 07:47:42PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> > Upon closer investigation, it was found that in current codebase, lookup_bh_lru
> > is slower than __find_get_block_slow:
> >
> > 114 ns per __find_get_block
> > 68 ns per __find_get_block_slow
> >
> > So remove the per-CPU buffer_head caching.
>
> LOL. That's amazing. I can't even see why it's so expensive. The
> local_irq_disable(), perhaps? Your test case is the best possible
> one for lookup_bh_lru() where you're not even doing the copy.

Oops, that was due to incorrect buffer size being looked up versus
installed size.

About 15ns is due to irq disablement.
6ns due to checking preempt is disabled (from __this_cpu_read).

So the actual numbers for the single block lookup are
(searching for the same block number repeatedly):

42 ns per __find_get_block
68 ns per __find_get_block_slow

And increases linearly as the test increases the number of blocks which
are searched for:

	say mod 3 is

	__find_get_block(blocknr=1)
	__find_get_block(blocknr=2)
	__find_get_block(blocknr=3)

41 ns per __find_get_block mod=1
41 ns per __find_get_block mod=2
42 ns per __find_get_block mod=3
43 ns per __find_get_block mod=4
45 ns per __find_get_block mod=5
48 ns per __find_get_block mod=6
48 ns per __find_get_block mod=7
49 ns per __find_get_block mod=8
51 ns per __find_get_block mod=9
52 ns per __find_get_block mod=10
54 ns per __find_get_block mod=11
56 ns per __find_get_block mod=12
58 ns per __find_get_block mod=13
60 ns per __find_get_block mod=14
61 ns per __find_get_block mod=15
63 ns per __find_get_block mod=16
138 ns per __find_get_block mod=17
138 ns per __find_get_block mod=18
138 ns per __find_get_block mod=19	<-- results from first patch, when lookup_bh_lru is a miss

70 ns per __find_get_block_slow mod=1
71 ns per __find_get_block_slow mod=2
71 ns per __find_get_block_slow mod=3
71 ns per __find_get_block_slow mod=4
71 ns per __find_get_block_slow mod=5
72 ns per __find_get_block_slow mod=6
71 ns per __find_get_block_slow mod=7
72 ns per __find_get_block_slow mod=8
71 ns per __find_get_block_slow mod=9
71 ns per __find_get_block_slow mod=10
71 ns per __find_get_block_slow mod=11
71 ns per __find_get_block_slow mod=12
71 ns per __find_get_block_slow mod=13
71 ns per __find_get_block_slow mod=14
71 ns per __find_get_block_slow mod=15
71 ns per __find_get_block_slow mod=16
71 ns per __find_get_block_slow mod=17
72 ns per __find_get_block_slow mod=18
72 ns per __find_get_block_slow mod=19

ls on home directory:
hits: 2 misses: 91

find on a linux-2.6 git tree:
hits: 25453 misses: 51084

make clean on a linux-2.6 git tree:
hits: 247615 misses: 32855

make on a linux-2.6 git tree:
hits: 1410414 misses: 166896

In more detail, where each bucket below indicates which index into
per-CPU buffer lookup_bh_lru was found:

hits idx1 idx2 ... idx16
hits 139506 24299 21597 7462 15790 19108 6477 1349 1237 938 845 636 637 523 431 454
misses: 65773

So I think it makes more sense to just disable the cache for isolated
CPUs.

> Reviewed-by: Matthew Wilcox (oracle) <willy@xxxxxxxxxxxxx>

Thanks.
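
For reference, the shape of the mod=N numbers above (linear growth up to
mod=16, then a jump at mod=17) can be reproduced with a small user-space
model. This is only a sketch, not the kernel code: it assumes a 16-entry
array (matching the idx1..idx16 buckets above) that is scanned linearly
and kept in move-to-front order, and the names lru_lookup(), simulate()
and struct sim_result are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define LRU_SIZE 16	/* assumption: one slot per idx1..idx16 bucket */

struct lru {
	long slot[LRU_SIZE];	/* block numbers, most recent first; -1 = empty */
};

struct sim_result {
	long hits;
	long misses;
	int deepest_hit;	/* largest index a hit was found at, -1 if none */
};

/*
 * Linear scan for 'block'.  Returns the index it was found at, or -1 on
 * a miss.  Either way the block ends up in slot 0: on a hit the entries
 * in front of it shift down by one, on a miss the last entry is dropped.
 */
static int lru_lookup(struct lru *l, long block)
{
	int i, hit = -1, last;

	for (i = 0; i < LRU_SIZE; i++) {
		if (l->slot[i] == block) {
			hit = i;
			break;
		}
	}
	last = (hit >= 0) ? hit : LRU_SIZE - 1;
	memmove(&l->slot[1], &l->slot[0], last * sizeof(l->slot[0]));
	l->slot[0] = block;
	return hit;
}

/* Look up blocks 1..mod in order, 'rounds' times, starting from an empty LRU. */
static struct sim_result simulate(int mod, int rounds)
{
	struct sim_result res = { 0, 0, -1 };
	struct lru l;
	int r, i;
	long b;

	assert(mod >= 1 && rounds >= 1);
	for (i = 0; i < LRU_SIZE; i++)
		l.slot[i] = -1;

	for (r = 0; r < rounds; r++) {
		for (b = 1; b <= mod; b++) {
			int idx = lru_lookup(&l, b);

			if (idx < 0) {
				res.misses++;
			} else {
				res.hits++;
				if (idx > res.deepest_hit)
					res.deepest_hit = idx;
			}
		}
	}
	return res;
}
```

In this model, simulate(mod, rounds) with mod <= 16 misses only on the
first pass and afterwards always hits at index mod-1, so the scan depth
grows with mod. simulate(17, rounds) never hits, because each block is
evicted just before it is looked up again. That is consistent with the
63 ns -> 138 ns step between mod=16 and mod=17, although the model says
nothing about the actual per-lookup cost.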