On Tue, Jun 27, 2023 at 01:13:25AM +0100, Matthew Wilcox wrote:
> On Tue, Jun 27, 2023 at 09:30:09AM +1000, Dave Chinner wrote:
> > On Mon, Jun 26, 2023 at 07:47:42PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> > > > Upon closer investigation, it was found that in current codebase, lookup_bh_lru
> > > > is slower than __find_get_block_slow:
> > > >
> > > >   114 ns per __find_get_block
> > > >    68 ns per __find_get_block_slow
> > > >
> > > > So remove the per-CPU buffer_head caching.
> > >
> > > LOL.  That's amazing.  I can't even see why it's so expensive.  The
> > > local_irq_disable(), perhaps?  Your test case is the best possible
> > > one for lookup_bh_lru() where you're not even doing the copy.
> >
> > I think it's even simpler than that.
> >
> > i.e. the lookaside cache is being missed, so it's a pure cost and
> > the code is always having to call __find_get_block_slow() anyway.
>
> How does that happen?
>
> __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
> {
>         struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
>
>         if (bh == NULL) {
>                 /* __find_get_block_slow will mark the page accessed */
>                 bh = __find_get_block_slow(bdev, block);
>                 if (bh)
>                         bh_lru_install(bh);
>
> The second (and all subsequent) calls to __find_get_block() should find
> the BH in the LRU.
>
> > IMO, this is an example of how lookaside caches are only a benefit
> > if the working set of items largely fits in the lookaside cache and
> > the cache lookup itself is much, much slower than a lookaside cache
> > miss.
>
> But the test code he posted always asks for the same buffer each time.
> So it should find it in the lookaside cache?

Oh.

	for (i = 0; ....) {
		bh = __find_get_block(bdev, 1, 512);

That's a '1' being passed to __find_get_block, not 'i'.

/me goes and gets more coffee.

Maybe it's CONFIG_PREEMPT_RT=y doing something to the locks that
isn't obvious here...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
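
For anyone following along, here is a rough user-space model of the
lookaside pattern being argued about. It is only a sketch: the 16-entry
array, the linear scan and the install-on-miss policy are assumptions
made for illustration, not the kernel's actual lookup_bh_lru() /
bh_lru_install() code. It demonstrates the point above: if the test loop
asks for the same block ('1') every time, everything after the first call
should hit the small per-CPU-style cache; if it asked for a different
block ('i') every time, every call would pay for the failed scan *and*
the slow lookup.

	/*
	 * Minimal user-space sketch of a lookaside cache in front of a
	 * slow lookup.  Cache size, scan order and install policy are
	 * assumptions; this is not the kernel implementation.
	 */
	#include <stdio.h>
	#include <string.h>

	#define LRU_SIZE 16			/* assumed small per-CPU cache */

	struct entry {
		unsigned long long key;		/* stands in for (bdev, block, size) */
		int valid;
	};

	static struct entry lru[LRU_SIZE];	/* one CPU's lookaside cache */
	static int lookaside_hits;

	/* Stand-in for __find_get_block_slow(): the expensive full lookup. */
	static unsigned long long slow_lookup(unsigned long long key)
	{
		return key;			/* pretend the buffer was found */
	}

	/* Install the new entry at the head, pushing old entries out the tail. */
	static void lru_install(unsigned long long key)
	{
		memmove(&lru[1], &lru[0], (LRU_SIZE - 1) * sizeof(lru[0]));
		lru[0].key = key;
		lru[0].valid = 1;
	}

	/* Stand-in for __find_get_block(): try the lookaside cache first. */
	static unsigned long long find_block(unsigned long long key)
	{
		int i;

		for (i = 0; i < LRU_SIZE; i++) {
			if (lru[i].valid && lru[i].key == key) {
				lookaside_hits++;	/* lookaside hit */
				return key;
			}
		}
		/* Miss: we paid for the scan and still have to do the slow lookup. */
		lru_install(key);
		return slow_lookup(key);
	}

	int main(void)
	{
		int i;

		/* Always ask for block 1, as the posted test loop does. */
		for (i = 0; i < 1000; i++)
			(void)find_block(1);

		printf("%d of 1000 lookups hit the lookaside cache\n", lookaside_hits);
		return 0;
	}

Built with a plain cc and run, this reports 999 of 1000 lookups hitting
the cache. Change find_block(1) to find_block(i) and the hit count drops
to zero, since no block is ever asked for twice and each call then pays
for both the failed scan and the slow path.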