On Thu, Aug 16, 2018 at 09:57:40AM +0200, Carlos Maiolino wrote: > On Wed, Aug 15, 2018 at 07:26:30PM -0400, Kent Overstreet wrote: > > Small patch series to > > - firstly, refactor generic_file_buffered_read enough that it can be modified > > in more interesting ways without going insane, and then > > > > - secondly, change it to use find_get_pages_contig() to batch up the page > > operations, and then copy data to userspace in a separate loop that touches > > no other shared cachelines. > > > > I've been seeing profiles where the radix tree lookups in the buffered read path > > are a shockingly large portion of the profile (around 25%, if memory serves) - > > that's what this patch series is addressing. I've benchmarked small block reads > > as well, performance there is unaffected or slightly improved (it's within the > > margin of error). > > > > /me didn't review the patches, but... > > Could you share how you benchmarked it? Despite the fact I'm curious about it, > it's going to be interesting the 'proof' of such improvement. I tried coming up with a microbenchmark and gave up because it was getting too ridiculous - you need something _modifying_ the page cache radix tree for the contention to show up. That's usually going to be page reclaim, which means you can't just populate the page cache and run your benchmark, you have to be reading stuff in and evicting stuff. Which means you need waaay higher end IO devices than what I have at home for it to show up. But it's pretty obvious from the profiles of the affected workloads what's going on.