Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Tue, 27 Feb 2024 02:21:59 -0500

On Mon, Feb 26, 2024 at 03:48:35PM -0800, Linus Torvalds wrote:
> On Mon, 26 Feb 2024 at 14:46, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > I really haven't tested this AT ALL. I'm much too scared.
> 
> "Courage is not the absence of fear, but acting in spite of it"
>          - Paddington Bear / Michal Scott
> 
> It seems to actually boot here.
> 
> That said, from a quick test with lots of threads all hammering on the
> same page - I'm still not entirely convinced it makes a difference.
> Sure, the kernel profile changes, but filemap_get_read_batch() wasn't
> very high up in the profile to begin with.
> 
> I didn't do any actual performance testing, I just did a 64-byte pread
> at offset 0 in a loop in 64 threads on my 32c/64t machine.

Only rough testing, but this is looking like around a 25% performance
increase doing 4k random reads on a 1G file with fio, 8 jobs, on my
Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're
only spending half our cpu time in the kernel, so the buffered read path
is actually getting 40% or 50% faster.
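
To spell out that back-of-the-envelope arithmetic, here is a rough sketch.
It assumes the "half our cpu time in the kernel" split applies to the
baseline run and that the userspace (fio) cost per read is unchanged by the
patch. Neither was measured carefully, and the function name is just
illustrative.

```python
def kernel_path_change(iops_before, iops_after, kernel_fraction):
    """Estimate how much the in-kernel part of each read improved.

    Assumes kernel_fraction is the share of per-read time spent in the
    kernel before the change, and that userspace time per read is constant.
    """
    t_before = 1.0 / iops_before            # total time per read, before
    t_after = 1.0 / iops_after              # total time per read, after
    kernel_before = kernel_fraction * t_before
    user = t_before - kernel_before         # assumed unchanged by the patch
    kernel_after = t_after - user
    if kernel_after <= 0:
        raise ValueError("inputs imply non-positive kernel time")
    time_reduction = 1.0 - kernel_after / kernel_before
    throughput_gain = kernel_before / kernel_after - 1.0
    return time_reduction, throughput_gain


# Rough numbers from the fio run above: 16.7M -> 21.4M iops, ~50% in kernel.
reduction, gain = kernel_path_change(16.7e6, 21.4e6, 0.5)
print(f"kernel time per read down by {reduction:.0%}")
print(f"kernel-path throughput up by {gain:.0%}")
```

Under those assumptions the kernel time per read drops by about 44%, which
is where the 40-50% figure comes from. Expressed as throughput of the kernel
path alone, the same numbers look larger still.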

So I'd say that's substantial.

RCU freeing of pagecache pages would be even better - I think that'd let
us completely get rid of the barrier & xarray recheck, and we wouldn't
have to do it as a silly special case.