Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sun, 25 Feb 2024 01:04:16 -0500

On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote:
> On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> Before large folios, we had people very much bottlenecked by 4k page
> overhead on sequential IO; my customer/sponsor was one of them.
> 
> Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
> profiles and looked at the filemap.c code it wasn't hard to see why;
> we'd walk a radix tree, do an atomic op (get the page), then do a 4k
> usercopy... hence the work I did to break up
> generic_file_buffered_read() and vectorize it, which was a huge
> improvement.
> 
> It's definitely less of a factor post large folios and when we're
> talking about workloads that don't fit in cache, but I always wanted to
> do a generic version of the vectorized write path that btrfs and bcachefs
> have.

to expound further, our buffered io performance really is crap vs.
direct in lots of real world scenarios, and what was going on in
generic_file_buffered_read() was just one instance of a larger theme -
walking data structures, taking locks/atomics/barriers, then doing work
on the page/folio with cacheline bounces, in a loop - lots of places
where batching/vectorizing would help a lot, but where we do batch it
tends to be insufficient.
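
to make the pattern concrete, here's a toy userspace sketch, not kernel code. everything in it (toy_cache, toy_get_one(), toy_get_batch(), the counting "lock") is made up for illustration. it only shows the shape of the difference: the unbatched path pays one synchronization round trip per page, the batched path pays one per run of pages.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_SLOTS 64

struct toy_page {
	int refcount;
};

struct toy_cache {
	struct toy_page *slots[TOY_CACHE_SLOTS];
	unsigned long lock_trips;	/* how many times the "lock" was taken */
};

/* stand-ins for a real lock; they only count round trips */
static void toy_lock(struct toy_cache *c)   { c->lock_trips++; }
static void toy_unlock(struct toy_cache *c) { (void)c; }

/* unbatched: one lock round trip and one refcount bump per call */
static struct toy_page *toy_get_one(struct toy_cache *c, size_t idx)
{
	struct toy_page *p;

	assert(idx < TOY_CACHE_SLOTS);
	toy_lock(c);
	p = c->slots[idx];
	if (p)
		p->refcount++;
	toy_unlock(c);
	return p;
}

/*
 * batched: one lock round trip for a whole run of pages.
 * fills out[] with up to nr contiguous pages starting at start,
 * stops at the first hole, returns how many it got.
 */
static size_t toy_get_batch(struct toy_cache *c, size_t start, size_t nr,
			    struct toy_page **out)
{
	size_t got = 0;

	toy_lock(c);
	while (got < nr && start + got < TOY_CACHE_SLOTS &&
	       c->slots[start + got]) {
		out[got] = c->slots[start + got];
		out[got]->refcount++;
		got++;
	}
	toy_unlock(c);
	return got;
}
```

the caller then does the copy (or whatever the real work is) over out[] in a second tight loop, so the lookup cost is amortized over the run instead of being paid per page.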

i had patches that went further than the generic_file_buffered_read()
rework to vectorize add_to_page_cache_lru(), and that was another
significant improvement.

the pagecache lru operations were another hot spot... willy and I at one
point were spitballing getting rid of the linked list for a deque,
more for getting rid of the list_head in struct page/folio and replacing
it with a single size_t index, but it'd open up more vectorizing
possibilities
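
a rough userspace sketch of what an index-based lru could look like, just to show the idea. this is an assumption about the general shape, not what was actually discussed or prototyped; toy_folio, toy_lru and the tombstone-on-delete scheme are all hypothetical. the point is that each object carries one size_t instead of a two-pointer list_head, and the lru itself is a flat array that can be walked in bulk.

```c
#include <assert.h>
#include <stddef.h>

struct toy_folio {
	size_t lru_idx;		/* replaces a two-pointer list_head */
};

struct toy_lru {
	struct toy_folio **ring;	/* caller-provided storage */
	size_t size;			/* must be a power of two */
	size_t head;			/* next position to add at */
	size_t tail;			/* oldest position not yet evicted */
};

static void toy_lru_init(struct toy_lru *lru, struct toy_folio **ring,
			 size_t size)
{
	assert(size && !(size & (size - 1)));
	lru->ring = ring;
	lru->size = size;
	lru->head = lru->tail = 0;
}

/* add at the head; returns -1 if the ring is full */
static int toy_lru_add(struct toy_lru *lru, struct toy_folio *f)
{
	if (lru->head - lru->tail == lru->size)
		return -1;
	f->lru_idx = lru->head;
	lru->ring[lru->head & (lru->size - 1)] = f;
	lru->head++;
	return 0;
}

/* remove from the middle: leave a tombstone, no neighbours to touch */
static void toy_lru_del(struct toy_lru *lru, struct toy_folio *f)
{
	lru->ring[f->lru_idx & (lru->size - 1)] = NULL;
}

/* evict the oldest live entry, skipping tombstones */
static struct toy_folio *toy_lru_evict(struct toy_lru *lru)
{
	while (lru->tail != lru->head) {
		struct toy_folio *f = lru->ring[lru->tail & (lru->size - 1)];

		lru->tail++;
		if (f)
			return f;
	}
	return NULL;
}
```

a real version would have to deal with growing the ring, concurrency, and moving entries between lists, none of which this attempts.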

i give willy crap about the .readahead interface... the way we make the
filesystem code walk the xarray to get the folios instead of just
passing it a vector is stupid

folio_batch is stupid, it shouldn't be fixed size. there's no reason for
that to be a small fixed size array on the stack, the slub fastpath has
no atomic ops and doesn't disable preemption or interrupts - it's
_fast_. just use a darray and vectorize the whole operation
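
for illustration, a minimal growable pointer vector in userspace C. this is not the actual darray API, and ptr_vec, ptr_vec_push() and ptr_vec_exit() are made-up names; realloc() stands in for whatever allocator the kernel side would use. it just shows the alternative to a small fixed-size on-stack batch: the batch grows to cover the whole operation instead of forcing the caller to loop in small chunks.

```c
#include <assert.h>
#include <stdlib.h>

struct ptr_vec {
	void **data;
	size_t nr;	/* entries in use */
	size_t size;	/* entries allocated */
};

/* append one pointer, doubling the allocation when full */
static int ptr_vec_push(struct ptr_vec *v, void *p)
{
	assert(v->nr <= v->size);

	if (v->nr == v->size) {
		size_t new_size = v->size ? v->size * 2 : 8;
		void **d = realloc(v->data, new_size * sizeof(*d));

		if (!d)
			return -1;	/* old buffer is still valid */
		v->data = d;
		v->size = new_size;
	}

	v->data[v->nr++] = p;
	return 0;
}

/* free the backing storage and reset to empty */
static void ptr_vec_exit(struct ptr_vec *v)
{
	free(v->data);
	v->data = NULL;
	v->nr = v->size = 0;
}
```

with doubling, the number of allocations is logarithmic in the batch size, so for a large operation the allocator cost is small next to the per-item work being batched.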

but that wouldn't be the big gains, bigger would be hunting down all the
places that aren't vectorized and should be.

i haven't reviewed the recent .writepages work christoph et al. are
doing, if that's properly vectorized now that'll help