On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote:
> On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> Before large folios, we had people very much bottlenecked by 4k page
> overhead on sequential IO; my customer/sponsor was one of them.
>
> Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
> profiles and looked at the filemap.c code it wasn't hard to see why;
> we'd walk a radix tree, do an atomic op (get the page), then do a 4k
> usercopy... hence the work I did to break up
> generic_file_buffered_read() and vectorize it, which was a huge
> improvement.
>
> It's definitely less of a factor post large folios and when we're
> talking about workloads that don't fit in cache, but I always wanted to
> do a generic version of the vectorized write path that btrfs and bcachefs
> have.

To expound further: our buffered IO performance really is crap vs. direct
in lots of real world scenarios, and what was going on in
generic_file_buffered_read() was just one instance of a larger theme -
walking data structures, taking locks/atomics/barriers, then doing work
on the page/folio with cacheline bounces, in a loop. There are lots of
places where batching/vectorizing would help a lot, but what we do today
tends to be insufficient.

I had patches that went further than the generic_file_buffered_read()
rework to vectorize add_to_page_cache_lru(), and that was another
significant improvement.

The pagecache LRU operations were another hot spot... Willy and I at one
point were spitballing getting rid of the linked list for a deque, mostly
for getting rid of the list_head in struct page/folio and replacing it
with a single size_t index, but it'd also open up more vectorizing
possibilities.

I give Willy crap about the .readahead interface... the way we make the
filesystem code walk the xarray to get the folios instead of just passing
it a vector is stupid.

folio_batch is stupid too; it shouldn't be fixed size.
There's no reason for that to be a small fixed size array on the stack;
the slub fastpath has no atomic ops and doesn't disable preemption or
interrupts - it's _fast_. Just use a darray and vectorize the whole
operation.

But that wouldn't be where the big gains are; bigger would be hunting
down all the places that aren't vectorized and should be. I haven't
reviewed the recent .writepages work Christoph et al. are doing - if
that's properly vectorized now, that'll help.
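To illustrate the fixed-size-batch vs. growable-batch point, here's a
minimal user-space darray sketch - names and layout are hypothetical,
this is not the bcachefs darray or any kernel API - showing that a
growable batch is just an amortized-O(1) push through the allocator
fastpath instead of a hard cap like folio_batch's small on-stack array:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Toy growable pointer array ("darray"): instead of capping a batch at
 * a small fixed size, grow geometrically via the allocator.  Pushes
 * amortize to O(1); the capacity is whatever the operation needs.
 */
struct darray {
	void	**data;
	size_t	nr;	/* elements in use */
	size_t	size;	/* allocated capacity */
};

static int darray_push(struct darray *d, void *item)
{
	if (d->nr == d->size) {
		/* double capacity (start at 8) so growth cost amortizes */
		size_t new_size = d->size ? d->size * 2 : 8;
		void **p = realloc(d->data, new_size * sizeof(*p));

		if (!p)
			return -1;
		d->data = p;
		d->size = new_size;
	}
	d->data[d->nr++] = item;
	return 0;
}

static void darray_exit(struct darray *d)
{
	free(d->data);
	memset(d, 0, sizeof(*d));
}
```

With something like this, the whole batch for an operation can be
gathered in one go rather than looping in chunks of a fixed batch size.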
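As a user-space toy of the batching theme above - gather everything in
one walk of the index, then do the per-page work in a tight loop with no
lookups or atomics inside it - here's a sketch. All names here are
hypothetical; the real read path walks the pagecache xarray, takes
refcounts, and does usercopies:

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
	char data[TOY_PAGE_SIZE];
};

/*
 * Phase 1: walk the "index" once and collect pages for the requested
 * range into a batch.  In the kernel this is where the tree walk and
 * per-page atomic get would happen - once per page, but isolated here.
 */
static size_t gather_pages(struct toy_page **cache, size_t nr_cache,
			   size_t first, size_t nr,
			   struct toy_page **batch)
{
	size_t i, got = 0;

	for (i = 0; i < nr && first + i < nr_cache; i++)
		batch[got++] = cache[first + i];
	return got;
}

/*
 * Phase 2: copy out of the whole batch.  The inner loop is pure data
 * movement - no lookups, no locking - which is the point of the
 * vectorized generic_file_buffered_read() rework described above.
 */
static size_t copy_batch(char *dst, struct toy_page **batch, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		memcpy(dst + i * TOY_PAGE_SIZE, batch[i]->data,
		       TOY_PAGE_SIZE);
	return nr * TOY_PAGE_SIZE;
}
```

The unbatched shape interleaves lookup, get, and copy on every
iteration; splitting the phases is what lets the data-structure walks
and the copies each run hot.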