On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > > What are the limits to buffered IO
> > > and how do we test that? Who keeps track of it?
> >
> > TLDR: Why does the pagecache suck?
>
> What? No.
>
> Our page cache is so good that the question is literally "what are the
> limits of it", and "how we would measure them".
>
> That's not a sign of suckage.
>
> When you have to have completely unrealistic loads that nobody would
> actually care about in reality just to get a number for the limit,
> it's not a sign of problems.
>
> Or rather, the "problem" is the person looking at a stupid load, and
> going "we need to improve this because I can write a benchmark for
> this".
>
> Here's a clue: a hardware discussion forum I visit was arguing about
> memory latencies, and talking about how their measured overhead of
> DRAM latency was literally 85% on the CPU side, not the DRAM side.
>
> Guess what? It's because the CPU in question had quite a bit of L3,
> and it was spread out, and the CPU doesn't even start the memory
> access before it has checked caches.
>
> And here's a big honking clue: only a complete nincompoop and mentally
> deficient rodent would look at that and say "caches suck".
>
> > > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> > > Vs
> > > ~ 7,000 MiB/s with buffered IO
> >
> > Profile? My guess is that you're bottlenecked on the xa_lock between
> > memory reclaim removing folios from the page cache and the various
> > threads adding folios to the page cache.
>
> I doubt it's the locking.
>
> In fact, for writeout in particular it's probably not even the page
> cache at all.
>
> For writeout, we have a very traditional problem: we care about a
> million times more about latency than we care about throughput,
> because nobody ever actually cares all that much about performance of
> huge writes.

Before large folios, we had people very much bottlenecked by 4k page
overhead on sequential IO; my customer/sponsor was one of them. Factor
of 2 or 3, IIRC; it was _bad_.

And when you looked at the profiles and looked at the filemap.c code it
wasn't hard to see why; we'd walk a radix tree, do an atomic op (get
the page), then do a 4k usercopy... hence the work I did to break up
generic_file_buffered_read() and vectorize it, which was a huge
improvement.

It's definitely less of a factor post large folios and when we're
talking about workloads that don't fit in cache, but I always wanted to
do a generic version of the vectorized write path that btrfs and
bcachefs have.
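
For anyone who hasn't stared at that code: here's a toy userspace
sketch of what "vectorizing" the read loop means in practice. It is
not the real filemap.c path (no folios, refcounts, uptodate checks or
readahead), and every name in it is made up for illustration; the only
point is the shape of the batching - pay the lookup/locking cost once
per batch instead of once per 4k page before each usercopy.

/*
 * Toy userspace sketch, NOT kernel code.  A mutex-protected array
 * stands in for the page cache; toy_cache_lookup() and
 * toy_cache_lookup_batch() are hypothetical names.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE	4096
#define NR_PAGES	1024
#define BATCH		16

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static char cache[NR_PAGES][PAGE_SIZE];	/* stand-in for the page cache */

/* one "lookup": take the lock (tree walk + page ref), return one page */
static char *toy_cache_lookup(unsigned long index)
{
	char *page;

	pthread_mutex_lock(&cache_lock);
	page = cache[index];
	pthread_mutex_unlock(&cache_lock);
	return page;
}

/* vectorized "lookup": one locked pass returns up to @nr pages */
static unsigned toy_cache_lookup_batch(unsigned long index, char **pages,
				       unsigned nr)
{
	unsigned i;

	pthread_mutex_lock(&cache_lock);
	for (i = 0; i < nr && index + i < NR_PAGES; i++)
		pages[i] = cache[index + i];
	pthread_mutex_unlock(&cache_lock);
	return i;
}

/* old style: lookup, copy 4k, repeat - per-page overhead every iteration */
static void read_per_page(char *buf, unsigned long index, unsigned nr)
{
	for (unsigned i = 0; i < nr; i++)
		memcpy(buf + i * PAGE_SIZE, toy_cache_lookup(index + i),
		       PAGE_SIZE);
}

/* vectorized: amortize the lookup cost over a whole batch of pages */
static void read_batched(char *buf, unsigned long index, unsigned nr)
{
	char *pages[BATCH];

	while (nr) {
		unsigned n = toy_cache_lookup_batch(index, pages,
						    nr < BATCH ? nr : BATCH);

		if (!n)
			break;
		for (unsigned i = 0; i < n; i++)
			memcpy(buf + i * PAGE_SIZE, pages[i], PAGE_SIZE);
		buf += (size_t)n * PAGE_SIZE;
		index += n;
		nr -= n;
	}
}

int main(void)
{
	static char buf[64 * PAGE_SIZE];

	read_per_page(buf, 0, 64);
	read_batched(buf, 0, 64);
	printf("done\n");
	return 0;
}

The write side would want the same treatment: stage a whole batch of
folios, then do the usercopies in one pass, which is roughly what the
btrfs/bcachefs write paths already do internally.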