On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > There's also the small random 64 byte read case that we haven't optimised
> > for yet. That also bottlenecks on the page refcount atomic op.
> >
> > The proposed solution to that was double-copy; look up the page without
> > bumping its refcount, copy to a buffer, look up the page again to be
> > sure it's still there, copy from the buffer to userspace.
>
> Please stop the cray-cray.
>
> Yes, cache dirtying is expensive. But you don't actually have
> cacheline ping-pong, because you don't have lots of different CPU's
> hammering the same page cache page in any normal circumstances. So the
> really expensive stuff just doesn't exist.

Not ping-pong - you're just blowing the cachelines you want out of L1
with the big usercopy, hardware caches not being fully associative.

> I think you've been staring at profiles too much. In instruction-level
> profiles, the atomic ops stand out a lot. But that's at least partly
> artificial - they are a serialization point on x86, so things get
> accounted to them. So they tend to be the collection point for
> everything around them in an OoO CPU.

Yes, which leads to a fun game of whack-a-mole: you eliminate one atomic
op and everything just ends up piling up behind a different one - but
for the buffered read path, the folio get/put are the only atomic ops.

> For example, the fact that Kent complains about the page cache and
> talks about large folios is completely ludicrous. I've seen the
> benchmarks of real loads. Kent - you're not close to any limits, you
> are often a factor of two to five off other filesystems. We're not
> talking "a few percent", and we're not talking "the atomics are
> hurting".

Yes, there's a bunch of places where bcachefs is still slow; it'll get
there :)

If you've got those benchmarks handy and they're ones I haven't seen,
I'd love to take a look; the one that always jumps out at people is
small O_DIRECT reads, and that hasn't been a priority because O_DIRECT
doesn't matter to most people nearly as much as they think it does.

There's a bunch of stuff still to work through. Another that comes to
mind is that we need a free-inodes btree to eliminate scanning in inode
create, and that was half a day of work - except it also needs sharding
(i.e. leaf nodes can't span certain boundaries), and for that I need
variable sized btree nodes so we aren't burning stupid amounts of
memory - and that's something we need anyways, with the number of
btrees growing like it is.

Another fun one that I just discovered while I was hanging out at
Darrick's - the journal was stalling on high-iodepth workloads: the
device write buffer fills up, write latency goes up, and suddenly the
journal can't write quickly enough when it's only submitting one write
at a time. So there's a fix queued up for 6.9 that lets the journal
keep multiple writes in flight (rough sketch of the idea at the end of
this mail).

That one was worth mentioning because another fix would've been to add
a way to signal backpressure to /above/ the filesystem, so that we
don't hit such big queuing delays within the filesystem; right now user
writes don't hit backpressure until submit_bio() blocks because the
request queue is full. I've been seeing other performance corner cases
where it looks like such a mechanism would be helpful.

I expect I've got a solid year or two ahead of me of mostly just
working through performance bugs - standing up a lot of automated perf
testing and whatnot.
But, one thing at a time...
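
Since it came up at the top: my reading of the double-copy idea Willy
describes is that it's essentially an optimistic "copy first, validate
after" read - copy without taking a reference, then check the page
wasn't replaced underneath you, and retry or fall back to the
refcounted path if it was. Here's a rough userspace sketch of that
pattern; it is not the actual page cache code - the re-check is
modeled with a sequence counter rather than a second xarray lookup,
the names are made up, and real code would need READ_ONCE()-style
annotations on the data copy to be strictly race-free:

#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 64	/* the small random 64 byte read case */

struct cached_page {
	_Atomic unsigned long	seq;	/* bumped twice when the page is replaced */
	char			data[PAGE_SIZE];
};

/*
 * Returns 0 on success, -1 if we kept losing the race and should fall
 * back to the refcounted slow path.
 */
static int optimistic_read(struct cached_page *pg, char *out, size_t len)
{
	char buf[PAGE_SIZE];
	int retries = 3;

	while (retries--) {
		unsigned long seq = atomic_load_explicit(&pg->seq,
							 memory_order_acquire);
		if (seq & 1)		/* replacement in progress */
			continue;

		memcpy(buf, pg->data, len);	/* copy 1: page -> bounce buffer */

		/* order the data reads before re-reading the sequence number */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&pg->seq, memory_order_relaxed) == seq) {
			memcpy(out, buf, len);	/* copy 2: buffer -> "userspace" */
			return 0;
		}
		/* page was replaced while we were copying; retry */
	}
	return -1;
}

int main(void)
{
	static struct cached_page pg = { .data = "hello" };
	char out[PAGE_SIZE];

	if (!optimistic_read(&pg, out, sizeof(out)))
		printf("read: %s\n", out);
	return 0;
}

The extra memcpy is the cost; the win is that the common case touches
no shared cacheline for the refcount at all.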
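
And to make the journal fix above concrete: conceptually it's just
"allow more than one journal write outstanding at a time" instead of
submit-and-wait. A toy userspace illustration follows - this is not
the bcachefs code; POSIX AIO stands in for async bio submission, the
filename and constants are made up, and a real implementation would
use completion callbacks instead of polling (link with -lrt on older
glibc):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define IN_FLIGHT	4	/* max concurrent "journal" writes */
#define BUF_SIZE	4096
#define NR_WRITES	32

int main(void)
{
	static char bufs[IN_FLIGHT][BUF_SIZE];
	struct aiocb cbs[IN_FLIGHT];
	int busy[IN_FLIGHT] = { 0 };
	int submitted = 0, completed = 0;
	int fd = open("journal.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	while (completed < NR_WRITES) {
		for (int i = 0; i < IN_FLIGHT; i++) {
			if (busy[i]) {
				/* reap this slot if its write finished */
				if (aio_error(&cbs[i]) == EINPROGRESS)
					continue;
				aio_return(&cbs[i]);
				busy[i] = 0;
				completed++;
			}
			if (!busy[i] && submitted < NR_WRITES) {
				/*
				 * Refill the slot immediately - the whole
				 * point: don't wait for the previous write
				 * to complete before starting the next one.
				 */
				memset(bufs[i], 'j', BUF_SIZE);
				memset(&cbs[i], 0, sizeof(cbs[i]));
				cbs[i].aio_fildes = fd;
				cbs[i].aio_buf = bufs[i];
				cbs[i].aio_nbytes = BUF_SIZE;
				cbs[i].aio_offset = (off_t)submitted * BUF_SIZE;
				if (aio_write(&cbs[i]) == 0) {
					busy[i] = 1;
					submitted++;
				}
			}
		}
	}

	close(fd);
	return 0;
}

With one write in flight, throughput is bounded by per-write latency;
keeping a few outstanding hides the latency spike when the device's
write buffer fills up.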