On 12 Dec 2019, at 17:18, Dave Chinner wrote:
> On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
>>
>> So Chris and I started talking about this, and pondered "what would
>> happen if we simply bypassed the page cache completely?". Case in
>> point, see below incremental patch. We still do the page cache
>> lookup, and use that page to copy from if it's there. If the page
>> isn't there, allocate one and do IO to it, but DON'T add it to the
>> page cache. With that, we're almost at O_DIRECT levels of
>> performance for the 4k read case, within 1-2%. I think 512b would
>> look awesome, but we're reading full pages, so that won't really
>> help us much. Compared to the previous uncached method, this is
>> 30% faster on this device. That's substantial.
>
> Interesting idea, but this seems like it is just direct IO with
> kernel pages and a memcpy() rather than just mapping user pages, but
> has none of the advantages of direct IO in that we can run reads and
> writes concurrently because it's going through the buffered IO path.
>
> It also needs all the special DIO truncate/hole punch serialisation
> mechanisms to be propagated into the buffered IO path - the
> requirement for inode_dio_wait() serialisation is something I'm
> trying to remove from XFS, not have to add into more paths. And it
> introduces the same issues with other buffered read/mmap access to
> the same file ranges as direct IO has.
>
>> Obviously this has issues with truncate that would need to be
>> resolved, and it's definitely dirtier. But the performance is very
>> enticing...
>
> At which point I have to ask: why are we considering repeating the
> mistakes that were made with direct IO? Yes, it might be faster
> than a coherent RWF_UNCACHED IO implementation, but I don't think
> making it more like O_DIRECT is worth the price.
>
> And, ultimately, RWF_UNCACHED will never be as fast as direct IO
> because it *requires* the CPU to copy the data at least once.
They just have different tradeoffs. O_DIRECT actively blows away
caches and can also force writes during reads, making RWF_UNCACHED a
more natural fit for some applications. There are fewer surprises, and
some services are willing to pay for the flexibility with a memcpy. In
general, they still want to do some cache management, because it
reduces p90+ latencies across the board and gives them more control
over which pages stay in cache.

Most services using buffered IO here as part of their main workload
are pairing it with sync_file_range() and sometimes fadvise DONTNEED.
We've seen kswapd saturating cores with much slower flash than the
fancy stuff Jens is using, and the solution is usually O_DIRECT or
fadvise. Grepping through the code shows a wonderful assortment of
helpers to control the cache, and RWF_UNCACHED would be both cleaner
and faster than what we have today.

I'm on the fence about asking for RWF_FILE_RANGE_WRITE (+/- naming) to
force writes to start without pitching pages, but we can talk to some
service owners to see how useful that would be. They can always chain
a sync_file_range() in io_uring, but RWF_ would be lower overhead if
it were a common pattern.

With all of that said, I really agree that xfs+O_DIRECT wins on write
concurrency. Jens's current patches are a great first step, but I
think that if he really loved us, Jens would carve up a concurrent
pageless write patch series before Christmas.

> Direct IO is zero-copy, and so it's always going to have lower
> overhead than RWF_UNCACHED, and so when CPU or memory bandwidth is
> the limiting factor, O_DIRECT will always be faster.
>
> IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
> fool's game and attempting to do so is taking a step in the wrong
> direction architecturally. I'd much prefer a sane IO model for
> RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
> than compromise these things in the chase for ultimate performance.
No matter what I wrote in my letters to Santa this year, I agree that
we shouldn't compromise on avoiding the warts from O_DIRECT.

-chris