On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
> On 12/11/19 4:41 PM, Jens Axboe wrote:
> > On 12/11/19 1:18 PM, Linus Torvalds wrote:
> >> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>>
> >>> $ cat /proc/meminfo | grep -i active
> >>> Active:           134136 kB
> >>> Inactive:       28683916 kB
> >>> Active(anon):      97064 kB
> >>> Inactive(anon):        4 kB
> >>> Active(file):      37072 kB
> >>> Inactive(file): 28683912 kB
> >>
> >> Yeah, that should not put pressure on some swap activity. We have
> >> 28 GB of basically free inactive file data, and the VM is doing
> >> something very very bad if it then doesn't just quickly free it
> >> with no real drama.
> >>
> >> In fact, I don't think it should even trigger kswapd at all, it
> >> should all be direct reclaim. Of course, some of the mm people
> >> hate that with a passion, but this does look like a prime example
> >> of why it should just be done.
> >
> > For giggles, I ran just a single thread on the file set. We're only
> > doing about 100K IOPS at that point, yet when the page cache fills,
> > kswapd still eats 10% CPU. That seems like a lot for something that
> > slow.
>
> Warning, the below is from the really crazy department...
>
> Anyway, I took a closer look at the profiles for the uncached case.
> We're spending a lot of time doing memsets (this is the xa_node init,
> from the radix tree constructor), and call_rcu for the node free
> later on. All wasted time, and something that meant we weren't as
> close to the performance of O_DIRECT as we could be.
>
> So Chris and I started talking about this, and pondered "what would
> happen if we simply bypassed the page cache completely?". Case in
> point, see the below incremental patch. We still do the page cache
> lookup, and use that page to copy from if it's there. If the page
> isn't there, allocate one and do IO to it, but DON'T add it to the
> page cache.
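In outline, the scheme quoted above does something like this. This is
schematic kernel-style pseudocode, not the actual patch: find_get_page(),
__page_cache_alloc(), copy_page_to_iter() and put_page() are the real
page cache helpers, while read_into_page() is a hypothetical stand-in
for the synchronous read the patch would issue.

```
/* Schematic sketch only - not Jens' actual patch.
 * Read one page at offset 'index' without ever inserting
 * it into the page cache. */
ssize_t uncached_read_page(struct file *file, pgoff_t index,
			   struct iov_iter *to)
{
	struct address_space *mapping = file->f_mapping;
	struct page *page;
	ssize_t ret;

	/* Still honour data that is already cached. */
	page = find_get_page(mapping, index);
	if (!page) {
		page = __page_cache_alloc(mapping_gfp_mask(mapping));
		if (!page)
			return -ENOMEM;
		/* Synchronous IO into the private page; it is never
		 * added to the mapping, so no xa_node allocation,
		 * no call_rcu on removal. */
		ret = read_into_page(file, page, index); /* hypothetical */
		if (ret < 0) {
			put_page(page);
			return ret;
		}
	}
	ret = copy_page_to_iter(page, 0, PAGE_SIZE, to);
	/* On the cache-miss path this drops the last reference
	 * and simply frees the page. */
	put_page(page);
	return ret;
}
```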
> With that, we're almost at O_DIRECT levels of performance for the 4k
> read case, within 1-2%. I think 512b would look awesome, but we're
> reading full pages, so that won't really help us much. Compared to
> the previous uncached method, this is 30% faster on this device.
> That's substantial.

Interesting idea, but this seems like it is just direct IO with kernel
pages and a memcpy() rather than just mapping user pages, and it has
none of the advantages of direct IO - such as being able to run reads
and writes concurrently - because it's going through the buffered IO
path.

It also needs all the special DIO truncate/hole punch serialisation
mechanisms to be propagated into the buffered IO path - the
requirement for inode_dio_wait() serialisation is something I'm trying
to remove from XFS, not something I want to add into more paths. And
it introduces the same issues with other buffered read/mmap access to
the same file ranges as direct IO has.

> Obviously this has issues with truncate that would need to be
> resolved, and it's definitely dirtier. But the performance is very
> enticing...

At which point I have to ask: why are we considering repeating the
mistakes that were made with direct IO? Yes, it might be faster than a
coherent RWF_UNCACHED IO implementation, but I don't think making it
more like O_DIRECT is worth the price.

And, ultimately, RWF_UNCACHED will never be as fast as direct IO
because it *requires* the CPU to copy the data at least once. Direct
IO is zero-copy, so it will always have lower overhead than
RWF_UNCACHED, and when CPU or memory bandwidth is the limiting factor,
O_DIRECT will always be faster.

IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
fool's game, and attempting to do so is a step in the wrong direction
architecturally. I'd much prefer a sane IO model for RWF_UNCACHED that
provides coherency with mmap and other buffered IO than compromise
those things in the chase for ultimate performance.
Speaking of IO path architecture, perhaps what we really need here is
an iomap_apply()->iomap_read_actor loop similar to the write side.
That would allow us to bypass all the complex readahead shenanigans
that generic_file_buffered_read() has to deal with, and to directly
control page cache residency and build the exact IOs we need when
RWF_UNCACHED is set. It moves us much closer to the direct IO path in
terms of IO setup overhead and physical IO patterns, but still has all
the benefits of being fully cache coherent...

And, really, when we are talking about high end nvme drives that can
each do 5-10GB/s of reads, and we can put 20+ of them in a single
machine, there's no real value in doing readahead. i.e. there's little
read IO latency to hide in the first place, and such systems have
little memory bandwidth to spare to waste on readahead IO that we
don't end up using...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx