On 12/11/19 6:09 PM, Jens Axboe wrote:
> On 12/11/19 4:41 PM, Jens Axboe wrote:
>> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>
>>>> $ cat /proc/meminfo | grep -i active
>>>> Active:           134136 kB
>>>> Inactive:       28683916 kB
>>>> Active(anon):      97064 kB
>>>> Inactive(anon):        4 kB
>>>> Active(file):      37072 kB
>>>> Inactive(file): 28683912 kB
>>>
>>> Yeah, that should not put pressure on some swap activity. We have 28
>>> GB of basically free inactive file data, and the VM is doing
>>> something very very bad if it then doesn't just quickly free it with
>>> no real drama.
>>>
>>> In fact, I don't think it should even trigger kswapd at all, it
>>> should all be direct reclaim. Of course, some of the mm people hate
>>> that with a passion, but this does look like a prime example of why
>>> it should just be done.
>>
>> For giggles, I ran just a single thread on the file set. We're only
>> doing about 100K IOPS at that point, yet when the page cache fills,
>> kswapd still eats 10% CPU. That seems like a lot for something that
>> slow.
>
> Warning, the below is from the really crazy department...
>
> Anyway, I took a closer look at the profiles for the uncached case.
> We're spending a lot of time doing memsets (this is the xa_node init,
> from the radix tree constructor), and call_rcu for the node free later
> on. All wasted time, and something that meant we weren't as close to
> the performance of O_DIRECT as we could be.
>
> So Chris and I started talking about this, and pondered "what would
> happen if we simply bypassed the page cache completely?". Case in
> point, see the incremental patch below. We still do the page cache
> lookup, and use that page to copy from if it's there. If the page
> isn't there, allocate one and do IO to it, but DON'T add it to the
> page cache. With that, we're almost at O_DIRECT levels of performance
> for the 4k read case, within 1-2%. I think 512b would look awesome,
> but we're reading full pages, so that won't really help us much.
> Compared to the previous uncached method, this is 30% faster on this
> device. That's substantial.
>
> Obviously this has issues with truncate that would need to be
> resolved, and it's definitely dirtier. But the performance is very
> enticing...

Tested and cleaned up a bit, and added truncate protection through
inode_dio_begin()/inode_dio_end():

https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87

This is much faster than the previous page cache dance, and I _think_
we're OK as long as we block truncate and hole punching.

Comments?

-- 
Jens Axboe
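
For readers following along, here is a rough sketch of the read path
being described, assuming a ~5.4-era kernel. This is not the actual
patch (see the git.kernel.dk link above for that); the helper name
read_uncached_page() is invented, and locking and error/cleanup
handling are simplified for illustration:

	/*
	 * Sketch only, NOT the actual buffered-uncached patch.
	 * Error paths and cleanup are simplified.
	 */
	static struct page *read_uncached_page(struct file *file,
					       pgoff_t index)
	{
		struct address_space *mapping = file->f_mapping;
		struct page *page;
		int err;

		/* If the page cache already has the page, just use it */
		page = find_get_page(mapping, index);
		if (page)
			return page;

		/*
		 * Allocate a private page and read into it, but do NOT
		 * insert it into the page cache: no xa_node allocation
		 * and memset on insert, and no call_rcu when the node
		 * is freed later on.
		 */
		page = __page_cache_alloc(mapping_gfp_mask(mapping));
		if (!page)
			return ERR_PTR(-ENOMEM);

		/* ->readpage() expects a locked page with ->mapping set */
		__SetPageLocked(page);
		page->mapping = mapping;
		page->index = index;

		err = mapping->a_ops->readpage(file, page);
		if (!err) {
			/* readpage unlocks the page on IO completion */
			wait_on_page_locked(page);
			if (!PageUptodate(page))
				err = -EIO;
		}

		if (err) {
			/* must clear ->mapping before the final put */
			page->mapping = NULL;
			put_page(page);
			return ERR_PTR(err);
		}
		return page;
	}

The caller would copy_page_to_iter() out of the returned page and then
drop it (again clearing page->mapping first, since the page was never
owned by the page cache).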
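
The truncate protection mentioned above relies on the existing DIO
accounting that truncate and hole punching already synchronize against
via inode_dio_wait(). Sketched, with the uncached read standing in for
the elided middle:

	/* bump i_dio_count; truncate/hole punch will wait on it */
	inode_dio_begin(inode);

	/* ... perform the uncached read as sketched above ... */

	/* drop i_dio_count and wake any inode_dio_wait() caller */
	inode_dio_end(inode);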