On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>With anything that populates the inode/dentry cache with a lot of
> >>one-time-use inodes we can really put a lot of pressure on the
> >>system for things we don't need to keep in cache. It takes two runs
> >>through the LRU to evict these one-use entries, and if you have a
> >>lot of memory you can end up with tens of millions of entries in
> >>the dcache or icache that have never actually been touched since
> >>they were first instantiated, and it will take a lot of CPU and a
> >>lot of pressure to evict all of them.
> >>
> >>So instead do what we do with pagecache: only set the *REFERENCED
> >>flags if we are being used after we've been put onto the LRU. This
> >>makes a significant difference in the system's ability to evict
> >>these useless cache entries. With a fs_mark workload that creates
> >>40 million files we get the following results (all in files/sec).
> >
> >What's the workload, storage, etc?
>
> Oops, sorry, I thought I said it. It's fs_mark creating 20 million
> empty files on a single NVMe drive.

How big is the drive/filesystem (e.g. it has an impact on XFS
allocation concurrency)? And multiple btrfs subvolumes, too, by the
sound of it - how did you set those up? What about concurrency,
directory sizes, etc? Can you post the fsmark command line, as these
details do actually matter...

Getting the benchmark configuration needed to reproduce posted
results should not require playing 20 questions!

> >>The reason Btrfs has a much larger improvement is because it holds
> >>a lot more things in memory and so benefits more from faster slab
> >>reclaim, but it is an improvement across the board for each of the
> >>file systems.
> >
> >Less than 1% for XFS and ~1.5% for ext4 is well within the
> >run-to-run variation of fsmark. It looks like it might be slightly
> >faster, but it's not a cut-and-dried win for anything other than
> >btrfs.
> >
>
> Sure, the win in this benchmark clearly benefits btrfs the most, but
> I think the overall approach is sound and likely to help everybody
> in theory.

Yup, but without an explanation of why it makes such a big change to
btrfs, we can't really say what effect it's going to have.

Why does cycling the inode a second time through the LRU make any
difference? Once we're in steady state on this workload, one or two
cycles through the LRU should make no difference at all to
performance, because all the inodes are instantiated in identical
states (including the referenced bit) and so scanning treats every
inode identically. i.e. the reclaim rate (i.e. how often
evict_inode() is called) should be exactly the same, and the only
difference is the length of time between the inode being put on the
LRU and when it is evicted.

Is there an ordering difference in reclaim as a result of earlier
reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
rather than contending on spinning locks? Is it having to wait for
transaction commits or some other metadata IO because reclaim is
happening earlier?

i.e. the question I'm asking is: what, exactly, leads to such a
marked performance improvement in steady state behaviour?

I want to know because if behavioural changes in LRU reclaim order
have a significant effect on btrfs, then there are going to be
effects on other filesystems, too. Maybe not in this benchmark, but
we can't anticipate potential problems if we don't understand exactly
what is going on here.
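To make the two-pass behaviour I'm referring to concrete, here's a
minimal userspace sketch - a toy model of the referenced-bit policy,
not the kernel's actual list_lru walker - showing why entries that
start life with *REFERENCED set need two scans to leave the LRU,
while entries that only gain the flag when reused after going onto
the LRU are evicted in one:

/*
 * Toy model of the LRU "referenced bit" policy under discussion.
 * Assumptions: one flat array standing in for the LRU, one entry per
 * cached inode, and a scanner that gives referenced entries a second
 * chance instead of evicting them. Not the real list_lru code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct entry {
	bool referenced;	/* analogue of I_REFERENCED / DCACHE_REFERENCED */
	bool evicted;
};

/*
 * One pass over the LRU: clear-and-skip referenced entries,
 * evict unreferenced ones. Returns the number evicted this pass.
 */
static int scan_lru(struct entry *lru, int nr)
{
	int evicted = 0;

	for (int i = 0; i < nr; i++) {
		if (lru[i].evicted)
			continue;
		if (lru[i].referenced) {
			lru[i].referenced = false;	/* second chance */
			continue;
		}
		lru[i].evicted = true;
		evicted++;
	}
	return evicted;
}

static void run(const char *name, bool referenced_at_insert)
{
	enum { NR = 1000000 };
	struct entry *lru = calloc(NR, sizeof(*lru));
	int pass = 0, total = 0;

	/* Populate with one-time-use entries that are never touched again. */
	for (int i = 0; i < NR; i++)
		lru[i].referenced = referenced_at_insert;

	while (total < NR) {
		int n = scan_lru(lru, NR);
		pass++;
		printf("%s: pass %d evicted %d\n", name, pass, n);
		total += n;
	}
	free(lru);
}

int main(void)
{
	/* Old behaviour: *REFERENCED set at instantiation -> two passes. */
	run("referenced-on-create", true);
	/* Patched behaviour: set only on reuse after LRU insert -> one pass. */
	run("referenced-on-reuse", false);
	return 0;
}

In steady state, though, both policies converge to the same eviction
rate, which is exactly why the question above about where the btrfs
win actually comes from still stands.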
> Inside FB we definitely have had problems where the memory pressure
> induced by some idi^H^H^Hprocess that goes along and runs find /
> causes us to evict real things that are being used rather than these
> one-use inodes.

That's one of the problems the IO throttling in the XFS shrinker
tends to avoid, i.e. this is one of the specific cases we expect to
see on all production systems - backup applications are the common
cause of regular full-filesystem traversals.

FWIW, there's an element of deja vu in this thread: that XFS inode
cache shrinker IO throttling is exactly what Chris proposed we gut
last week to solve some other FB memory reclaim problem that had no
explanation of the root cause.

(http://www.spinics.net/lists/linux-xfs/msg01541.html)

> This sort of behavior could possibly be mitigated by this patch, but
> I haven't sat down to figure out a reliable way to mirror this
> workload to test that theory. Thanks

I use fsmark to create filesystems with tens of millions of small
files, then do things like run concurrent greps repeatedly over a
small portion of the directory structure (e.g. 1M cached inodes and
4GB worth of cached data) and then run concurrent find traversals
across the entire filesystem to simulate this sort of use-once vs
working-set reclaim...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx