From: Jeff Layton <jlayton@xxxxxxxxxx>

About a year ago, I sent a pile of patches that overhauled how the
inode->i_version field is handled in filesystems. This is a follow-up
to that initial series.

tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter by exploiting the fact that we don't need to increment it if no
one is looking at it. We can also clean up the code to prepare to
eventually expose this value via statx().

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA
will also use it to optimize away some remeasurement if it's available.
Only btrfs, ext4, and xfs implement it for data changes.

Because of this, these filesystems must log the inode to disk whenever
the i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change. [1]

It turns out, though, that none of these users of i_version require that
i_version change on every change to the file. The only real requirement
is that it be different if _something_ changed since the last time we
queried for it.

If we keep track of when something queries the value, we can skip both
the counter bump and the on-disk update whenever no one has queried the
value since it was last incremented and nothing else in the inode has
changed.

This patchset changes the code to only bump the i_version counter when
it's strictly necessary, or when we're updating the inode metadata
anyway (e.g. when times change).

It takes the approach of converting the existing accessors of i_version
to use a new API, while leaving the underlying implementation mostly the
same. The last patch then converts the existing implementation to keep
track of whether the value has been queried since it was last
incremented, and uses that to avoid incrementing the counter when it
can.

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when the value is not being queried (by far the most common case).

The pessimal workload here is 1 byte writes, and this set helps that
case significantly. Of course, that's not what we'd consider a
real-world workload.

A tiobench-example.fio workload also shows some modest performance
gains, and I've gotten mails from the kernel test robot that show some
significant performance gains on some microbenchmarks (case-msync-mt in
the vm-scalability testsuite to be specific), with an earlier version of
this set.

With larger writes, the gains with this patchset mostly evaporate, but
it does not seem to cause performance to regress anywhere, AFAICT.

I'm happy to run other workloads if anyone can suggest them.

At this point, the patchset works and does what it's expected to do in
my own testing. It seems like it's at least a modest performance win
across all 3 major disk-based filesystems. It may also encourage others
to implement i_version, since it reduces the cost.

[1]: On ext4 it must be turned on with the i_version mount option,
     mostly due to fears of incurring this impact, AFAICT.

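
To make the "only bump it if someone looked" idea concrete, here is a
minimal userspace sketch. It is not the code from this series: the names
(sk_inode, sk_query, sk_maybe_inc, SK_QUERIED) are hypothetical, and
storing the "queried" flag in the low bit of the counter is an
assumption made for the illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not the kernel API. */
#define SK_QUERIED   1ULL  /* assumed: low bit records "value was read" */
#define SK_INCREMENT 2ULL  /* the counter proper lives in the upper 63 bits */

struct sk_inode {
	_Atomic uint64_t version;
};

/*
 * Bump the counter only when forced (e.g. the inode is being logged
 * anyway) or when the value was queried since the last bump.
 * Returns true if the value changed, i.e. the caller would need to
 * write the inode out.
 */
bool sk_maybe_inc(struct sk_inode *ino, bool force)
{
	uint64_t cur = atomic_load(&ino->version);

	for (;;) {
		if (!force && !(cur & SK_QUERIED))
			return false;	/* nobody looked: skip the bump */

		/* Clear the flag and advance the counter in one step. */
		uint64_t next = (cur & ~SK_QUERIED) + SK_INCREMENT;

		if (atomic_compare_exchange_weak(&ino->version, &cur, next))
			return true;
		/* A failed exchange reloaded cur; re-evaluate and retry. */
	}
}

/*
 * Read the counter and mark it as queried, so that the next change
 * is guaranteed to produce a different value.
 */
uint64_t sk_query(struct sk_inode *ino)
{
	uint64_t cur = atomic_load(&ino->version);

	while (!(cur & SK_QUERIED)) {
		if (atomic_compare_exchange_weak(&ino->version, &cur,
						 cur | SK_QUERIED))
			break;
		/* A failed exchange reloaded cur; the loop re-checks the flag. */
	}
	return cur >> 1;	/* strip the flag bit */
}
```

In this sketch, any number of sk_maybe_inc(ino, false) calls on an
inode that has not been queried leave the value untouched. After one
sk_query(), the next call bumps it exactly once and the calls after that
are no-ops again until the next query. That is the property the users
above need: the value differs if something changed since the last query.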
Jeff Layton (19):
  fs: new API for handling inode->i_version
  fs: don't take the i_lock in inode_inc_iversion
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: only set S_VERSION when updating times if necessary
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was changed
  fs: handle inode->i_version more efficiently

 fs/affs/amigaffs.c                |   4 +-
 fs/affs/dir.c                     |   4 +-
 fs/affs/super.c                   |   2 +-
 fs/afs/fsclient.c                 |   2 +-
 fs/afs/inode.c                    |   4 +-
 fs/btrfs/delayed-inode.c          |   6 +-
 fs/btrfs/inode.c                  |  11 +-
 fs/btrfs/tree-log.c               |   3 +-
 fs/exofs/dir.c                    |   8 +-
 fs/exofs/super.c                  |   2 +-
 fs/ext2/dir.c                     |   8 +-
 fs/ext2/super.c                   |   4 +-
 fs/ext4/dir.c                     |   8 +-
 fs/ext4/inline.c                  |   6 +-
 fs/ext4/inode.c                   |  12 +-
 fs/ext4/ioctl.c                   |   2 +-
 fs/ext4/namei.c                   |   4 +-
 fs/ext4/super.c                   |   2 +-
 fs/ext4/xattr.c                   |   4 +-
 fs/fat/dir.c                      |   2 +-
 fs/fat/inode.c                    |   8 +-
 fs/fat/namei_msdos.c              |   6 +-
 fs/fat/namei_vfat.c               |  20 +--
 fs/inode.c                        |   9 +-
 fs/nfs/delegation.c               |   2 +-
 fs/nfs/fscache-index.c            |   4 +-
 fs/nfs/inode.c                    |  16 +--
 fs/nfs/nfs4proc.c                 |   9 +-
 fs/nfs/nfstrace.h                 |   4 +-
 fs/nfs/write.c                    |   7 +-
 fs/nfsd/nfsfh.h                   |   2 +-
 fs/ocfs2/dir.c                    |  14 +--
 fs/ocfs2/inode.c                  |   2 +-
 fs/ocfs2/namei.c                  |   2 +-
 fs/ocfs2/quota_global.c           |   2 +-
 fs/ufs/dir.c                      |   8 +-
 fs/ufs/inode.c                    |   2 +-
 fs/ufs/super.c                    |   2 +-
 fs/xfs/libxfs/xfs_inode_buf.c     |   5 +-
 fs/xfs/xfs_icache.c               |   4 +-
 fs/xfs/xfs_inode.c                |   2 +-
 fs/xfs/xfs_inode_item.c           |   2 +-
 fs/xfs/xfs_trans_inode.c          |  14 ++-
 include/linux/fs.h                | 250 ++++++++++++++++++++++++++++++++++++--
 security/integrity/ima/ima_api.c  |   2 +-
 security/integrity/ima/ima_main.c |   2 +-
 46 files changed, 371 insertions(+), 127 deletions(-)

--
2.14.3