Theodore Ts'o <tytso@xxxxxxx> writes: > Add a new mount option which enables a new "lazytime" mode. This mode > causes atime, mtime, and ctime updates to only be made to the > in-memory version of the inode. The on-disk times will only get > updated when (a) when the inode table block for the inode needs to be > updated for some non-time related change involving any inode in the > block, (b) if userspace calls fsync(), or (c) the refcount on an > undeleted inode goes to zero (in most cases, when the last file > descriptor assoicated with the inode is closed). > > This is legal according to POSIX because there are no guarantees after > a crash unless userspace explicitly requests via a fsync(2) call. So > in fact, this a better way of reducing the disk traffic resulting from > atime is use lazytime instead of relatime or noatime. Enabling > lazytime and disabling the default realtime will result in fewer extra > disk writes, and has the benefits of being POSIX-compliant --- since > either noatime and relatime violates POSIX. Indeed, in fact even current implementation can not guarantee correct mtime in case of power-fail for example: DIO_WRITE_TASK ->file_update_time: journal_start save mtime in journal in tid:B journal_stop ->direct_IO(): modify files's content which may be flushed to non volatile storage. POWER_FAILURE NEW_MOUNT: journal replay will find that tid:B has no commit record and ignore it. As result file has old mtime, but new content > > The lazytime mode reduces pressure on the journal spinlocks, since > time updates resulting from calls to file_update_time() are almost > always done using separate jbd2 handles. For allocating writes, the > inode will need to be updated anyway when i_blocks change, and so the > mtime updates will be folded into jbd2 handle in the ext4 write path. > > In addition, for workloads feature a large number of random write to a > preallocated file, the lazytime mount option significantly reduces > writes to the inode table. The repeated 4k writes to a single block > will result in undesirable stress on flash devices and SMR disk > drives. Even on conventional HDD's, the repeated writes to the inode > table block will trigger Adjacent Track Interference (ATI) remediation > latencies, which very negatively impact 99.9 percentile latencies --- > which is a very big deal for web serving tiers (for example). Also sync mtime updates is a great pain for AIO submitter because AIO submission may be blocked for a seconds (up to 5 second in my case) if inode is part of current committing transaction see: do_get_write_access > > n.b.: because of the many wins of this mode, we may want to enable > lazytime updates by default in the future. If you know of use cases > where having inaccurate mtime values after a crash would be extremely > problematic, please us know at linux-ext4@xxxxxxxxxxxxxxx. > > Google-Bug-Id: 18297052 Yeah we also has ticket for that :) https://jira.sw.ru/browse/PSBM-20411 > > Signed-off-by: Theodore Ts'o <tytso@xxxxxxx> Patch looks good with minor nodes, please see below. > --- > fs/ext4/ext4.h | 3 +++ > fs/ext4/file.c | 1 + > fs/ext4/fsync.c | 3 +++ > fs/ext4/inode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > fs/ext4/namei.c | 2 ++ > fs/ext4/super.c | 14 ++++++++++++ > fs/ext4/symlink.c | 2 ++ > fs/inode.c | 36 +++++++++++++++++++++++++++++++ > include/linux/fs.h | 2 ++ > 9 files changed, 124 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index c55a1fa..494c504 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -970,6 +970,7 @@ struct ext4_inode_info { > #define EXT4_MOUNT_ERRORS_MASK 0x00070 > #define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */ > #define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ > +#define EXT4_MOUNT_LAZYTIME 0x00200 /* Update the time lazily */ > #define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ > #define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */ > #define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */ > @@ -1407,6 +1408,7 @@ enum { > EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */ > EXT4_STATE_ORDERED_MODE, /* data=ordered mode */ > EXT4_STATE_EXT_PRECACHED, /* extents have been precached */ > + EXT4_STATE_DIRTY_TIME, /* the time needs to be updated */ > }; > > #define EXT4_INODE_BIT_FNS(name, field, offset) \ > @@ -2114,6 +2116,7 @@ extern int ext4_write_inode(struct inode *, struct writeback_control *); > extern int ext4_setattr(struct dentry *, struct iattr *); > extern int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry, > struct kstat *stat); > +extern int ext4_update_time(struct inode *, struct timespec *, int); > extern void ext4_evict_inode(struct inode *); > extern void ext4_clear_inode(struct inode *); > extern int ext4_sync_inode(handle_t *, struct inode *); > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index 8131be8..2cf6aaf 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -603,6 +603,7 @@ const struct file_operations ext4_file_operations = { > const struct inode_operations ext4_file_inode_operations = { > .setattr = ext4_setattr, > .getattr = ext4_getattr, > + .update_time = ext4_update_time, > .setxattr = generic_setxattr, > .getxattr = generic_getxattr, > .listxattr = ext4_listxattr, > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > index a8bc47f..ba05c83 100644 > --- a/fs/ext4/fsync.c > +++ b/fs/ext4/fsync.c > @@ -116,6 +116,9 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > ret = filemap_write_and_wait_range(inode->i_mapping, start, end); > if (ret) > return ret; > + > + if (!datasync && ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME)) > + ext4_dirty_inode(inode, 0); > /* > * data=writeback,ordered: > * The caller's filemap_fdatawrite()/wait will sync the data. > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 3356ab5..1b5e4bd 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4163,6 +4163,46 @@ static int ext4_inode_blocks_set(handle_t *handle, > } > > /* > + * Opportunistically update the other time fields for other inodes in > + * the same inode table block. > + */ > +static void ext4_update_other_inodes_time(struct super_block *sb, > + unsigned long orig_ino, char *buf) > +{ > + struct ext4_inode_info *ei; > + struct ext4_inode *raw_inode; > + unsigned long ino; > + struct inode *inode; > + int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block; > + int inode_size = EXT4_INODE_SIZE(sb); > + > + ino = orig_ino & ~(inodes_per_block - 1); > + for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) { > + if (ino == orig_ino) > + continue; > + inode = find_active_inode_nowait(sb, ino); > + if (!inode || > + !ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME)) { > + iput(inode); > + continue; > + } > + raw_inode = (struct ext4_inode *) buf; > + ei = EXT4_I(inode); > + > + smp_mb__before_spinlock(); > + spin_lock(&ei->i_raw_lock); > + ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_TIME); > + EXT4_INODE_GET_XTIME(i_ctime, inode, raw_inode); > + EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode); > + EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode); > + EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode); > + ext4_inode_csum_set(inode, raw_inode, ei); > + spin_unlock(&ei->i_raw_lock); > + iput(inode); > + } > +} > + > +/* > * Post the struct inode info into an on-disk inode location in the > * buffer-cache. This gobbles the caller's reference to the > * buffer_head in the inode location struct. > @@ -4182,7 +4222,9 @@ static int ext4_do_update_inode(handle_t *handle, > uid_t i_uid; > gid_t i_gid; > > + smp_mb__before_spinlock(); > spin_lock(&ei->i_raw_lock); > + ext4_clear_inode_state(inode, EXT4_STATE_DIRTY_TIME); > > /* For fields not tracked in the in-memory inode, > * initialise them to zero for new inodes. */ > @@ -4273,8 +4315,8 @@ static int ext4_do_update_inode(handle_t *handle, > } > > ext4_inode_csum_set(inode, raw_inode, ei); > - > spin_unlock(&ei->i_raw_lock); > + ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data); > > BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata"); > rc = ext4_handle_dirty_metadata(handle, NULL, bh); > @@ -4622,6 +4664,24 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry, > return 0; > } > > +int ext4_update_time(struct inode *inode, struct timespec *time, int flags) > +{ > + if (flags & S_ATIME) > + inode->i_atime = *time; > + if (flags & S_VERSION) > + inode_inc_iversion(inode); > + if (flags & S_CTIME) > + inode->i_ctime = *time; > + if (flags & S_MTIME) > + inode->i_mtime = *time; > + if (test_opt(inode->i_sb, LAZYTIME)) { > + smp_wmb(); > + ext4_set_inode_state(inode, EXT4_STATE_DIRTY_TIME); Since we want update all in-memory data we also have to explicitly update inode->i_version Which was previously updated implicitly here: mark_inode_dirty_sync() ->__mark_inode_dirty ->ext4_dirty_inode ->ext4_mark_inode_dirty ->ext4_mark_iloc_dirty ->inode_inc_iversion(inode); > + } else > + mark_inode_dirty_sync(inode); > + return 0; > +} > + > static int ext4_index_trans_blocks(struct inode *inode, int lblocks, > int pextents) > { > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index 4262118..f782040 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -3532,6 +3532,7 @@ const struct inode_operations ext4_dir_inode_operations = { > .tmpfile = ext4_tmpfile, > .rename2 = ext4_rename2, > .setattr = ext4_setattr, > + .update_time = ext4_update_time, > .setxattr = generic_setxattr, > .getxattr = generic_getxattr, > .listxattr = ext4_listxattr, > @@ -3545,6 +3546,7 @@ const struct inode_operations ext4_special_inode_operations = { > .setattr = ext4_setattr, > .setxattr = generic_setxattr, > .getxattr = generic_getxattr, > + .update_time = ext4_update_time, > .listxattr = ext4_listxattr, > .removexattr = generic_removexattr, > .get_acl = ext4_get_acl, > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index 2c9e686..16c9983 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -910,6 +910,14 @@ static int ext4_drop_inode(struct inode *inode) > int drop = generic_drop_inode(inode); > > trace_ext4_drop_inode(inode, drop); > + if (!drop && ext4_test_inode_state(inode, EXT4_STATE_DIRTY_TIME)) { > + atomic_inc(&inode->i_count); > + spin_unlock(&inode->i_lock); > + ext4_dirty_inode(inode, 0); > + spin_lock(&inode->i_lock); > + if (atomic_dec_and_test(&inode->i_count)) > + drop = generic_drop_inode(inode); > + } > return drop; > } > > @@ -1142,6 +1150,7 @@ enum { > Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err, > Opt_usrquota, Opt_grpquota, Opt_i_version, > Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit, > + Opt_lazytime, Opt_nolazytime, > Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity, > Opt_inode_readahead_blks, Opt_journal_ioprio, > Opt_dioread_nolock, Opt_dioread_lock, > @@ -1204,6 +1213,8 @@ static const match_table_t tokens = { > {Opt_i_version, "i_version"}, > {Opt_stripe, "stripe=%u"}, > {Opt_delalloc, "delalloc"}, > + {Opt_lazytime, "lazytime"}, > + {Opt_nolazytime, "nolazytime"}, > {Opt_nodelalloc, "nodelalloc"}, > {Opt_removed, "mblk_io_submit"}, > {Opt_removed, "nomblk_io_submit"}, > @@ -1361,6 +1372,8 @@ static const struct mount_opts { > MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT}, > {Opt_nodelalloc, EXT4_MOUNT_DELALLOC, > MOPT_EXT4_ONLY | MOPT_CLEAR}, > + {Opt_lazytime, EXT4_MOUNT_LAZYTIME, MOPT_SET}, > + {Opt_nolazytime, EXT4_MOUNT_LAZYTIME, MOPT_CLEAR}, > {Opt_journal_checksum, EXT4_MOUNT_JOURNAL_CHECKSUM, > MOPT_EXT4_ONLY | MOPT_SET}, > {Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT | > @@ -3514,6 +3527,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) > > /* Set defaults before we parse the mount options */ > def_mount_opts = le32_to_cpu(es->s_default_mount_opts); > + set_opt(sb, LAZYTIME); > set_opt(sb, INIT_INODE_TABLE); > if (def_mount_opts & EXT4_DEFM_DEBUG) > set_opt(sb, DEBUG); > diff --git a/fs/ext4/symlink.c b/fs/ext4/symlink.c > index ff37119..7c92b93 100644 > --- a/fs/ext4/symlink.c > +++ b/fs/ext4/symlink.c > @@ -35,6 +35,7 @@ const struct inode_operations ext4_symlink_inode_operations = { > .follow_link = page_follow_link_light, > .put_link = page_put_link, > .setattr = ext4_setattr, > + .update_time = ext4_update_time, > .setxattr = generic_setxattr, > .getxattr = generic_getxattr, > .listxattr = ext4_listxattr, > @@ -45,6 +46,7 @@ const struct inode_operations ext4_fast_symlink_inode_operations = { > .readlink = generic_readlink, > .follow_link = ext4_follow_link, > .setattr = ext4_setattr, > + .update_time = ext4_update_time, > .setxattr = generic_setxattr, > .getxattr = generic_getxattr, > .listxattr = ext4_listxattr, > diff --git a/fs/inode.c b/fs/inode.c > index 26753ba..cde073a 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1280,6 +1280,42 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino) > } > EXPORT_SYMBOL(ilookup); > > +/** > + * find_active_inode_nowait - find an active inode in the inode cache > + * @sb: super block of file system to search > + * @ino: inode number to search for > + * > + * Search for an active inode @ino in the inode cache, and if the > + * inode is in the cache, the inode is returned with an incremented > + * reference count. If the inode is being freed or is newly > + * initialized, return nothing instead of trying to wait for the inode > + * initialization or destruction to be complete. > + */ > +struct inode *find_active_inode_nowait(struct super_block *sb, > + unsigned long ino) > +{ > + struct hlist_head *head = inode_hashtable + hash(sb, ino); > + struct inode *inode, *ret_inode = NULL; > + > + spin_lock(&inode_hash_lock); > + hlist_for_each_entry(inode, head, i_hash) { > + if ((inode->i_ino != ino) || > + (inode->i_sb != sb)) > + continue; > + spin_lock(&inode->i_lock); > + if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) { > + __iget(inode); > + ret_inode = inode; > + } > + spin_unlock(&inode->i_lock); > + goto out; > + } > +out: > + spin_unlock(&inode_hash_lock); > + return ret_inode; > +} > +EXPORT_SYMBOL(find_active_inode_nowait); > + > int insert_inode_locked(struct inode *inode) > { > struct super_block *sb = inode->i_sb; > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9ab779e..b5e6b6b 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2410,6 +2410,8 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino); > > extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *); > extern struct inode * iget_locked(struct super_block *, unsigned long); > +extern struct inode *find_active_inode_nowait(struct super_block *, > + unsigned long); > extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *); > extern int insert_inode_locked(struct inode *); > #ifdef CONFIG_DEBUG_LOCK_ALLOC > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html
Attachment:
signature.asc
Description: PGP signature