On Wed, Jun 26, 2024 at 09:00:24PM -0400, Jeff Layton wrote: > The VFS always uses coarse-grained timestamps when updating the ctime > and mtime after a change. This has the benefit of allowing filesystems > to optimize away a lot metadata updates, down to around 1 per jiffy, > even when a file is under heavy writes. > > Unfortunately, this has always been an issue when we're exporting via > NFSv3, which relies on timestamps to validate caches. A lot of changes > can happen in a jiffy, so timestamps aren't sufficient to help the > client decide to invalidate the cache. Even with NFSv4, a lot of > exported filesystems don't properly support a change attribute and are > subject to the same problems with timestamp granularity. Other > applications have similar issues with timestamps (e.g backup > applications). > > If we were to always use fine-grained timestamps, that would improve the > situation, but that becomes rather expensive, as the underlying > filesystem would have to log a lot more metadata updates. > > What we need is a way to only use fine-grained timestamps when they are > being actively queried. Now that the ctime is stored as a ktime_t, we > can sacrifice the lowest bit in the word to act as a flag marking > whether the current timestamp has been queried via stat() or the like. > > This solves the problem of being able to distinguish the timestamp > between updates, but introduces a new problem: it's now possible for a > file being changed to get a fine-grained timestamp and then a file that > was altered later to get a coarse-grained one that appears older than > the earlier fine-grained time. To remedy this, keep a global ktime_t > value that acts as a timestamp floor. > > When we go to stamp a file, we first get the latter of the current floor > value and the current coarse-grained time (call this "now"). If the > current inode ctime hasn't been queried then we just attempt to stamp it > with that value using a cmpxchg() operation. > > If it has been queried, then first see whether the current coarse time > appears later than what we have. If it does, then we accept that value. > If it doesn't, then we get a fine-grained time and try to swap that into > the global floor. Whether that succeeds or fails, we take the resulting > floor time and try to swap that into the ctime. > > There is still one remaining problem: > > All of this works as long as the realtime clock is monotonically > increasing. If the clock ever jumps backwards, then we could end up in a > situation where the floor value is "stuck" far in advance of the clock. > > To remedy this, sanity check the floor value and if it's more than 6ms > (~2 jiffies) ahead of the current coarse-grained clock, disregard the > floor value, and just accept the current coarse-grained clock. > > Filesystems opt into this by setting the FS_MGTIME fstype flag. One > caveat: those that do will always present ctimes that have the lowest > bit unset, even when the on-disk ctime has it set. > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > --- > fs/inode.c | 168 +++++++++++++++++++++++++++++++++------ > fs/stat.c | 39 ++++++++- > include/linux/fs.h | 30 +++++++ > include/trace/events/timestamp.h | 97 ++++++++++++++++++++++ > 4 files changed, 306 insertions(+), 28 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index 5d2b0dfe48c3..12790a26102c 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init; > static struct hlist_head *inode_hashtable __ro_after_init; > static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock); > > +/* Don't send out a ctime lower than this (modulo backward clock jumps). */ > +static __cacheline_aligned_in_smp ktime_t ctime_floor; This is piece of memory that will be hit pretty hard (and you obviously recognize that because of the alignment attribute). Would it be of any benefit to keep a distinct ctime_floor in each super block instead? > /* > * Empty aops. Can be used for the cases where the user does not > * define any of the address_space operations. > @@ -2077,19 +2079,86 @@ int file_remove_privs(struct file *file) > } > EXPORT_SYMBOL(file_remove_privs); > > +/* > + * The coarse-grained clock ticks once per jiffy (every 2ms or so). If the > + * current floor is >6ms in the future, assume that the clock has jumped > + * backward. > + */ > +#define CTIME_FLOOR_MAX_NS 6000000 > + > +/** > + * coarse_ctime - return the current coarse-grained time > + * @floor: current ctime_floor value > + * > + * Get the coarse-grained time, and then determine whether to > + * return it or the current floor value. Returns the later of the > + * floor and coarse grained time, unless the floor value is too > + * far into the future. If that happens, assume the clock has jumped > + * backward, and that the floor should be ignored. > + */ > +static ktime_t coarse_ctime(ktime_t floor) > +{ > + ktime_t now = ktime_get_coarse_real() & ~I_CTIME_QUERIED; > + > + /* If coarse time is already newer, return that */ > + if (ktime_before(floor, now)) > + return now; > + > + /* Ensure floor is not _too_ far in the future */ > + if (ktime_after(floor, now + CTIME_FLOOR_MAX_NS)) > + return now; > + > + return floor; > +} > + > +/** > + * current_time - Return FS time (possibly fine-grained) > + * @inode: inode. > + * > + * Return the current time truncated to the time granularity supported by > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged > + * as having been QUERIED, get a fine-grained timestamp. > + */ > +struct timespec64 current_time(struct inode *inode) > +{ > + ktime_t ctime, floor = smp_load_acquire(&ctime_floor); > + ktime_t now = coarse_ctime(floor); > + struct timespec64 now_ts = ktime_to_timespec64(now); > + > + if (!is_mgtime(inode)) > + goto out; > + > + /* If nothing has queried it, then coarse time is fine */ > + ctime = smp_load_acquire(&inode->__i_ctime); > + if (ctime & I_CTIME_QUERIED) { > + /* > + * If there is no apparent change, then > + * get a fine-grained timestamp. > + */ > + if ((now | I_CTIME_QUERIED) == ctime) { > + ktime_get_real_ts64(&now_ts); > + now_ts.tv_nsec &= ~I_CTIME_QUERIED; > + } > + } > +out: > + return timestamp_truncate(now_ts, inode); > +} > +EXPORT_SYMBOL(current_time); > + > static int inode_needs_update_time(struct inode *inode) > { > + struct timespec64 now, ts; > int sync_it = 0; > - struct timespec64 now = current_time(inode); > - struct timespec64 ts; > > /* First try to exhaust all avenues to not sync */ > if (IS_NOCMTIME(inode)) > return 0; > > + now = current_time(inode); > + > ts = inode_get_mtime(inode); > if (!timespec64_equal(&ts, &now)) > - sync_it = S_MTIME; > + sync_it |= S_MTIME; > > ts = inode_get_ctime(inode); > if (!timespec64_equal(&ts, &now)) > @@ -2485,25 +2554,6 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode) > } > EXPORT_SYMBOL(timestamp_truncate); > > -/** > - * current_time - Return FS time > - * @inode: inode. > - * > - * Return the current time truncated to the time granularity supported by > - * the fs. > - * > - * Note that inode and inode->sb cannot be NULL. > - * Otherwise, the function warns and returns time without truncation. > - */ > -struct timespec64 current_time(struct inode *inode) > -{ > - struct timespec64 now; > - > - ktime_get_coarse_real_ts64(&now); > - return timestamp_truncate(now, inode); > -} > -EXPORT_SYMBOL(current_time); > - > /** > * inode_get_ctime - fetch the current ctime from the inode > * @inode: inode from which to fetch ctime > @@ -2518,12 +2568,18 @@ struct timespec64 inode_get_ctime(const struct inode *inode) > { > ktime_t ctime = inode->__i_ctime; > > + if (is_mgtime(inode)) > + ctime &= ~I_CTIME_QUERIED; > return ktime_to_timespec64(ctime); > } > EXPORT_SYMBOL(inode_get_ctime); > > struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts) > { > + trace_inode_set_ctime_to_ts(inode, &ts); > + > + if (is_mgtime(inode)) > + ts.tv_nsec &= ~I_CTIME_QUERIED; > inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec); > trace_inode_set_ctime_to_ts(inode, &ts); > return ts; > @@ -2535,14 +2591,74 @@ EXPORT_SYMBOL(inode_set_ctime_to_ts); > * @inode: inode > * > * Set the inode->i_ctime to the current value for the inode. Returns > - * the current value that was assigned to i_ctime. > + * the current value that was assigned to i_ctime. If this is a not > + * multigrain inode, then we just set it to whatever the coarse time is. > + * > + * If it is multigrain, then we first see if the coarse-grained > + * timestamp is distinct from what we have. If so, then we'll just use > + * that. If we have to get a fine-grained timestamp, then do so, and > + * try to swap it into the floor. We accept the new floor value > + * regardless of the outcome of the cmpxchg. After that, we try to > + * swap the new value into __i_ctime. Again, we take the resulting > + * ctime, regardless of the outcome of the swap. > */ > struct timespec64 inode_set_ctime_current(struct inode *inode) > { > - struct timespec64 now = current_time(inode); > + ktime_t ctime, now, cur, floor = smp_load_acquire(&ctime_floor); > + > + now = coarse_ctime(floor); > > - inode_set_ctime_to_ts(inode, now); > - return now; > + /* Just return that if this is not a multigrain fs */ > + if (!is_mgtime(inode)) { > + inode->__i_ctime = now; > + goto out; > + } > + > + /* > + * We only need a fine-grained time if someone has queried it, > + * and the current coarse grained time isn't later than what's > + * already there. > + */ > + ctime = smp_load_acquire(&inode->__i_ctime); > + if ((ctime & I_CTIME_QUERIED) && !ktime_after(now, ctime & ~I_CTIME_QUERIED)) { > + ktime_t old; > + > + /* Get a fine-grained time */ > + now = ktime_get_real() & ~I_CTIME_QUERIED; > + > + /* > + * If the cmpxchg works, we take the new floor value. If > + * not, then that means that someone else changed it after we > + * fetched it but before we got here. That value is just > + * as good, so keep it. > + */ > + old = cmpxchg(&ctime_floor, floor, now); > + trace_ctime_floor_update(inode, floor, now, old); > + if (old != floor) > + now = old; > + } > +retry: > + /* Try to swap the ctime into place. */ > + cur = cmpxchg(&inode->__i_ctime, ctime, now); > + trace_ctime_inode_update(inode, ctime, now, cur); > + > + /* If swap occurred, then we're done */ > + if (cur != ctime) { > + /* > + * Was the change due to someone marking the old ctime QUERIED? > + * If so then retry the swap. This can only happen once since > + * the only way to clear I_CTIME_QUERIED is to stamp the inode > + * with a new ctime. > + */ > + if (!(ctime & I_CTIME_QUERIED) && (ctime | I_CTIME_QUERIED) == cur) { > + ctime = cur; > + goto retry; > + } > + /* Otherwise, take the new ctime */ > + now = cur & ~I_CTIME_QUERIED; > + } > +out: > + return timestamp_truncate(ktime_to_timespec64(now), inode); > } > EXPORT_SYMBOL(inode_set_ctime_current); > > diff --git a/fs/stat.c b/fs/stat.c > index 6f65b3456cad..7e9bd16b553b 100644 > --- a/fs/stat.c > +++ b/fs/stat.c > @@ -22,10 +22,39 @@ > > #include <linux/uaccess.h> > #include <asm/unistd.h> > +#include <trace/events/timestamp.h> > > #include "internal.h" > #include "mount.h" > > +/** > + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED > + * @stat: where to store the resulting values > + * @request_mask: STATX_* values requested > + * @inode: inode from which to grab the c/mtime > + * > + * Given @inode, grab the ctime and mtime out if it and store the result > + * in @stat. When fetching the value, flag it as queried so the next write > + * will ensure a distinct timestamp. > + */ > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode) > +{ > + atomic_long_t *pc = (atomic_long_t *)&inode->__i_ctime; > + > + /* If neither time was requested, then don't report them */ > + if (!(request_mask & (STATX_CTIME|STATX_MTIME))) { > + stat->result_mask &= ~(STATX_CTIME|STATX_MTIME); > + return; > + } > + > + stat->mtime.tv_sec = inode->i_mtime_sec; > + stat->mtime.tv_nsec = inode->i_mtime_nsec; > + stat->ctime = ktime_to_timespec64(atomic_long_fetch_or(I_CTIME_QUERIED, pc) & > + ~I_CTIME_QUERIED); > + trace_fill_mg_cmtime(inode, atomic_long_read(pc)); > +} > +EXPORT_SYMBOL(fill_mg_cmtime); > + > /** > * generic_fillattr - Fill in the basic attributes from the inode struct > * @idmap: idmap of the mount the inode was found from > @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask, > stat->rdev = inode->i_rdev; > stat->size = i_size_read(inode); > stat->atime = inode_get_atime(inode); > - stat->mtime = inode_get_mtime(inode); > - stat->ctime = inode_get_ctime(inode); > + > + if (is_mgtime(inode)) { > + fill_mg_cmtime(stat, request_mask, inode); > + } else { > + stat->ctime = inode_get_ctime(inode); > + stat->mtime = inode_get_mtime(inode); > + } > + > stat->blksize = i_blocksize(inode); > stat->blocks = inode->i_blocks; > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 4b10db12725d..5694cb6c4dc2 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1608,6 +1608,23 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode, > return inode_set_mtime_to_ts(inode, ts); > } > > +/* > + * Multigrain timestamps > + * > + * Conditionally use fine-grained ctime and mtime timestamps when there > + * are users actively observing them via getattr. The primary use-case > + * for this is NFS clients that use the ctime to distinguish between > + * different states of the file, and that are often fooled by multiple > + * operations that occur in the same coarse-grained timer tick. > + * > + * We use the least significant bit of the ktime_t to track the QUERIED > + * flag. This means that filesystems with multigrain timestamps effectively > + * have 2ns resolution for the ctime, even if they advertise 1ns s_time_gran. > + */ > +#define I_CTIME_QUERIED (1LL) > + > +static inline bool is_mgtime(const struct inode *inode); > + > struct timespec64 inode_get_ctime(const struct inode *inode); > struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts); > > @@ -2477,6 +2494,7 @@ struct file_system_type { > #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */ > #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */ > #define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */ > +#define FS_MGTIME 64 /* FS uses multigrain timestamps */ > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ > int (*init_fs_context)(struct fs_context *); > const struct fs_parameter_spec *parameters; > @@ -2500,6 +2518,17 @@ struct file_system_type { > > #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME) > > +/** > + * is_mgtime: is this inode using multigrain timestamps > + * @inode: inode to test for multigrain timestamps > + * > + * Return true if the inode uses multigrain timestamps, false otherwise. > + */ > +static inline bool is_mgtime(const struct inode *inode) > +{ > + return inode->i_sb->s_type->fs_flags & FS_MGTIME; > +} > + > extern struct dentry *mount_bdev(struct file_system_type *fs_type, > int flags, const char *dev_name, void *data, > int (*fill_super)(struct super_block *, void *, int)); > @@ -3234,6 +3263,7 @@ extern void page_put_link(void *); > extern int page_symlink(struct inode *inode, const char *symname, int len); > extern const struct inode_operations page_symlink_inode_operations; > extern void kfree_link(void *); > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode); > void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *); > void generic_fill_statx_attr(struct inode *inode, struct kstat *stat); > extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int); > diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h > index 35ff875d3800..1f71738aa38c 100644 > --- a/include/trace/events/timestamp.h > +++ b/include/trace/events/timestamp.h > @@ -8,6 +8,78 @@ > #include <linux/tracepoint.h> > #include <linux/fs.h> > > +TRACE_EVENT(ctime_floor_update, > + TP_PROTO(struct inode *inode, > + ktime_t old, > + ktime_t new, > + ktime_t cur), > + > + TP_ARGS(inode, old, new, cur), > + > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(ino_t, ino) > + __field(ktime_t, old) > + __field(ktime_t, new) > + __field(ktime_t, cur) > + ), > + > + TP_fast_assign( > + __entry->dev = inode->i_sb->s_dev; > + __entry->ino = inode->i_ino; > + __entry->old = old; > + __entry->new = new; > + __entry->cur = cur; > + ), > + > + TP_printk("ino=%d:%d:%lu old=%llu.%lu new=%llu.%lu cur=%llu.%lu swp=%c", > + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, > + ktime_to_timespec64(__entry->old).tv_sec, > + ktime_to_timespec64(__entry->old).tv_nsec, > + ktime_to_timespec64(__entry->new).tv_sec, > + ktime_to_timespec64(__entry->new).tv_nsec, > + ktime_to_timespec64(__entry->cur).tv_sec, > + ktime_to_timespec64(__entry->cur).tv_nsec, > + (__entry->old == __entry->cur) ? 'Y' : 'N' > + ) > +); > + > +TRACE_EVENT(ctime_inode_update, > + TP_PROTO(struct inode *inode, > + ktime_t old, > + ktime_t new, > + ktime_t cur), > + > + TP_ARGS(inode, old, new, cur), > + > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(ino_t, ino) > + __field(ktime_t, old) > + __field(ktime_t, new) > + __field(ktime_t, cur) > + ), > + > + TP_fast_assign( > + __entry->dev = inode->i_sb->s_dev; > + __entry->ino = inode->i_ino; > + __entry->old = old; > + __entry->new = new; > + __entry->cur = cur; > + ), > + > + TP_printk("ino=%d:%d:%ld old=%llu.%ld new=%llu.%ld cur=%llu.%ld swp=%c", > + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, > + ktime_to_timespec64(__entry->old).tv_sec, > + ktime_to_timespec64(__entry->old).tv_nsec, > + ktime_to_timespec64(__entry->new).tv_sec, > + ktime_to_timespec64(__entry->new).tv_nsec, > + ktime_to_timespec64(__entry->cur).tv_sec, > + ktime_to_timespec64(__entry->cur).tv_nsec, > + (__entry->old == __entry->cur ? 'Y' : 'N') > + ) > +); > + > TRACE_EVENT(inode_needs_update_time, > TP_PROTO(struct inode *inode, > struct timespec64 *now, > @@ -70,6 +142,31 @@ TRACE_EVENT(inode_set_ctime_to_ts, > __entry->ts_sec, __entry->ts_nsec > ) > ); > + > +TRACE_EVENT(fill_mg_cmtime, > + TP_PROTO(struct inode *inode, > + ktime_t ctime), > + > + TP_ARGS(inode, ctime), > + > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(ino_t, ino) > + __field(ktime_t, ctime) > + ), > + > + TP_fast_assign( > + __entry->dev = inode->i_sb->s_dev; > + __entry->ino = inode->i_ino; > + __entry->ctime = ctime; > + ), > + > + TP_printk("ino=%d:%d:%ld ctime=%llu.%lu", > + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, > + ktime_to_timespec64(__entry->ctime).tv_sec, > + ktime_to_timespec64(__entry->ctime).tv_nsec > + ) > +); > #endif /* _TRACE_TIMESTAMP_H */ > > /* This part must be outside protection */ > > -- > 2.45.2 > > -- Chuck Lever