On Wed, Aug 02, 2023 at 04:54:09PM -0400, Jeff Layton wrote: > On Wed, 2023-08-02 at 21:35 +0200, Jan Kara wrote: > > On Tue 25-07-23 10:58:15, Jeff Layton wrote: > > > The VFS always uses coarse-grained timestamps when updating the ctime > > > and mtime after a change. This has the benefit of allowing filesystems > > > to optimize away a lot metadata updates, down to around 1 per jiffy, > > > even when a file is under heavy writes. > > > > > > Unfortunately, this has always been an issue when we're exporting via > > > NFSv3, which relies on timestamps to validate caches. A lot of changes > > > can happen in a jiffy, so timestamps aren't sufficient to help the > > > client decide to invalidate the cache. Even with NFSv4, a lot of > > > exported filesystems don't properly support a change attribute and are > > > subject to the same problems with timestamp granularity. Other > > > applications have similar issues with timestamps (e.g backup > > > applications). > > > > > > If we were to always use fine-grained timestamps, that would improve the > > > situation, but that becomes rather expensive, as the underlying > > > filesystem would have to log a lot more metadata updates. > > > > > > What we need is a way to only use fine-grained timestamps when they are > > > being actively queried. > > > > > > POSIX generally mandates that when the the mtime changes, the ctime must > > > also change. The kernel always stores normalized ctime values, so only > > > the first 30 bits of the tv_nsec field are ever used. > > > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > > has queried the inode for the mtime or ctime. When this flag is set, > > > on the next mtime or ctime update, the kernel will fetch a fine-grained > > > timestamp instead of the usual coarse-grained one. > > > > > > Filesytems can opt into this behavior by setting the FS_MGTIME flag in > > > the fstype. Filesystems that don't set this flag will continue to use > > > coarse-grained timestamps. > > > > > > Later patches will convert individual filesystems to use the new > > > infrastructure. > > > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > > > --- > > > fs/inode.c | 98 ++++++++++++++++++++++++++++++++++++++---------------- > > > fs/stat.c | 41 +++++++++++++++++++++-- > > > include/linux/fs.h | 45 +++++++++++++++++++++++-- > > > 3 files changed, 151 insertions(+), 33 deletions(-) > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > index d4ab92233062..369621e7faf5 100644 > > > --- a/fs/inode.c > > > +++ b/fs/inode.c > > > @@ -1919,6 +1919,21 @@ int inode_update_time(struct inode *inode, struct timespec64 *time, int flags) > > > } > > > EXPORT_SYMBOL(inode_update_time); > > > > > > +/** > > > + * current_coarse_time - Return FS time > > > + * @inode: inode. > > > + * > > > + * Return the current coarse-grained time truncated to the time > > > + * granularity supported by the fs. > > > + */ > > > +static struct timespec64 current_coarse_time(struct inode *inode) > > > +{ > > > + struct timespec64 now; > > > + > > > + ktime_get_coarse_real_ts64(&now); > > > + return timestamp_truncate(now, inode); > > > +} > > > + > > > /** > > > * atime_needs_update - update the access time > > > * @path: the &struct path to update > > > @@ -1952,7 +1967,7 @@ bool atime_needs_update(const struct path *path, struct inode *inode) > > > if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)) > > > return false; > > > > > > - now = current_time(inode); > > > + now = current_coarse_time(inode); > > > > > > if (!relatime_need_update(mnt, inode, now)) > > > return false; > > > @@ -1986,7 +2001,7 @@ void touch_atime(const struct path *path) > > > * We may also fail on filesystems that have the ability to make parts > > > * of the fs read only, e.g. subvolumes in Btrfs. > > > */ > > > - now = current_time(inode); > > > + now = current_coarse_time(inode); > > > inode_update_time(inode, &now, S_ATIME); > > > __mnt_drop_write(mnt); > > > skip_update: > > > > There are also calls in fs/smb/client/file.c:cifs_readpage_worker() and in > > fs/ocfs2/file.c:ocfs2_update_inode_atime() that should probably use > > current_coarse_time() to avoid needless querying of fine grained > > timestamps. But see below... > > > > Technically, they already devolve to current_coarse_time anyway, but > changing them would allow them to skip the fstype flag check, but I like > your idea below better anyway. > > > > @@ -2072,6 +2087,56 @@ int file_remove_privs(struct file *file) > > > } > > > EXPORT_SYMBOL(file_remove_privs); > > > > > > +/** > > > + * current_mgtime - Return FS time (possibly fine-grained) > > > + * @inode: inode. > > > + * > > > + * Return the current time truncated to the time granularity supported by > > > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged > > > + * as having been QUERIED, get a fine-grained timestamp. > > > + */ > > > +static struct timespec64 current_mgtime(struct inode *inode) > > > +{ > > > + struct timespec64 now; > > > + atomic_long_t *pnsec = (atomic_long_t *)&inode->__i_ctime.tv_nsec; > > > + long nsec = atomic_long_read(pnsec); > > > + > > > + if (nsec & I_CTIME_QUERIED) { > > > + ktime_get_real_ts64(&now); > > > + } else { > > > + struct timespec64 ctime; > > > + > > > + ktime_get_coarse_real_ts64(&now); > > > + > > > + /* > > > + * If we've recently fetched a fine-grained timestamp > > > + * then the coarse-grained one may still be earlier than the > > > + * existing one. Just keep the existing ctime if so. > > > + */ > > > + ctime = inode_get_ctime(inode); > > > + if (timespec64_compare(&ctime, &now) > 0) > > > + now = ctime; > > > + } > > > + > > > + return timestamp_truncate(now, inode); > > > +} > > > + > > > +/** > > > + * current_time - Return timestamp suitable for ctime update > > > + * @inode: inode to eventually be updated > > > + * > > > + * Return the current time, which is usually coarse-grained but may be fine > > > + * grained if the filesystem uses multigrain timestamps and the existing > > > + * ctime was queried since the last update. > > > + */ > > > +struct timespec64 current_time(struct inode *inode) > > > +{ > > > + if (is_mgtime(inode)) > > > + return current_mgtime(inode); > > > + return current_coarse_time(inode); > > > +} > > > +EXPORT_SYMBOL(current_time); > > > + > > > > So if you modify current_time() to handle multigrain timestamps the code > > will be still racy. In particular fill_mg_cmtime() can race with > > inode_set_ctime_current() like: > > > > fill_mg_cmtime() inode_set_ctime_current() > > stat->mtime = inode->i_mtime; > > stat->ctime.tv_sec = inode->__i_ctime.tv_sec; > > now = current_time(); > > /* fetches coarse > > * grained timestamp */ > > stat->ctime.tv_nsec = atomic_long_fetch_or(I_CTIME_QUERIED, pnsec) & > > ~I_CTIME_QUERIED; > > inode_set_ctime(inode, now.tv_sec, now.tv_nsec); > > > > and the information about a need for finegrained timestamp update gets > > lost. So what I'd propose is to leave current_time() alone (just always > > reporting coarse grained timestamps) and put all the magic into > > inode_set_ctime_current() only. There we need something like: > > > > struct timespec64 inode_set_ctime_current(struct inode *inode) > > { > > ... variables ... > > > > nsec = READ_ONCE(inode->__i_ctime.tv_nsec); > > if (!(nsec & I_CTIME_QUERIED)) { > > now = current_time(inode); > > > > if (!is_gmtime(inode)) { > > inode_set_ctime_to_ts(inode, now); > > } else { > > /* > > * If we've recently fetched a fine-grained > > * timestamp then the coarse-grained one may still > > * be earlier than the existing one. Just keep the > > * existing ctime if so. > > */ > > ctime = inode_get_ctime(inode); > > if (timespec64_compare(&ctime, &now) > 0) > > now = ctime; > > > > /* > > * Ctime updates are generally protected by inode > > * lock but we could have raced with setting of > > * I_CTIME_QUERIED flag. > > */ > > if (cmpxchg(&inode->__i_ctime.tv_nsec, nsec, > > now.tv_nsec) != nsec) > > goto fine_grained; > > inode->__i_ctime.tv_sec = now.tv_sec; > > } > > return now; > > } > > fine_grained: > > ktime_get_real_ts64(&now); > > inode_set_ctime_to_ts(inode, now); > > > > return now; > > } > > > > Honza > > > > This is a great idea. I'll rework the series along the lines you > suggest. That also answers my earlier question to Christian: > > I'll just resend the whole series (it's not very big anyway), and I'll > include the fill_mg_cmtime prototype change. Ok, sound good!