Re: [PATCH v4 2/9] fs: add infrastructure for multigrain inode i_m/ctime

Jan Kara <jack@xxxxxxx> · Tue, 23 May 2023 14:46:06 +0200

On Tue 23-05-23 06:40:08, Jeff Layton wrote:
> On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > ctime and mtime after a change. This has the benefit of allowing
> > > filesystems to optimize away a lot metadata updates, down to around 1
> > > per jiffy, even when a file is under heavy writes.
> > > 
> > > Unfortunately, this has always been an issue when we're exporting via
> > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > lot of exported filesystems don't properly support a change attribute
> > > and are subject to the same problems with timestamp granularity. Other
> > > applications have similar issues (e.g backup applications).
> > > 
> > > Switching to always using fine-grained timestamps would improve the
> > > situation, but that becomes rather expensive, as the underlying
> > > filesystem will have to log a lot more metadata updates.
> > > 
> > > What we need is a way to only use fine-grained timestamps when they are
> > > being actively queried.
> > > 
> > > The kernel always stores normalized ctime values, so only the first 30
> > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > ctime must also change.
> > > 
> > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > on the next timestamp update, the kernel can fetch a fine-grained
> > > timestamp instead of the usual coarse-grained one.
> > > 
> > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > 
> > > Later patches will convert individual filesystems over to use it.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > 
> > So there are two things I dislike about this series because I think they
> > are fragile:
> > 
> > 1) If we have a filesystem supporting multigrain ts and someone
> > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > should rename i_ctime to something like __i_ctime and always use accessor
> > function for it.
> > 
> 
> We could do this, but it'll be quite invasive. We'd have to change any
> place that touches i_ctime (and there are a lot of them), even on
> filesystems that are not being converted.

Yes, that's why I suggested Coccinelle to deal with this.

> > 2) As I already commented in a previous version of the series, the scheme
> > with just one flag for both ctime and mtime and flag getting cleared in
> > current_time() relies on the fact that filesystems always do an equivalent
> > of:
> > 
> > 	inode->i_mtime = inode->i_ctime = current_time();
> > 
> > Otherwise we can do coarse grained update where we should have done a fine
> > grained one. Filesystems often update timestamps like this but not
> > universally. Grepping shows some instances where only inode->i_mtime is set
> > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > easy to make and results in subtle issues. I think this would be also
> > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> >
> > I understand this is quite some churn but a very mechanical one that could
> > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > the more robust result.
> 
> AFAICT, under POSIX, you must _always_ set the ctime when you set the
> mtime, but the reverse is not true. That's why keeping the flag in the
> ctime makes sense. If we're updating the mtime, then we necessarily must
> update the ctime.

Yes, but technically speaking:

	inode->i_mtime = current_time();
	inode->i_ctime = current_time();

follows these requirements but is broken with your changes in unobvious
way. And if mtime update is hidden in some function or condition, it is not
even that suboptimal coding pattern.

> > > +}
> > > +
> > > +/**
> > > + * ctime_peek - peek at (but don't query) the ctime
> > > + * @inode: inode to fetch the ctime from
> > > + *
> > > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> > > + * use by internal consumers that don't require a fine-grained update on
> > > + * the next change.
> > > + *
> > > + * This is safe to call regardless of whether the underlying filesystem
> > > + * is using multigrain timestamps.
> > > + */
> > > +static inline struct timespec64 ctime_peek(const struct inode *inode)
> > > +{
> > > +	struct timespec64 ctime;
> > > +
> > > +	ctime.tv_sec = inode->i_ctime.tv_sec;
> > > +	ctime.tv_nsec = ctime_nsec_peek(inode);
> > > +
> > > +	return ctime;
> > > +}
> > 
> > Given this is in a header that gets included in a lot of places, maybe we
> > should call it like inode_ctime_peek() or inode_ctime_get() to reduce
> > chances of a name clash?
> 
> I'd be fine with that, but "ctime" sort of implies inode->i_ctime to me.
> We don't really use that nomenclature elsewhere.

Yes, I don't insist here but we do have 'ctime' e.g. in IPC code
(sem_ctime, shm_ctime, msg_ctime), we have ctime in md layer, ctime(3) is
also a glibc function. So it is not *completely* impossible the clash
happens but we can deal with it when it happens.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR