Re: replacement i_version counter for xfs

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 30 Jan 2023 12:54:13 +1100

On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote:
> On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote:
> > IIUC the rest of the justification for i_version is that ctime might
> > lack the timestamp granularity to disambiguate sub-timestamp
> > granularity changes, so i_version is needed to bridge that gap.
> > 
> > Given that XFS has nanosecond timestamp resolution in the on-disk
> > format, both i_version and ctime changes are journalled, and
> > ctime/i_version will always change at exactly the same time in the
> > same transactions, there are no inherent sub-timestamp granularity
> > problems with ctime within XFS. Any deficiency in ctime resolution
> > comes solely from the granularity of the VFS inode timestamp
> > functions.
> > 
> > And so if current_time() was to provide fine-grained nanosecond
> > timestamp resolution for exported XFS filesystems (i.e. use
> > ktime_get_real_ts64() conditionally), then it seems to me that the
> > nfsd i_version function becomes completely redundant.
> > 
> > i.e. we are pretty much guaranteed that ctime on exported
> > filesystems will always be different for explicit modifications to
> > the same inode, and hence we can just use ctime as the version
> > change identifier without needing any on-disk format changes at all.
> > 
> > And we can optimise away that overhead when the filesystem is not
> > exported by just using the coarse timestamps because there is no
> > need for sub-timer-tick disambiguation of single file
> > modifications....
> > 
> 
> Ok, so conditional on (maybe) a per fstype flag, and whether the
> filesystem is exported?

Not sure why a per-fstype flag is necessary?

> 
> It's not trivial to tell whether something is exported though. We
> typically only do that sort of checking within nfsd. That involves an
> upcall into mountd, at a minimum.
> 
> I don't think you want to be plumbing calls to exportfs into xfs for
> this. It may be simpler to just add a new on-disk counter and be done
> with it.

I didn't ever expect for XFS to have to be aware of the fact that a
user has exported the filesystem. If "filesystem has been exported"
tracking is required, then we add a flag/counter to the superblock,
and the NFSd subsystem updates the counter/flag when it is informed
that part of the filesystem has been exported/unexported.

The NFSd/export subsystem is pinning filesystems in memory when they
are exported. This is evident by the fact we cannot unmount an
exported filesystem - it has to be unexported before it can be
unmounted. I suspect that it's the ex_path that is stored in the
svc_export structure, because stuff like this is done in the
filehandle code:

static bool is_root_export(struct svc_export *exp)
{
        return exp->ex_path.dentry == exp->ex_path.dentry->d_sb->s_root;
}

static struct super_block *exp_sb(struct svc_export *exp)
{
        return exp->ex_path.dentry->d_sb;
}

i.e. the file handle code assumes the existence of a pinned path
that is the root of the exported directory tree. This points to the
superblock behind the export so that it can do stuff like pull the
device numbers, check sb->s_type->fs_flags fields (e.g
FS_REQUIRES_DEV), etc as needed to encode/decode/verify filehandles.

Somewhere in the code this path must be pinned to the svc_export for
the life of the svc_export (svc_export_init()?), and at that point
the svc_export code could update the struct super_block state
appropriately....

> > Hence it appears to me that with the new i_version specification
> > that there's an avenue out of this problem entirely that is "nfsd
> > needs to use ctime, not i_version". This solution seems generic
> > enough that filesystems with existing on-disk nanosecond timestamp
> > granularity would no longer need explicit on-disk support for the
> > nfsd i_version functionality, yes?
> > 
> 
> Pretty much.
> 
> My understanding has always been that it's not the on-disk format that's
> the limiting factor, but the resolution of in-kernel timestamp sources.
> If ktime_get_real_ts64 has real ns granularity, then that should be
> sufficient (at least for the moment). I'm unclear on the performance
> implications with such a change though.

It's indicated in the documentation:

"These are only useful when called in a fast path and one still
expects better than second accuracy, but can't easily use 'jiffies',
e.g. for inode timestamps.  Skipping the hardware clock access saves
around 100 CPU cycles on most modern machines with a reliable cycle
counter, but up to several microseconds on older hardware with an
external clocksource."

For a modern, high performance machine, 100 CPU cycles for the cycle
counter read is less costly than a pipeline stall due to a single
cacheline miss. For older, slower hardware, the transaction overhead
of a ctime update is already in the order of microseconds, so this
amount of overhead still isn't a show stopper.

As it is, optimising the timestamp read similar to the the iversion
bump only after it has been queried (i.e. equivalent of
inode_maybe_inc_iversion()) would remove most of the additional
overhead for non-NFSd operations. It could be done simply using
an inode state flag rather than hiding the state bit in the
i_version value...

> You had also mentioned a while back that there was some desire for
> femtosecond resolution on timestamps. Does that change the calculus here
> at all? Note that the i_version is not subject to any timestamp
> granularity issues.

[ Reference from 2016 on femptosecond granularity timestamps in
statx():

https://lore.kernel.org/linux-fsdevel/20161117234047.GE28177@dastard/

where I ask:

"So what happens in ten years time when we want to support
femptosecond resolution in the timestamp interface?" ]

Timestamp granularity changes will require an on-disk format change
regardless of anything else. We have no plans to do this again in
the near future - we've just done a revision for >y2038 timestamp
support in the on disk format, and we'd have to do another to
support sub-nanosecond timestamp granularity.  Hence we know exactly
how much time, resources and testing needs to be put into changing
the on-disk timestamp format.  Given that adding a new i_version
field is of similar complexity, we don't want to do either if we can
possibly avoid it.

Looking from a slightly higher perspective, in XFS timestamp updates
are done under exclusive inode locks and so the entire transaction
will need to be done in sub-nanosecond time before we need to worry
about timestamp granularity. It's going to be a long, long time into
the future before that ever happens (if ever!), so I don't think we
need to change the on-disk timestamp granularity to disambiguate
individual inode metadata/data changes any time soon.

> If you want nfsd to start using the ctime for i_version with xfs, then
> you can just turn off the SB_I_IVERSION flag. You will need to do some
> work though to keep your "special" i_version that also counts atime
> updates working once you turn that off. You'll probably want to do that
> anyway though since the semantics for xfs's version counter are
> different from everyone else's.

XFS already uses ->update_time because it is different to other
filesystems in that pure timestamp updates are always updated
transactionally rather than just updating the inode in memory. I'm
not sure there's anything we need to change there right now.

Other things would need to change if we don't set SB_I_IVERSION -
we'd unconditionally bump i_version in xfs_trans_log_inode() rather
than forcing it through inode_maybe_inc_iversion() because we no
longer want the VFS or applications to modify i_version.

But we do still want to tell external parties that i_version is a
usable counter that can be used for data and metadata change
detection, and that stands regardless of the fact that the NFSd
application doesn't want fine-grained change detection anymore.
Hence I think whatever gets done needs to be more nuanced than
"SB_I_VERSION really only means NFSD_CAN_USE_I_VERSION". Perhaps
a second flag that says "SB_I_VERSION_NFSD_COMPATIBLE" to
differentiate between the two cases?

[ Reflection on terminology: how denigrating does it appear when
something is called "special" because it is different to others?
Such terminology would never be allowed if we were talking about
differences between people. ]

> If this is what you choose to do for xfs, then the question becomes: who
> is going to do that timestamp rework?

Depends on what ends up needing to be changed, I'm guessing...

> Note that there are two other lingering issues with i_version. Neither
> of these are xfs-specific, but they may inform the changes you want to
> make there:
> 
> 1/ the ctime and i_version can roll backward on a crash.

Yup, because async transaction engines allow metadata to appear
changed in memory before it is stable on disk. No difference between
i_version or ctime here at all.

> 2/ the ctime and i_version are both currently updated before write data
> is copied to the pagecache. It would be ideal if that were done
> afterward instead. (FWIW, I have some draft patches for btrfs and ext4
> for this, but they need a lot more testing.)

AFAICT, changing the i_version after the page cache has been written
to does not fix the visibility and/or crash ordering issue.  The new
data is published for third party access when the folio the data is
written into is unlocked, not when the entire write operation
completes.  Hence we can start writeback on data that is part of a
write operation before the write operation has completed and updated
i_version.

If we then crash before the write into the page cache completes and
updates i_version, we can now have changed data on disk without a
i_version update in the metadata to even recover from the journal.
Hence with a post-write i_version update, none of the clients will
see that their caches are stale because i_version/ctime hasn't
changed despite the data on disk having changed.

As such, I don't really see what "update i_version after page cache
write completion" actually achieves by itself.  Even "update
i_version before and after write" doesn't entirely close down the
crash hole in i_version, either - it narrows it down, but it does
not remove it...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx