Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 13 Dec 2023 09:10:32 +1100

On Tue, Dec 12, 2023 at 04:23:06PM -0500, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 07:48:16AM +1100, Dave Chinner wrote:
> > On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> > > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > > > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > > > the middle of the room?
> > > > 
> > > > I mean, we already have name_to_handle_at() for userspace to get a
> > > > unique, opaque, filesystem defined file handle for any given file.
> > > > It's the same filehandle that filesystems hand to the nfsd so nfs
> > > > clients can uniquely identify the file they are asking the nfsd to
> > > > operate on.
> > > > 
> > > > The contents of these filehandles is entirely defined by the file
> > > > system and completely opaque to the user. The only thing that
> > > > parses the internal contents of the handle is the filesystem itself.
> > > > Therefore, as long as the fs encodes the information it needs into the
> > > > handle to determine what subvol/snapshot the inode belongs to when
> > > > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > > > nothing else needs to care how it is encoded.
> > > > 
> > > > So can someone please explain to me why we need to try to re-invent
> > > > a generic filehandle concept in statx when we already have a
> > > > have working and widely supported user API that provides exactly
> > > > this functionality?
> > > 
> > > Definitely should be part of the discussion :)
> > > 
> > > But I think it _does_ need to be in statx; because:
> > >  - we've determined that 64 bit ino_t just isn't a future proof
> > >    interface, we're having real problems with it today
> > >  - statx is _the_ standard, future proofed interface for getting inode
> > >    attributes
> > 
> > No, it most definitely isn't, and statx was never intended as a
> > dumping ground for anything and everything inode related. e.g. Any
> > inode attribute that can be modified needs to use a different
> > interface - one that has a corresponding "set" operation.
> 
> And here I thought the whole point of statx was to be an extensible way
> to request any sort of inode attribute.

Within reason. When the size of a single attribute is dynamically
sized, can double the size of the statx structure and there's an
existing syscall to retrieve that information from the filesystem,
it makes no sense *at all* to duplicate that information in statx().

statx() is for new attributes we don't have any other way of exposing to
userspace easily. it is designed for fixed size attributes and
flags, it is not designed for dynamically sized information to be
embedded in struct statx.

Filehandles are not new, and we already have APIs that expose them
to userspace that handle the dynamic size of filehandles just fine.
This functionality simply does not belong in statx() at all.

> > >  - therefore, if we want userspace programmers to be using filehandles,
> > >    instead of inode numbers, so there code isn't broken, we need to be
> > >    providing interfaces that guide them in that direction.
> > 
> > We already have a filehandle interface they can use for this
> > purpose. It is already used by some userspace applications for this
> > purpose.
> > 
> > Anything new API function do with statx() will require application
> > changes, and the vast majority of applications aren't using statx()
> > directly - they are using stat() which glibc wraps to statx()
> > internally. So they are going to need a change of API, anyway.
> > 
> > So, fundamentally, there is a change of API for most applications
> > that need to do thorough inode uniqueness checks regardless of
> > anything else. They can do this right now - just continue using
> > stat() as they do right now, and then use name_to_filehandle_at()
> > for uniqueness checks.
> > 
> > > Even assuming we can update all the documentation to say "filehandles
> > > are the correct way to test inode uniqueness", you know at least half of
> > > programmers will stick to stx_ino instead of the filehandle if the
> > > filehandle is an extra syscall.
> > 
> > Your argument is "programmers suck so we must design for the
> > lowest common denominator". That's an -awful- way to design APIs.
> 
> No, I'm saying if the old way doing things no longer works, we ought to
> make the new future proofed way as ergonomic and easy to use as the old
> way was - else it won't get used.
> 
> At the _very_ least we need to add a flag to statx for "inode number is
> unreliable for uniqueness checks".
> 
> bcachefs could leave this off until the first snapshot has been taken.
> 
> But even with that option, I think we ought to be telling anyone doing
> uniqueness checks to use the filehandle, because it includes i_generation.
> 
> > Further, this "programmers suck" design comes at a cost to every
> > statx() call that does not need filehandles. That's the vast
> > majority of statx() calls that are made on a system. Why should we
> > slow down statx() for all users when so few applications actually
> > need uniqueness and they can take the cost of robust uniqueness
> > tests with an extra syscall entirely themselves?
> 
> For any local filesystem the filehandle is going to be the inode
> generation number, the subvolume ID (if applicable), and the inode
> number. That's 16 bytes on bcachefs, and if we make an attempt to
> standardize how this works across filesystems we should be able to do it
> without adding a new indirect function call to the statx() path. That
> sounds pretty negligable to me.

Filehandle encoding is already standardised.

exportfs_encode_fh() does everything needed already, including
providing the default EXPORT_FH_FID behaviour
(exportfs_encode_ino64_fid()) for filesystems that don't otherwise
support filehandles.

e.g. this will call into bcachefs and run bch2_encode_fh() to
calculate the filehandle for bcachefs.  It will call into btrfs
and run btrfs_encode_fh() to encode the file handle. The
infrastructure for encoding filehandles in a consistent, common
manner across all filesystems in the kernel is already present.

What you are suggesting is that we now duplicate filehandle encoding
into every filesystem's statx() implementation.  That's a bad
trade-off from a maintenance, testing and consistency POV because
now we end up with lots of individual, filehandle encoding
implementations in addition to the generic filehandle
infrastructure that we all have to test and validate.

> The syscall overhead isn't going to be negligable when an application
> has to do lots of scanning to find the files its interested - rsync.

Similarly, the extra syscall overhead for calling
name_to_handle_at() will be neglible for applications that need
robust inode uniqueness detection....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx