On Mon, Jul 29, 2024 at 09:36:01AM -0400, Theodore Ts'o wrote:
> On Mon, Jul 29, 2024 at 12:18:15PM +0200, Mateusz Guzik wrote:
> > 
> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > filesystem? It sounds like utter breakage, with capital 'f'.
> 
> The reality is that there exist file systems which do not return
> unique inode numbers. For example, there are virtiofs implementations
> which pass the inode numbers straight through with a fixed dev_t. If
> you have a large number of packages mounted via iscsi, and those
> packages include shared libraries, then you can have two different
> shared libraries with the same inode number, and then you can watch
> the dynamic linker get Very Confused, and debugging the problem can
> be.... interesting. (Three guesses how I found out about this, and
> the first two don't count. Yes, we figured out a workaround.)
> 
> So that breakage exists already, today.
> 
> For people who don't like this, they can stick to those file systems
> that still guarantee unique inode numbers, at least for local disk
> file systems --- for example, using ext4 and xfs over btrfs and
> bcachefs.

I don't think you can make such a simplistic delineation, because
there's more than one issue at play here. There are at least two
different "is this inode identical" use cases that {st_dev,st_ino} is
being used for.

The first, as Florian described, is to determine if two open fds refer
to the same inode for collision avoidance. This works on traditional
filesystems like ext4 and XFS, but isn't reliable on filesystems with
integrated snapshot/subvolume functionality.

The second is that {dev,ino} is being used to disambiguate paths that
point to hardlinked inodes for the purposes of identifying and
optimising access and replication of shared (i.e. non-unique) file
data. This works on traditional filesystems like ext4, but fails badly
on filesystems that support FICLONERANGE (XFS, btrfs, NFS, CIFS,
bcachefs, etc) because cloned files have unique inodes but non-unique
data.

> However, this is a short-term expedient, and in the long term, we
> will need to guide userspace to use something that is more likely to
> work, such as file handles.

The first case can be done with filehandles - it's a simple resolution
of fds to filehandles and a comparison of the opaque filehandles.
That's the short-term solution because it's the easy one to solve.

However, filehandles do not work for solving the second case.
Hardlinks are basically a mechanism for avoiding data copying within
the same filesystem. i.e. hardlink disambiguation is essentially the
process of detecting which dirents point to the same shared data. We
can detect a hardlinked inode simply by looking at the link count, and
we use the inode number to determine that the dirents point to the
same physical storage.

Applications like tar and rsync are detecting hard links to avoid two
main issues:

- moving the same data multiple times
- causing destination write amplification by storing the same data in
  multiple places

They avoid these by creating a hardlink map of the directory structure
being processed, and then recreating that hardlink map at the
destination. We could use filehandles for that, too, and then we
wouldn't be relying on {dev,ino} for this, either.

However, any application that is using inode number or filehandle
comparisons to detect data uniqueness does not work if other
applications and utilities are using reflink copies rather than
hardlinks for space efficient data copying.

Let's all keep this in mind here: the default behaviour of `cp` is to
use file clones on filesystems that support them over physical data
copies. I have maybe half a dozen hardlinks in most of my local XFS
filesystems, but I have many tens of thousands of cloned files in
those same filesystems.

IOWs, any tool that is using {dev,ino} as a proxy for data uniqueness
is fundamentally deficient on any filesystem that supports file
cloning. Given the potential for badness in replicating filesystems
full of cloned data, it's far more important and higher priority for
such utilities to move away from using {dev,ino} to detect data
uniqueness. Handling cloned data efficiently requires this, and that's
a far better reason for moving away from {dev,ino} based
disambiguation than "oh, it doesn't work on btrfs properly".

Detecting "is the data unique" is not that hard - e.g. we could add a
statx flag to say "this inode has shared data" - and then userspace
can tag that inode as needing data disambiguation before starting to
move data.

However, data disambiguation (i.e. finding what inodes share the data
at which file offset) is a much harder problem. This largely requires
knowledge of the entire layout of the filesystem, and so it's really
only a question the filesystem itself can resolve. We already provide
such an interface for XFS with ioctl(GETFSMAP). It is driven by the
on-disk reverse mapping btrees, and so can quickly answer the "what
{inode,offset,len} tuples share this physical extent" question. The
interface is generic, however, so asking such a question and
determining the answer is .... complex.
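To make that concrete: a minimal GETFSMAP query that reports the
extents the filesystem has marked as shared looks something like the
sketch below. This is untested, it assumes the filesystem actually
maintains reverse mappings (e.g. mkfs.xfs -m rmapbt=1, otherwise
there's no per-inode ownership information to report), and a real
program would loop, advancing the low key past the last record
returned, until it has walked the whole filesystem. See
ioctl_getfsmap(2) for the details.

/*
 * Untested sketch: dump the shared (i.e. reflinked) extents of the
 * filesystem the open fd lives on, via ioctl(FS_IOC_GETFSMAP).
 */
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/ioctl.h>
#include <sys/sysmacros.h>
#include <linux/fsmap.h>

#define NR_RECS         128

int dump_shared_extents(int fd)
{
        struct fsmap_head *head;
        unsigned int i;

        /* calloc zeroes the low key and the must-be-zero reserved fields */
        head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
        if (!head)
                return -1;

        /* Query the whole filesystem: low key zero, high key maximal. */
        head->fmh_count = NR_RECS;
        head->fmh_keys[1].fmr_device = UINT_MAX;
        head->fmh_keys[1].fmr_flags = UINT_MAX;
        head->fmh_keys[1].fmr_physical = ULLONG_MAX;
        head->fmh_keys[1].fmr_owner = ULLONG_MAX;
        head->fmh_keys[1].fmr_offset = ULLONG_MAX;

        if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
                free(head);
                return -1;
        }

        for (i = 0; i < head->fmh_entries; i++) {
                struct fsmap *rec = &head->fmh_recs[i];

                if (!(rec->fmr_flags & FMR_OF_SHARED))
                        continue;
                /* fmr_owner is an inode number unless FMR_OF_SPECIAL_OWNER */
                printf("dev %u:%u owner 0x%llx offset 0x%llx len 0x%llx shared\n",
                       major(rec->fmr_device), minor(rec->fmr_device),
                       (unsigned long long)rec->fmr_owner,
                       (unsigned long long)rec->fmr_offset,
                       (unsigned long long)rec->fmr_length);
        }
        free(head);
        return 0;
}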
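The statx side of this - tagging inodes that need data disambiguation
before we start moving data - would then be trivial for userspace. To
be clear, the STATX_ATTR_SHARED_DATA flag below is entirely made up to
illustrate the proposal; no such flag exists in any kernel today:

/*
 * HYPOTHETICAL: STATX_ATTR_SHARED_DATA does not exist; this is what
 * the userspace check would look like if such a statx flag was added.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

#define STATX_ATTR_SHARED_DATA  0x00800000      /* made-up value */

/* Does this path need data disambiguation before we copy it? */
static int needs_disambiguation(const char *path)
{
        struct statx stx;

        if (statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
                  STATX_BASIC_STATS, &stx) < 0)
                return -1;

        /* Hardlinked inodes are already detectable via the link count. */
        if (stx.stx_nlink > 1)
                return 1;

        /* The missing piece: an attribute for shared/cloned data. */
        return !!(stx.stx_attributes & STATX_ATTR_SHARED_DATA);
}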
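And for completeness, the easy first case - resolving fds to
filehandles and comparing the opaque handles instead of
{st_dev,st_ino} - is just a couple of name_to_handle_at(2) calls.
Again, an untested sketch with minimal error handling; note that
comparing mount IDs is conservative in the face of bind mounts, so a
robust version would compare filesystem IDs from statfs(2) instead:

/* Untested sketch: compare two open fds by opaque filehandle. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

static struct file_handle *fd_to_handle(int fd, int *mount_id)
{
        struct file_handle *fh;

        fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        if (!fh)
                return NULL;
        fh->handle_bytes = MAX_HANDLE_SZ;

        /* AT_EMPTY_PATH: resolve the fd itself, not a path beneath it */
        if (name_to_handle_at(fd, "", fh, mount_id, AT_EMPTY_PATH) < 0) {
                free(fh);
                return NULL;
        }
        return fh;
}

/* 1 if both fds refer to the same inode, 0 if not, -1 on error */
static int same_inode(int fd1, int fd2)
{
        struct file_handle *h1, *h2;
        int m1, m2, ret = -1;

        h1 = fd_to_handle(fd1, &m1);
        h2 = fd_to_handle(fd2, &m2);
        if (!h1 || !h2)
                goto out;

        ret = (m1 == m2 &&
               h1->handle_type == h2->handle_type &&
               h1->handle_bytes == h2->handle_bytes &&
               !memcmp(h1->f_handle, h2->f_handle, h1->handle_bytes));
out:
        free(h1);
        free(h2);
        return ret;
}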
That is our long term challenge: replacing the use of {dev,ino} for
data uniqueness disambiguation. Making the identification of owners of
non-unique/shared data simple for applications to use and fast for
filesystems to resolve will be a challenge.

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx