On Mon, Jul 29, 2024 at 09:36:01AM -0400, Theodore Ts'o wrote:
> On Mon, Jul 29, 2024 at 12:18:15PM +0200, Mateusz Guzik wrote:
> > 
> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > filesystem? It sounds like utter breakage, with capital 'f'.
> 
> The reality is that there exist file systems which do not return
> unique inode numbers. For example, there are virtiofs implementations
> which pass the inode numbers straight through with a fixed dev_t. If
> you have a large number of packages mounted via iscsi, and those
> packages include shared libraries, then you can have two different
> shared libraries with the same inode number, and then you can watch
> the dynamic linker get Very Confused, and debugging the problem can
> be.... interesting. (Three guesses how I found out about this, and
> the first two don't count. Yes, we figured out a workaround.)
> 
> So that breakage exists already, today.
> 
> For people who don't like this, they can stick to those file systems
> that still guarantee unique inode numbers, at least for local disk
> file systems --- for example, using ext4 and xfs over btrfs and
> bcachefs.

I don't think you can make such a simplistic delineation, because
there's more than one issue at play here. There are at least two
different "is this inode identical" use cases that {st_dev,st_ino} is
being used for.

The first, as Florian described, is to determine if two open fds refer
to the same inode for collision avoidance. This works on traditional
filesystems like ext4 and XFS, but isn't reliable on filesystems with
integrated snapshot/subvolume functionality.

The second is that {dev,ino} is being used to disambiguate paths that
point to hardlinked inodes for the purposes of identifying and
optimising access and replication of shared (i.e. non-unique) file
data. This works on traditional filesystems like ext4, but fails badly
on filesystems that support FICLONERANGE (XFS, btrfs, NFS, CIFS,
bcachefs, etc) because cloned files have unique inodes but non-unique
data.

> However, this is a short-term expedient, and in the long term, we
> will need to guide userspace to use something that is more likely to
> work, such as file handles.

The first case can be done with filehandles - it's a simple resolution
of fds to filehandles and a comparison of the opaque filehandles.
That's the short-term solution because it's the easy one to solve.

However, filehandles do not work for solving the second case.
Hardlinks are basically a mechanism for avoiding data copying within
the same filesystem. i.e. hardlink disambiguation is essentially the
process of detecting which dirents point to the same shared data. We
can detect a hardlinked inode simply by looking at the link count, and
we use the inode number to determine that the dirents point to the
same physical storage.

Applications like tar and rsync are detecting hard links to avoid two
main issues:

- moving the same data multiple times
- causing destination write amplification by storing the same data in
  multiple places

They avoid these by creating a hardlink map of the directory structure
being processed, and then recreating that hardlink map at the
destination. We could use filehandles for that, too, and then we
wouldn't be relying on {dev,ino} for this, either.

However, any application that is using inode number or filehandle
comparisons to detect data uniqueness does not work if other
applications and utilities are using reflink copies rather than
hardlinks for space efficient data copying.

Let's all keep this in mind here: the default behaviour of `cp` is to
use file clones on filesystems that support them over physical data
copies. I have maybe half a dozen hardlinks in most of my local XFS
filesystems, but I have many tens of thousands of cloned files in
those same filesystems.

IOWs, any tool that is using {dev,ino} as a proxy for data uniqueness
is fundamentally deficient on any filesystem that supports file
cloning. Given the potential for badness in replicating filesystems
full of cloned data, it's far more important and higher priority for
such utilities to move away from using {dev,ino} to detect data
uniqueness. Handling cloned data efficiently requires this, and that's
a far better reason for moving away from {dev,ino} based
disambiguation than "oh, it doesn't work on btrfs properly".

Detecting "is the data unique" is not that hard - e.g. we could add a
statx flag to say "this inode has shared data" - and then userspace
can tag that inode as needing data disambiguation before starting to
move data.

However, data disambiguation (i.e. finding what inodes share the data
at which file offset) is a much harder problem. This largely requires
knowledge of the entire layout of the filesystem, and so it's really
only a question the filesystem itself can resolve. We already provide
such an interface for XFS with ioctl(GETFSMAP). It is driven by the
on-disk reverse mapping btrees, and so can quickly answer the "what
{inode,offset,len} tuples share this physical extent" question. The
interface is generic, however, so asking such a question and
determining the answer is .... complex.
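To make that concrete: a minimal GETFSMAP query that reports the
extents the filesystem has marked as shared looks something like the
sketch below. This is untested, it assumes the filesystem actually
maintains reverse mappings (e.g. mkfs.xfs -m rmapbt=1, otherwise
there's no per-inode ownership information to report), and a real
program would loop, advancing the low key past the last record
returned, until it has walked the whole filesystem. See
ioctl_getfsmap(2) for the details.

/*
 * Untested sketch: dump the shared (i.e. reflinked) extents of the
 * filesystem the open fd lives on, via ioctl(FS_IOC_GETFSMAP).
 */
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/ioctl.h>
#include <sys/sysmacros.h>
#include <linux/fsmap.h>

#define NR_RECS         128

int dump_shared_extents(int fd)
{
        struct fsmap_head *head;
        unsigned int i;

        /* calloc zeroes the low key and the must-be-zero reserved fields */
        head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
        if (!head)
                return -1;

        /* Query the whole filesystem: low key zero, high key maximal. */
        head->fmh_count = NR_RECS;
        head->fmh_keys[1].fmr_device = UINT_MAX;
        head->fmh_keys[1].fmr_flags = UINT_MAX;
        head->fmh_keys[1].fmr_physical = ULLONG_MAX;
        head->fmh_keys[1].fmr_owner = ULLONG_MAX;
        head->fmh_keys[1].fmr_offset = ULLONG_MAX;

        if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
                free(head);
                return -1;
        }

        for (i = 0; i < head->fmh_entries; i++) {
                struct fsmap *rec = &head->fmh_recs[i];

                if (!(rec->fmr_flags & FMR_OF_SHARED))
                        continue;
                /* fmr_owner is an inode number unless FMR_OF_SPECIAL_OWNER */
                printf("dev %u:%u owner 0x%llx offset 0x%llx len 0x%llx shared\n",
                       major(rec->fmr_device), minor(rec->fmr_device),
                       (unsigned long long)rec->fmr_owner,
                       (unsigned long long)rec->fmr_offset,
                       (unsigned long long)rec->fmr_length);
        }
        free(head);
        return 0;
}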
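The statx side of this - tagging inodes that need data disambiguation
before we start moving data - would then be trivial for userspace. To
be clear, the STATX_ATTR_SHARED_DATA flag below is entirely made up to
illustrate the proposal; no such flag exists in any kernel today:

/*
 * HYPOTHETICAL: STATX_ATTR_SHARED_DATA does not exist; this is what
 * the userspace check would look like if such a statx flag was added.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

#define STATX_ATTR_SHARED_DATA  0x00800000      /* made-up value */

/* Does this path need data disambiguation before we copy it? */
static int needs_disambiguation(const char *path)
{
        struct statx stx;

        if (statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
                  STATX_BASIC_STATS, &stx) < 0)
                return -1;

        /* Hardlinked inodes are already detectable via the link count. */
        if (stx.stx_nlink > 1)
                return 1;

        /* The missing piece: an attribute for shared/cloned data. */
        return !!(stx.stx_attributes & STATX_ATTR_SHARED_DATA);
}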
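And for completeness, the easy first case - resolving fds to
filehandles and comparing the opaque handles instead of
{st_dev,st_ino} - is just a couple of name_to_handle_at(2) calls.
Again, an untested sketch with minimal error handling; note that
comparing mount IDs is conservative in the face of bind mounts, so a
robust version would compare filesystem IDs from statfs(2) instead:

/* Untested sketch: compare two open fds by opaque filehandle. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

static struct file_handle *fd_to_handle(int fd, int *mount_id)
{
        struct file_handle *fh;

        fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        if (!fh)
                return NULL;
        fh->handle_bytes = MAX_HANDLE_SZ;

        /* AT_EMPTY_PATH: resolve the fd itself, not a path beneath it */
        if (name_to_handle_at(fd, "", fh, mount_id, AT_EMPTY_PATH) < 0) {
                free(fh);
                return NULL;
        }
        return fh;
}

/* 1 if both fds refer to the same inode, 0 if not, -1 on error */
static int same_inode(int fd1, int fd2)
{
        struct file_handle *h1, *h2;
        int m1, m2, ret = -1;

        h1 = fd_to_handle(fd1, &m1);
        h2 = fd_to_handle(fd2, &m2);
        if (!h1 || !h2)
                goto out;

        ret = (m1 == m2 &&
               h1->handle_type == h2->handle_type &&
               h1->handle_bytes == h2->handle_bytes &&
               !memcmp(h1->f_handle, h2->f_handle, h1->handle_bytes));
out:
        free(h1);
        free(h2);
        return ret;
}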
That is our long term challenge: replacing the use of {dev,ino} for
data uniqueness disambiguation. Making the identification of owners of
non-unique/shared data simple for applications to use and fast for
filesystems to resolve will be a challenge.

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx