Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Mon, 2 Aug 2021 18:14:34 -0400

On Tue, Aug 03, 2021 at 07:59:30AM +1000, NeilBrown wrote:
> On Tue, 03 Aug 2021, J. Bruce Fields wrote:
> > On Tue, Aug 03, 2021 at 07:10:44AM +1000, NeilBrown wrote:
> > > On Mon, 02 Aug 2021, J. Bruce Fields wrote:
> > > > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > > > > For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> > > > > the inode should become (file.objectid ^ swab64(root.objectid)).  This
> > > > > will provide numbers that are unique until you get very large subvols,
> > > > > and very many subvols.
> > > > 
> > > > If you snapshot a filesystem, I'd expect, at least by default, that
> > > > inodes in the snapshot to stay the same as in the snapshotted
> > > > filesystem.
> > > 
> > > As I said: we need to challenge and revise user-space (and meat-space)
> > > expectations. 
> > 
> > The example that came to mind is people that export a snapshot, then
> > replace it with an updated snapshot, and expect that to be transparent
> > to clients.
> > 
> > Our client will error out with ESTALE if it notices an inode number
> > changed out from under it.
> 
> Will it?

See fs/nfs/inode.c:nfs_check_inode_attributes():

	if (nfsi->fileid != fattr->fileid) {
                /* Is this perhaps the mounted-on fileid? */
                if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
                    nfsi->fileid == fattr->mounted_on_fileid)
                        return 0;
                return -ESTALE;
        }

--b.

> If the inode number changed, then the filehandle would change.
> Unless the filesystem were exported with subtreecheck, the old filehandle
> would continue to work (unless the old snapshot was deleted).  File-name
> lookups from the root would find new files...
> 
> "replace with an updated snapshot" is no different from "replace with an
> updated directory tree".  If you delete the old tree, then
> currently-open files will break.  If you don't you get a reasonably
> clean transition.
> 
> > 
> > I don't know if there are other such cases.  It seems like surprising
> > behavior to me, though.
> 
> If you refuse to risk breaking anything, then you cannot make progress.
> Providing people can choose when things break, and have advanced
> warning, they often cope remarkable well.
> 
> Thanks,
> NeilBrown
> 
> 
> > 
> > --b.
> > 
> > > In btrfs, you DO NOT snapshot a FILESYSTEM.  Rather, you effectively
> > > create a 'reflink' for a subtree (only works on subtrees that have been
> > > correctly created with the poorly named "btrfs subvolume" command).
> > > 
> > > As with any reflink, the original has the same inode number that it did
> > > before, the new version has a different inode number (though in current
> > > BTRFS, half of the inode number is hidden from user-space, so it looks
> > > like the inode number hasn't changed).
> > 
> >