Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export

"NeilBrown" <neilb@xxxxxxx> · Thu, 02 Sep 2021 14:14:17 +1000

On Thu, 02 Sep 2021, J. Bruce Fields wrote:
> I looked back through a couple threads to try to understand why we
> couldn't do that (on new filesystems, with a mkfs option to choose new
> or old behavior) and still don't understand.  But the threads are long.
> 
> There are objections to a new mount option (which seem obviously wrong;
> this should be a persistent feature of the on-disk filesystem).

I hadn't thought much (if at all) about a persistent filesystem feature
flag.  I'll try that now.

There are two features of interest.  One is completely unique inode
numbers, the other is reporting different st_dev for different
subvolumes.  I think these need to be kept separate, though the second
would depend on the first.  They would be similar to my "inumbits" and
"numdevs" mount options, though with less flexibility.  I think that
they would need strong semantics to be acceptable - "mostly unique"
isn't really acceptable once we are changing the on-disk data.

The "unique inode numbers" bit (UIN) would require that file object-ids
fit in some number of bits (maybe 40) and that subvolume numbers fit in
the remaining bits (24) and would then combine them together for the
inode number.  This could obviously be set at mkfs time.  Could it be
set on an unmounted filesystem?
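
To make the UIN layout concrete, here is a minimal sketch of the packing.
It assumes the subvolume number goes in the high 24 bits and the file
object-id in the low 40 bits; that ordering is my assumption, as is every
name below (combine_ino, split_ino).  None of it is existing btrfs code.

```python
OBJ_BITS = 40      # bits reserved for the file object-id (from the proposal)
SUBVOL_BITS = 24   # remaining bits for the subvolume number

def combine_ino(subvol_id: int, object_id: int) -> int:
    """Pack a subvolume number and object-id into one 64-bit inode number.

    Assumption: subvolume in the high bits, object-id in the low bits.
    """
    if not 0 <= object_id < (1 << OBJ_BITS):
        raise OverflowError("object-id does not fit in %d bits" % OBJ_BITS)
    if not 0 <= subvol_id < (1 << SUBVOL_BITS):
        raise OverflowError("subvolume id does not fit in %d bits" % SUBVOL_BITS)
    return (subvol_id << OBJ_BITS) | object_id

def split_ino(ino: int) -> tuple[int, int]:
    """Recover (subvol_id, object_id) from a combined inode number."""
    return ino >> OBJ_BITS, ino & ((1 << OBJ_BITS) - 1)
```

Because both fields are range-checked, two distinct (subvol, object) pairs
can never produce the same inode number, which is the "strong semantics"
property wanted above.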

The "single-dev" flag (SD) could be toggled any time that UIN was set,
and mkfs would default it on if UIN was selected.

If UIN was in effect, then creating a subvol beyond the permitted max
would have to fail.  24 bits is small enough that we would probably want
a warning of impending doom - maybe at 23 bits?  The current 48 bits
doesn't need that.
Similarly, creating an inode beyond 40 bits would have to fail.  This is
probably more problematic and so might need more warnings.  Do we want a
warning each time any subvol crosses some limit?  If not we would need a
flag for each warning.
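
The "fail at the limit, warn once before it" behaviour could look roughly
like the sketch below.  The IdLimit class and its one-shot "warned" flag
are hypothetical illustrations of the "flag for each warning" idea, not
anything that exists in btrfs.

```python
class IdLimit:
    """Track one id space: hard-fail past max_bits, warn once at warn_bits."""

    def __init__(self, max_bits: int, warn_bits: int):
        self.max_id = (1 << max_bits) - 1   # largest id that still fits
        self.warn_at = 1 << warn_bits       # first id that triggers the warning
        self.warned = False                 # the persistent per-warning flag

    def check(self, new_id: int) -> bool:
        """Return True if a warning should be issued for this allocation.

        Raises OverflowError if new_id does not fit at all.
        """
        if new_id > self.max_id:
            raise OverflowError("id space exhausted")
        if new_id >= self.warn_at and not self.warned:
            self.warned = True
            return True
        return False
```

For subvolumes this would be IdLimit(24, 23).  A per-subvolume object-id
limit would need one such flag per subvolume if the warning is to be
repeated for each subvol that crosses the threshold.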

What should a sysadmin do when they see the warning?  Is 40 bits an
unacceptable limit on the total number of inodes in a subvol, or is it
only a problem because of btrfs' practice of never reusing object-ids?

Backup-and-restore would compact object-ids, but would be a big cost.
Off-line reindexing would be cheaper (does anyone else remember using
"renum" programs with BASIC??).  Online lazy re-indexing might be
possible if the inode number was maintained separately from the
object-id and an atomic "switch which inode number to use" could be done
at mount time.

Setting UIN on an existing filesystem would require checking that only
24 bits are used for subvolumes (easy) and that only 40 were used for
objects in any individual subvolume (presumably that would require
checking all subvolumes, which might take a little while, but shouldn't
take more than a few minutes).
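
As a sketch of that check: assume some offline tool has already collected
the highest object-id in use for each subvolume into a dict.  How that
dict is gathered is left out, and the function name and message strings
are made up for illustration.

```python
def uin_blockers(max_object_id_by_subvol: dict[int, int],
                 obj_bits: int = 40,
                 subvol_bits: int = 24) -> list[tuple[int, str]]:
    """Return (subvol_id, reason) pairs that would prevent enabling UIN.

    An empty list means every subvolume number and object-id fits.
    """
    problems = []
    for subvol_id, max_obj in sorted(max_object_id_by_subvol.items()):
        if subvol_id >= (1 << subvol_bits):
            problems.append((subvol_id, "subvolume id too large"))
        if max_obj >= (1 << obj_bits):
            problems.append((subvol_id, "object id too large"))
    return problems
```

The subvolume test only needs the list of subvolume ids; the object-id
test is the part that requires visiting every subvolume.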

Doing this would break any indexes that might be created over files, and
would probably upset any active NFS mounts, and would likely have other
problems.  So it would need to be a well-documented step with clear
rewards.

An alternative to renumbering would be to maintain file-ids and
subvolume-ids which are separate from the object-id.  Apparently reusing
subvolume object-ids is not possible and reusing file object-ids is
quite costly.  If the file-id were separate from the object-id, these
problems would vanish.

This would require extra space in the inode (there are several reserved
u64s, so that isn't a problem) and space in each directory entry (might
be more of a problem).  It would also require some way to keep track of
used (or unused) id numbers.  This avoids the cost of renumbering, by
spreading it out over every creation.  I suspect the average
inode-creation overhead could be kept quite low, but not quite zero.
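
One simple way to track unused ids, purely as an illustration: hand out
fresh ids from a counter and keep released ids in a min-heap so they are
reused lowest-first.  This is an in-memory toy; an on-disk version would
need a persistent structure, and the class name, the starting id and the
40-bit cap are all assumptions on my part.

```python
import heapq

class FileIdAllocator:
    """Allocate file-ids independently of object-ids, reusing freed ones."""

    def __init__(self, first_id: int = 256, max_bits: int = 40):
        self.next_id = first_id             # next never-used id
        self.max_id = (1 << max_bits) - 1   # largest id that fits
        self.free: list[int] = []           # min-heap of released ids

    def alloc(self) -> int:
        # Prefer the smallest released id so the id space stays compact.
        if self.free:
            return heapq.heappop(self.free)
        if self.next_id > self.max_id:
            raise OverflowError("file-id space exhausted")
        fid = self.next_id
        self.next_id += 1
        return fid

    def release(self, fid: int) -> None:
        heapq.heappush(self.free, fid)
```

Each alloc or release costs O(log n) in the number of currently free ids,
which is the kind of small-but-nonzero per-creation overhead meant above.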

I believe that some code *knows* that the root of any btrfs subvolume
has inode number 256.  systemd seems to use this.  I have no idea what
else might depend on inode numbers in some way.

I suspect that if we tried to roll out a change like this, either almost
no-one would use it (if it wasn't the default), or things would start
breaking (if it was).  I'm not against breaking things, but we need to
be sure there is a solution for fixing them, and I'm certainly not up to
doing that myself.

So yes - I think that using a mkfs option would open up other avenues
for a solution.  There would still be a lot of work to find something
that continues to meet everyone's needs.

The advantage of an nfsd-focussed solution is that we can have working
code today with minimal down-sides.  I'm certainly not prepared to go
digging through btrfs code to determine how to implement a btrfs-only
solution without strong buy-in from btrfs maintainers.

NeilBrown