On Tue, Aug 24, 2021 at 09:22:05AM +1000, NeilBrown wrote:
> On Mon, 23 Aug 2021, Zygo Blaxell wrote:
> ...
> >
> > Subvol IDs are not reusable. They are embedded in shared object ownership
> > metadata, and persist for some time after subvols are deleted.
> ...
> >
> > The cost of _tracking_ free object IDs is trivial compared to the cost
> > of _reusing_ an object ID on btrfs.
>
> One possible approach to these two objections is to decouple inode
> numbers from object ids.

This would be reasonable for subvol IDs (I thought of it earlier in this
thread, but didn't mention it because I wasn't going to be the first to
open that worm can ;). There aren't very many subvol IDs and they're not
used as frequently as inodes, so a lookup table to remap them to smaller
numbers to save st_ino bit-space wouldn't be unreasonably expensive.

If we stop right here and use the [some_zeros:reversed_subvol:inode]
bit-packing scheme you proposed for NFS, that seems like a reasonable
plan. It would have 48 bits of usable inode number space, ~440000 file
creates per second for 20 years with up to 65535 snapshots, the same
number of bits that ZFS has in its inodes.

Once that subvol ID mapping tree exists, it could also map subvol inode
numbers to globally unique numbers. Each tree item would contain a map
of [subvol_inode1..subvol_inode2] that maps the inode numbers in the
subvol into the global inode number space at
[global_inode1..global_inode2]. When a snapshot is created, the snapshot
gets a copy of all the origin subvol's inode ranges, but with newly
allocated base offsets. If the original subvol needs new inodes, it gets
a new chunk from the global inode allocator. If the snapshot subvol needs
new inodes, it gets a different new chunk from the global allocator.
The minimum chunk might be a million inodes or so to avoid having to
allocate new chunks all the time, but not so high to make the code
completely untested (or testers just set the minchunk to 1000 inodes).
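
To make the range-mapping idea above concrete, here is a rough toy model
in Python (not kernel C, just for brevity). Everything in it is an
assumption made up for illustration: the class names, the MIN_CHUNK
value, and the choice to trim copied ranges to their used portion at
snapshot time. None of it corresponds to an existing btrfs structure.

```python
from bisect import bisect_right

MIN_CHUNK = 1000  # hypothetical test-sized chunk; ~1e6 is suggested for real use


class GlobalAllocator:
    """Hands out ranges of global inode numbers and never reuses them."""

    def __init__(self):
        self.next_free = 1

    def alloc(self, count):
        base = self.next_free
        self.next_free += count
        return base


class Subvol:
    """Per-subvol map of local inode ranges onto the global inode space."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.ranges = []     # sorted (local_start, length, global_start)
        self.next_local = 1  # next unused subvol-local inode number

    def _capacity_end(self):
        if not self.ranges:
            return 1
        start, length, _ = self.ranges[-1]
        return start + length

    def create_inode(self):
        # Out of mapped space: take a fresh chunk from the global allocator.
        if self.next_local >= self._capacity_end():
            base = self.allocator.alloc(MIN_CHUNK)
            self.ranges.append((self.next_local, MIN_CHUNK, base))
        local = self.next_local
        self.next_local += 1
        return local

    def to_global(self, local):
        starts = [r[0] for r in self.ranges]
        i = bisect_right(starts, local) - 1
        if i < 0:
            raise KeyError(local)
        start, length, base = self.ranges[i]
        if local >= start + length:
            raise KeyError(local)
        return base + (local - start)

    def snapshot(self):
        # Copy every range, but give each copy a newly allocated global base.
        # Assumption: only the used part of a range is copied, so the
        # snapshot's next create pulls a different new chunk.
        snap = Subvol(self.allocator)
        snap.next_local = self.next_local
        for start, length, _ in self.ranges:
            used = min(length, self.next_local - start)
            if used > 0:
                base = self.allocator.alloc(used)
                snap.ranges.append((start, used, base))
        return snap
```

In this model the same subvol-local inode number maps to different
global numbers in the origin and in each snapshot, and every snapshot
consumes global numbers in proportion to the inodes already in use.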

The question I have (and why I didn't propose this earlier) is whether
this scheme is any real improvement over dividing the subvol:inode space
by bit packing. If you have one subvol that has 3 billion existing inodes
in it, every snapshot of that subvol is going to burn up roughly 2^-32 of
the available globally unique inode numbers. If we burn 3 billion inodes
instead of 4 billion per subvol, it only gets 25% more lifespan for the
filesystem, and the allocation of unique inode spaces and tracking inode
space usage will add cost to every single file creation and snapshot
operation.

If your oldest/biggest subvol only has a million inodes in it, all of
the above is pure cost: you can create billions of snapshots, never
repeat any object IDs, and never worry about running out.

I'd want to see cost/benefit simulations of:

	this plan,

	the simpler but less efficient bit-packing plan,

	'cp -a --reflink' to a new subvol and start over every 20 years
	when inodes run out,

	and online garbage-collection/renumbering schemes that allow users
	to schedule the inode renumbering costs in overnight batches
	instead of on every inode create.

> The inode number becomes just another piece of metadata stored in the
> inode.
> struct btrfs_inode_item has four spare u64s, so we could use one of
> those.
> struct btrfs_dir_item would need to store the inode number too. What
> is location.offset used for? Would a diritem ever point to a non-zero
> offset? Could the 'offset' be used to store the inode number?

Offset is used to identify subvol roots at the moment, but so far that
means only values 0 and UINT64_MAX are used. It seems possible to treat
all other values as inode numbers. Don't quote me on that--I'm not an
expert on this structure.

> This could even be added to existing filesystems I think. It might not
> be easy to re-use inode numbers smaller than the largest at the time the
> extension was added, but newly added inode numbers could be reused after
> they were deleted.

We'd need a structure to track reusable inode numbers and it would have
to be kept up to date to work, so this feature would necessarily come
with an incompat bit. Whether you borrow bits from existing structures
or make extended new structures doesn't matter at that point, though
obviously for something as common as inodes it would be bad to make them
bigger.

Some of the btrfs userspace API uses inode numbers, but unless I missed
something, it could all be converted to use object numbers directly
instead.

> Just a thought...
>
> NeilBrown