Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

On Tue, 12 Dec 2023, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > > 
> > > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > > bits for actually allocating inodes out of.
> > > > > 
> > > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > > that limit is already something we're technically capable of: we're
> > > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > > as fsck gets more optimized and online fsck comes.
> > > > 
> > > > 30 bits would not be enough even today:
> > > > 
> > > > buczek@done:~$ df -i /amd/done/C/C8024
> > > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > > 
> > > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> > 
> > only 30 bits though.  So it is a long way before you use all 32 bits.
> > How many volumes do you have?
> > 
> > > > 
> > > > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> > > 
> > > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > > a unique identifier; time to start looking at how to extend statx.
> > > 
> > 
> > 64 should be plenty...
> > 
> > If you have 32 bits for free allocation, and 7 bits for sharding across
> > 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> > not the full 500 billion for 39 bits, but if you actually spread the
> > load over all the shards, then certainly tens of billions.
> > 
> > If you use 22 bits for volume number and 42 bits for inodes in a volume,
> > then you can spend 7 on sharding and still have room for 55 of Donald's
> > filesystems to be allocated by each CPU.
> > 
> > And if Donald only needs thousands of volumes, not millions, then he
> > could configure for a whole lot more headroom.
> > 
> > In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> > the fs-inode number from one end, and bits from the volume number from
> > the other end, then you don't need to pre-configure how the 64 bits are
> > shared.
> > You record inum-bits and volnum bits in the filesystem metadata, and
> > increase either as needed.  Once the sum hits 64, you start returning
> > ENOSPC for new files or new volumes.
> > 
> > There will come a day when 64 bits is not enough for inodes in a single
> > filesystem.  Today is not that day.
> 
> Except filesystems are growing all the time: that leaves almost no room
> for growth and then we're back in the world where users had to guess how
> many inodes they were going to need in their filesystem; and if we put
> this off now we're just kicking the can down the road until when it
> becomes really pressing and urgent to solve.
> 
> No, we need to come up with something better.
> 
> I was chatting a bit with David Howells on IRC about this, and floated
> adding the file handle to statx. It looks like there's enough space
> reserved to make this feasible - probably going with a fixed maximum
> size of 128-256 bits.

Unless there is room for 128 bytes (1024 bits), it cannot be used for
NFSv4.  That would be ... sad.

> 
> Thoughts?
> 

I'm completely in favour of exporting the (full) filehandle through
statx.  (If the application asks for the filehandle, it will expect a
larger structure to be returned, so we don't need to use the currently
reserved space.)

I'm completely in favour of updating user-space tools to use the
filehandle to check if two handles are for the same file.
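
As a rough illustration of what such a user-space check could look like
today, here is a minimal sketch built on the existing name_to_handle_at(2)
call rather than on any statx extension (which does not exist yet).  The
helper name same_file_by_handle() is hypothetical.  It assumes that equal
handles on the same mount mean the same file, and it compares mount IDs,
so two bind mounts of one filesystem would be reported as different.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not an existing API.
 * Returns 1 if both paths yield the same file handle on the same mount,
 * 0 if they differ, and -1 on error (errno is set by the failing call).
 */
int same_file_by_handle(const char *a, const char *b)
{
    struct file_handle *ha, *hb;
    int mnt_a, mnt_b, ret = -1;

    assert(a && b);

    /* MAX_HANDLE_SZ (128 bytes) is the largest handle the kernel returns. */
    ha = malloc(sizeof(*ha) + MAX_HANDLE_SZ);
    hb = malloc(sizeof(*hb) + MAX_HANDLE_SZ);
    if (!ha || !hb)
        goto out;
    ha->handle_bytes = MAX_HANDLE_SZ;
    hb->handle_bytes = MAX_HANDLE_SZ;

    /* Fails with EOPNOTSUPP on filesystems that cannot export handles. */
    if (name_to_handle_at(AT_FDCWD, a, ha, &mnt_a, 0) < 0 ||
        name_to_handle_at(AT_FDCWD, b, hb, &mnt_b, 0) < 0)
        goto out;

    /* Same file only if mount, handle type, length and bytes all match. */
    ret = mnt_a == mnt_b &&
          ha->handle_type == hb->handle_type &&
          ha->handle_bytes == hb->handle_bytes &&
          memcmp(ha->f_handle, hb->f_handle, ha->handle_bytes) == 0;
out:
    free(ha);
    free(hb);
    return ret;
}
```

A tool would fall back to comparing st_dev and st_ino whenever this
returns -1, since not every filesystem supports file handles.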

I'm not in favour of any filesystem depending on this for correct
functionality today.  As long as the filesystem isn't so large that
inum+volnum simply cannot fit in 64 bits, we should make a reasonable
effort to present them both in 64 bits.  Depending on the filehandle is a
good plan for long term growth, not for basic functionality today.
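
To make the "fill from both ends" idea quoted above concrete, here is a
minimal sketch.  It is my reading of the scheme, not code from any
filesystem: the names ino_layout and pack_ino() are hypothetical, and
bit-reversing the volume number is just one assumed way to make it grow
downward from bit 63 while the inode number grows upward from bit 0.
Numbers already handed out stay valid when either width grows.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Reverse the bit order of a 64-bit value (bit 0 <-> bit 63). */
uint64_t reverse64(uint64_t x)
{
    uint64_t r = 0;

    for (int i = 0; i < 64; i++) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Number of bits needed to represent x (0 for x == 0). */
unsigned bit_length(uint64_t x)
{
    unsigned n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Hypothetical per-filesystem metadata: widths in use so far. */
struct ino_layout {
    unsigned inum_bits;
    unsigned volnum_bits;
};

/*
 * Pack (volnum, inum) into one 64-bit number, widening the recorded
 * widths as needed.  Returns 0 on success, or -ENOSPC once the two
 * widths together would no longer fit in 64 bits.
 */
int pack_ino(struct ino_layout *l, uint64_t volnum, uint64_t inum,
             uint64_t *out)
{
    unsigned ib = bit_length(inum), vb = bit_length(volnum);

    assert(l && out);

    if (ib < l->inum_bits)
        ib = l->inum_bits;
    if (vb < l->volnum_bits)
        vb = l->volnum_bits;
    if (ib + vb > 64)
        return -ENOSPC;

    l->inum_bits = ib;
    l->volnum_bits = vb;
    /* inum sits in the low ib bits, reversed volnum in the high vb bits. */
    *out = reverse64(volnum) | inum;
    return 0;
}
```

With this layout, volume 1 and inode 1 pack to 0x8000000000000001, and
nothing has to be decided at mkfs time about how the 64 bits are split.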

Thanks,
NeilBrown




