Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > > > 
> > > > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > > > bits for actually allocating inodes out of.
> > > > > > 
> > > > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > > > that limit is already something we're technically capable of: we're
> > > > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > > > as fsck gets more optimized and online fsck comes.
> > > > > 
> > > > > 30 bits would not be enough even today:
> > > > > 
> > > > > buczek@done:~$ df -i /amd/done/C/C8024
> > > > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > > > 
> > > > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> > > 
> > > only 30 bits though.  So it is a long way before you use all 32 bits.
> > > How many volumes do you have?
> > > 
> > > > > 
> > > > > And if the idea of producing unique inode numbers by hashing the filehandle into 64 bits is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bits is already over 1% [1].
> > > > 
> > > > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > > > a unique identifier; time to start looking at how to extend statx.
> > > > 
> > > 
> > > 64 should be plenty...
> > > 
> > > If you have 32 bits for free allocation, and 7 bits for sharding across
> > > 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> > > not the full 500 billion for 39 bits, but if you actually spread the
> > > load over all the shards, then certainly tens of billions.
> > > 
> > > If you use 22 bits for volume number and 42 bits for inodes in a volume,
> > > then you can spend 7 on sharding and still have room for 55 of Donald's
> > > filesystems to be allocated by each CPU.
> > > 
> > > And if Donald only needs thousands of volumes, not millions, then he
> > > could configure for a whole lot more headroom.
> > > 
> > > In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> > > the fs-inode number from one end, and bits from the volume number from
> > > the other end, then you don't need to pre-configure how the 64 bits are
> > > shared.
> > > You record inum-bits and volnum-bits in the filesystem metadata, and
> > > increase either as needed.  Once the sum hits 64, you start returning
> > > ENOSPC for new files or new volumes.
> > > 
> > > There will come a day when 64 bits is not enough for inodes in a single
> > > filesystem.  Today is not that day.
> > 
> > Except filesystems are growing all the time: that leaves almost no room
> > for growth and then we're back in the world where users had to guess how
> > many inodes they were going to need in their filesystem; and if we put
> > this off now we're just kicking the can down the road until when it
> > becomes really pressing and urgent to solve.
> > 
> > No, we need to come up with something better.
> > 
> > I was chatting a bit with David Howells on IRC about this, and floated
> > adding the file handle to statx. It looks like there's enough space
> > reserved to make this feasible - probably going with a fixed maximum
> > size of 128-256 bits.
> 
> Unless there is room for 128 bytes (1024 bits), it cannot be used for
> NFSv4.  That would be ... sad.

NFSv4 specifies that as the maximum size? That is pretty hefty...

> > Thoughts?
> > 
> 
> I'm completely in favour of exporting the (full) filehandle through
> statx. (If the application asked for the filehandle, it will expect a
> larger structure to be returned.  We don't need to use the currently
> reserved space).
> 
> I'm completely in favour of updating user-space tools to use the
> filehandle to check if two handles are for the same file.
> 
> I'm not in favour of any filesystem depending on this for correct
> functionality today.  As long as the filesystem isn't so large that
> inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> effort to present them both in 64 bits.  Depending on the filehandle is a
> good plan for long term growth, not for basic functionality today.
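
(Annotation, not part of the original message: a minimal sketch of the
"fill from both ends" packing described above. Placing the volume number
at the top by bit-reversing it is an assumption about how "from the other
end" would be done; the names InoSpace, pack, unpack and reverse_bits are
hypothetical and do not come from any filesystem.)

```python
import errno

WORD = 64  # width of the VFS inode number


def reverse_bits(value: int, width: int = WORD) -> int:
    """Mirror the low `width` bits of value (bit 0 becomes bit width-1)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


class InoSpace:
    """Tracks how many bits inode and volume numbers currently occupy."""

    def __init__(self) -> None:
        self.inum_bits = 0    # would be recorded in filesystem metadata
        self.volnum_bits = 0  # likewise

    def pack(self, volnum: int, inum: int) -> int:
        # Grow either field as needed; refuse once the sum exceeds 64.
        need_inum = max(self.inum_bits, inum.bit_length())
        need_vol = max(self.volnum_bits, volnum.bit_length())
        if need_inum + need_vol > WORD:
            raise OSError(errno.ENOSPC, "inum bits + volnum bits exceed 64")
        self.inum_bits, self.volnum_bits = need_inum, need_vol
        # inum fills from bit 0 upward, reversed volnum from bit 63 downward.
        return reverse_bits(volnum) | inum

    def unpack(self, ino: int) -> tuple[int, int]:
        inum_mask = (1 << self.inum_bits) - 1
        inum = ino & inum_mask
        volnum = reverse_bits(ino & ~inum_mask)
        return volnum, inum


space = InoSpace()
ino = space.pack(volnum=5, inum=618857441)
print(hex(ino), space.unpack(ino))
```

Numbers packed earlier stay valid when either field grows, because the two
fields never overlap as long as the sum of the widths stays at or below 64.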

My standing policy in these situations is that I'll do the stopgap/hacky
measure... but not before doing actual, real work on the longterm
solution :)

So if we're all in favor of statx as the real long term solution, how
about we see how far we get with that?
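
(Annotation, not part of the original message: the "over 1%" collision
figure quoted earlier in the thread can be checked with the usual
birthday-bound approximation. This is a rough sketch that assumes an
ideal, uniformly distributed hash; the function name
collision_probability is hypothetical. The 128-bit case is included only
for comparison with the smaller fixed size floated above.)

```python
import math


def collision_probability(n_objects: int, hash_bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    pairs = n_objects * (n_objects - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2.0**hash_bits)


n = 618_857_441  # IUsed from the df -i output quoted in the thread
p64 = collision_probability(n, 64)
p128 = collision_probability(n, 128)
print(f"64-bit hash:  {p64:.2%}")
print(f"128-bit hash: {p128:.1e}")
```

With the inode count from the quoted df output this gives a little over 1%
for a 64-bit hash, in line with the figure cited in the thread, and a
negligible probability for 128 bits.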



