On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > > >
> > > > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > > > bits for actually allocating inodes out of.
> > > > > >
> > > > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > > > that limit is already something we're technically capable of: we're
> > > > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > > > as fsck gets more optimized and online fsck comes.
> > > > >
> > > > > 30 bits would not be enough even today:
> > > > >
> > > > > buczek@done:~$ df -i /amd/done/C/C8024
> > > > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > > >
> > > > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> > >
> > > only 30 bits though. So it is a long way before you use all 32 bits.
> > > How many volumes do you have?
> > >
> > > > >
> > > > > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> > > >
> > > > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > > > a unique identifier; time to start looking at how to extend statx.
> > > >
> > >
> > > 64 should be plenty...
> > >
> > > If you have 32 bits for free allocation, and 7 bits for sharding across
> > > 128 CPUs, then you can allocate many more than 4 billion inodes. Maybe
> > > not the full 500 billion for 39 bits, but if you actually spread the
> > > load over all the shards, then certainly tens of billions.
> > >
> > > If you use 22 bits for volume number and 42 bits for inodes in a volume,
> > > then you can spend 7 on sharding and still have room for 55 of Donald's
> > > filesystems to be allocated by each CPU.
> > >
> > > And if Donald only needs thousands of volumes, not millions, then he
> > > could configure for a whole lot more headroom.
> > >
> > > In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> > > the fs-inode number from one end, and bits from the volume number from
> > > the other end, then you don't need to pre-configure how the 64 bits are
> > > shared.
> > > You record inum-bits and volnum bits in the filesystem metadata, and
> > > increase either as needed. Once the sum hits 64, you start returning
> > > ENOSPC for new files or new volumes.
> > >
> > > There will come a day when 64 bits is not enough for inodes in a single
> > > filesystem. Today is not that day.
> >
> > Except filesystems are growing all the time: that leaves almost no room
> > for growth and then we're back in the world where users had to guess how
> > many inodes they were going to need in their filesystem; and if we put
> > this off now we're just kicking the can down the road until when it
> > becomes really pressing and urgent to solve.
> >
> > No, we need to come up with something better.
> >
> > I was chatting a bit with David Howells on IRC about this, and floated
> > adding the file handle to statx. It looks like there's enough space
> > reserved to make this feasible - probably going with a fixed maximum
> > size of 128-256 bits.
>
> Unless there is room for 128 bytes (1024bits), it cannot be used for
> NFSv4. That would be ... sad.

NFSv4 specs that for the maximum size? That is pretty hefty...

> > Thoughts?
> >
>
> I'm completely in favour of exporting the (full) filehandle through
> statx. (If the application asked for the filehandle, it will expect a
> larger structure to be returned. We don't need to use the currently
> reserved space).
>
> I'm completely in favour of updating user-space tools to use the
> filehandle to check if two handles are for the same file.
>
> I'm not in favour of any filesystem depending on this for correct
> functionality today. As long as the filesystem isn't so large that
> inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> effort to present them both in 64 bits. Depending on the filehandle is a
> good plan for long term growth, not for basic functionality today.

My standing policy in these situations is that I'll do the stopgap/hacky
measure... but not before doing actual, real work on the longterm
solution :)

So if we're all in favor of statx as the real long term solution, how
about we see how far we get with that?
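
For anyone who wants to poke at the numbers in this thread, here is a rough Python sketch of the "fill the 64 bits from both ends" idea, plus the two bits of arithmetic quoted above (the over-1% collision estimate and the "55 of Donald's filesystems" figure). Everything named here (reverse_bits, TwoEndedIds, collision_probability) is made up for illustration and is not bcachefs or VFS code. In particular, storing the volume number bit-reversed from the top is only one possible reading of "bits from the other end", and the collision figure uses the standard birthday approximation for an ideal 64-bit hash. Inputs are assumed to be non-negative.

```python
import errno
import math

WIDTH = 64  # size of the VFS-visible inode number, per the discussion


def reverse_bits(value: int, width: int = WIDTH) -> int:
    """Mirror value within width bits, so bit 0 lands in bit width-1."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


class TwoEndedIds:
    """Hypothetical tracker for the inum-bits / volnum-bits high-water marks.

    Inode numbers grow upward from bit 0; volume numbers are stored
    bit-reversed (an assumption of this sketch) so they grow downward
    from bit 63. Neither width is configured up front.
    """

    def __init__(self) -> None:
        self.inum_bits = 0    # would live in filesystem metadata
        self.volnum_bits = 0  # likewise

    def pack(self, volnum: int, inum: int) -> int:
        need_inum = max(self.inum_bits, inum.bit_length())
        need_vol = max(self.volnum_bits, volnum.bit_length())
        if need_inum + need_vol > WIDTH:
            # The two fields would overlap: refuse, as suggested in the thread.
            raise OSError(errno.ENOSPC, "inum and volnum no longer fit in 64 bits")
        self.inum_bits, self.volnum_bits = need_inum, need_vol
        return reverse_bits(volnum) | inum

    @staticmethod
    def unpack(packed: int, inum_bits: int) -> tuple:
        """Split a packed number using the currently recorded inum_bits."""
        inum = packed & ((1 << inum_bits) - 1)
        volnum = reverse_bits((packed >> inum_bits) << inum_bits)
        return volnum, inum


def collision_probability(n: int, bits: int = 64) -> float:
    """Birthday approximation: chance that n ideal hashes of this width collide."""
    return -math.expm1(-n * (n - 1) / (2 * 2 ** bits))


DONALD_INODES = 618857441  # IUsed from the df -i output quoted above

# 42 bits of inode space minus 7 sharding bits leaves 35 bits per shard.
FILESYSTEMS_PER_SHARD = (1 << (42 - 7)) // DONALD_INODES
```

With these definitions, collision_probability(DONALD_INODES) comes out a little above 0.01 and FILESYSTEMS_PER_SHARD evaluates to 55. Numbers already handed out by pack() keep decoding to the same (volnum, inum) pair after either width grows, which is what makes the up-front split unnecessary in this reading of the scheme.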