Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)


 



On Tue, 12 Dec 2023, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > > > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > > > > 
> > > > > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > > > > bits for actually allocating inodes out of.
> > > > > > > 
> > > > > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > > > > that limit is already something we're technically capable of: we're
> > > > > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > > > > as fsck gets more optimized and online fsck comes.
> > > > > > 
> > > > > > 30 bits would not be enough even today:
> > > > > > 
> > > > > > buczek@done:~$ df -i /amd/done/C/C8024
> > > > > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > > > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > > > > 
> > > > > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> > > > 
> > > > only 30 bits though.  So it is a long way before you use all 32 bits.
> > > > How many volumes do you have?
> > > > 
> > > > > > 
> > > > > > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> > > > > 
> > > > > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > > > > a unique identifier; time to start looking at how to extend statx.
> > > > > 
> > > > 
> > > > 64 should be plenty...
> > > > 
> > > > If you have 32 bits for free allocation, and 7 bits for sharding across
> > > > 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> > > > not the full 500 billion for 39 bits, but if you actually spread the
> > > > load over all the shards, then certainly tens of billions.
> > > > 
> > > > If you use 22 bits for volume number and 42 bits for inodes in a volume,
> > > > then you can spend 7 on sharding and still have room for 55 of Donald's
> > > > filesystems to be allocated by each CPU.
> > > > 
> > > > And if Donald only needs thousands of volumes, not millions, then he
> > > > could configure for a whole lot more headroom.
> > > > 
> > > > In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> > > > the fs-inode number from one end, and bits from the volume number from
> > > > the other end, then you don't need to pre-configure how the 64 bits are
> > > > shared.
> > > > You record inum-bits and volnum bits in the filesystem metadata, and
> > > > increase either as needed.  Once the sum hits 64, you start returning
> > > > ENOSPC for new files or new volumes.
> > > > 
> > > > There will come a day when 64 bits is not enough for inodes in a single
> > > > filesystem.  Today is not that day.
> > > 
> > > Except filesystems are growing all the time: that leaves almost no room
> > > for growth and then we're back in the world where users had to guess how
> > > many inodes they were going to need in their filesystem; and if we put
> > > this off now we're just kicking the can down the road until when it
> > > becomes really pressing and urgent to solve.
> > > 
> > > No, we need to come up with something better.
> > > 
> > > I was chatting a bit with David Howells on IRC about this, and floated
> > > adding the file handle to statx. It looks like there's enough space
> > > reserved to make this feasible - probably going with a fixed maximum
> > > size of 128-256 bits.
> > 
> > Unless there is room for 128 bytes (1024bits), it cannot be used for
> > NFSv4.  That would be ... sad.
> 
> NFSv4 specs that for the maximum size? That is pretty hefty...

It is - but it needs room to identify the filesystem, and it needs to be
stable across time.  That is more than a local filesystem needs.

NFSv2 allowed 32 bytes, which is enough for a 16-byte filesystem uuid, an
8-byte inum and an 8-byte generation number.  But only just.
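
To make the arithmetic concrete, here is a minimal sketch of such a handle.
The struct and the helper fh_spare_bytes() are made up for illustration and
are not the format Linux nfsd actually uses; only the three size limits come
from the protocol versions discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Maximum file handle sizes in bytes for NFSv2, NFSv3 and NFSv4. */
#define NFS2_FHSIZE  32
#define NFS3_FHSIZE  64
#define NFS4_FHSIZE 128

/* Hypothetical handle layout for illustration; not nfsd's real format. */
struct fh_sketch {
    uint8_t  fs_uuid[16];   /* identifies the filesystem */
    uint64_t inum;          /* inode number within that filesystem */
    uint64_t generation;    /* distinguishes reuse of an inode number */
};

/* The three fields fill an NFSv2 handle exactly: 16 + 8 + 8 == 32. */
_Static_assert(sizeof(struct fh_sketch) == NFS2_FHSIZE,
               "sketch must fill an NFSv2 handle exactly");

/* Bytes left for anything else (e.g. a volume id) under a given limit. */
static inline size_t fh_spare_bytes(size_t max_fhsize)
{
    return max_fhsize - sizeof(struct fh_sketch);
}
```

With this layout there is nothing to spare in an NFSv2 handle, while the
NFSv3 and NFSv4 limits leave room for extra identifying data.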

NFSv3 allowed 64 bytes which was likely plenty for (nearly?) every
situation.

NFSv4 doubled it again because ... who knows.  "Why not", I guess.
Linux nfsd typically uses 20 or 28 bytes plus whatever the filesystem
wants. (28 when the export point is not the root of the filesystem).
I suspect this always fits within an NFSv3 handle except when
re-exporting an NFS filesystem.  NFS re-export is an interesting case...


> 
> > > Thoughts?
> > > 
> > 
> > I'm completely in favour of exporting the (full) filehandle through
> > statx. (If the application asked for the filehandle, it will expect a
> > larger structure to be returned.  We don't need to use the currently
> > reserved space).
> > 
> > I'm completely in favour of updating user-space tools to use the
> > filehandle to check if two handles are for the same file.
> > 
> > I'm not in favour of any filesystem depending on this for correct
> > functionality today.  As long as the filesystem isn't so large that
> > inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> > effort to present them both in 64 bits.  Depending on the filehandle is a
> > good plan for long term growth, not for basic functionality today.
> 
> My standing policy in these situations is that I'll do the stopgap/hacky
> measure... but not before doing actual, real work on the longterm
> solution :)

Eminently sensible.

> 
> So if we're all in favor of statx as the real long term solution, how
> about we see how far we get with that?
> 

I suggest:

 STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
                              same inode number

 __u64 stx_vol        Volume identifier.  Two files with the same stx_vol
                      and stx_ino MUST be the same file.  The exact
                      meaning of volumes is filesystem-specific.

 STATX_VOL            Want stx_vol.

 __u8 stx_handle_len  Length of stx_handle, if present.
 __u8 stx_handle[128] Unique stable identifier for this file.  Will
                      NEVER be reused for a different file.
                      This appears AFTER __statx_pad2, beyond
                      the current 'struct statx'.

 STATX_HANDLE         Want stx_handle_len and stx_handle.  The buffer for
                      receiving statx info has at least
                      sizeof(struct statx)+128 bytes.

I think both the handle and the vol can be useful.
NFS can provide stx_handle but not stx_vol.  stx_handle is the thing
to use for equality testing, but it is only needed if
STATX_ATTR_INUM_NOT_UNIQUE is set.
stx_vol is useful for "du -x" or maybe "du --one-volume" or similar.
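
As a rough sketch of how a user-space tool might apply that rule: none of
these fields exist in struct statx today, so the code below uses its own
hypothetical struct (file_id_sketch) and an arbitrarily chosen flag bit
rather than real kernel definitions.  It assumes both files are already
known to be on the same filesystem, and that vol is 0 where the filesystem
(e.g. NFS) cannot report one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the proposed statx additions. */
#define SKETCH_ATTR_INUM_NOT_UNIQUE (1ULL << 0)  /* bit chosen arbitrarily */
#define SKETCH_HANDLE_MAX 128

struct file_id_sketch {
    uint64_t attributes;                 /* would be stx_attributes */
    uint64_t ino;                        /* would be stx_ino */
    uint64_t vol;                        /* proposed stx_vol, 0 if unknown */
    uint8_t  handle_len;                 /* proposed stx_handle_len */
    uint8_t  handle[SKETCH_HANDLE_MAX];  /* proposed stx_handle */
};

/* Are a and b the same file?  Assumes both are on the same filesystem. */
static bool same_file(const struct file_id_sketch *a,
                      const struct file_id_sketch *b)
{
    /* Different volume or inode number: certainly different files. */
    if (a->vol != b->vol || a->ino != b->ino)
        return false;

    /* Inode numbers are unique here, so vol + ino already decides it. */
    if (!((a->attributes | b->attributes) & SKETCH_ATTR_INUM_NOT_UNIQUE))
        return true;

    /* Inode numbers may collide: only the handle can confirm equality. */
    if (a->handle_len == 0 || a->handle_len > SKETCH_HANDLE_MAX ||
        a->handle_len != b->handle_len)
        return false;
    return memcmp(a->handle, b->handle, a->handle_len) == 0;
}
```

The handle is only consulted when the flag is set, so filesystems with
unique inode numbers keep the cheap comparison.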


Note that we *could* add stx_vol to NFSv4.2.  It is designed for
incremental extension.  I suspect we wouldn't want to rush into this,
but to wait to see if different volume-capable filesystems have other
details of volumes that are common and can usefully be exported by statx
- or NFS.

NeilBrown




