Re: Max theoretical XFS filesystem size in review

On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> On Thu, Feb 08, 2024 at 10:54:08AM +1100, Dave Chinner wrote:
> > On Wed, Feb 07, 2024 at 02:26:53PM -0800, Luis Chamberlain wrote:
> > > I'd like to review the max theoretical XFS filesystem size and
> > > if block size used may affect this. At first I thought that the limit which
> > > seems to be documented on a few pages online of 16 EiB might reflect the
> > > current limitations [0], however I suspect its an artifact of both
> > > BLKGETSIZE64 limitation. There might be others so I welcome your feedback
> > > on other things as well.
> > 
> > The actual limit is 8EiB, not 16EiB. mkfs.xfs won't allow a
> > filesystem over 8EiB to be made.
> 
> A truncated 9 EB file seems to go through:

<sigh>

9EB  = 9000000000000000000
8EiB = 9223372036854775808

So, 9EB < 8EiB and yes, mkfs.xfs will accept anything smaller than
8EiB...
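
To make the unit mix-up concrete, here is a minimal Python sketch of the arithmetic. The variable names are illustrative only. It relies on GNU truncate treating the "EB" suffix as a decimal (power of 1000) unit.

```python
# Decimal (SI) versus binary (IEC) exabytes, expressed in bytes.
EB = 10**18    # 1 EB  = 1000^6 bytes, what "truncate -s 9EB" uses
EiB = 2**60    # 1 EiB = 1024^6 bytes

nine_eb = 9 * EB
eight_eib = 8 * EiB

print(nine_eb)               # 9000000000000000000
print(eight_eib)             # 9223372036854775808
print(nine_eb < eight_eib)   # True: the 9EB file is below the 8EiB limit
```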

> truncate -s 9EB /mnt-pmem/sparse-9eb; losetup /dev/loop0 /mnt-pmem/sparse-9eb
> mkfs.xfs -K /dev/loop0
> meta-data=/dev/loop0             isize=512    agcount=8185453, agsize=268435455 blks

yup, agcount is clearly less than 8388608, so you've screwed up your
units there...
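
The agcount figure can be cross-checked with a rough Python sketch. It assumes agcount is simply the block count divided by agsize, rounded up. The real mkfs.xfs geometry code has more conditions than this, and the function name below is hypothetical.

```python
BLOCK_SIZE = 4096            # "bsize=" from the mkfs.xfs output
AG_SIZE_BLOCKS = 268435455   # "agsize=" from the mkfs.xfs output (2^28 - 1)


def approx_agcount(device_bytes):
    """Assumed model: round-up division of the block count by the AG size."""
    blocks = device_bytes // BLOCK_SIZE
    return -(-blocks // AG_SIZE_BLOCKS)   # ceiling division on integers


# 8EiB split into 1TiB allocation groups gives 2^23 = 8388608 AGs.
AGS_IN_8EIB = 2**63 // 2**40

print(approx_agcount(9 * 10**18))                 # 8185453, as reported
print(approx_agcount(9 * 10**18) < AGS_IN_8EIB)   # True
```

The same model gives 14551916 for a 16EB device, which matches the agcount reported for /dev/mapper/joined.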

> Joining two 8 EB files with device-mapper seems allowed:
> 
> truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
> 
> cat /home/mcgrof/dm-join-multiple.sh 
> #!/bin/sh
> # Join multiple devices with the same size in a linear form
> # We assume the same size for simplicity
> set -e
> size=`blockdev --getsz $1`
> FILE=$(mktemp)
> for i in $(seq 1 $#) ; do
>         offset=$(( ($i -1)  * $size))
> 	echo "$offset $size linear $1 0" >> $FILE
> 	shift
> done
> cat $FILE | dmsetup create joined
> rm -f $FILE
> 
> /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
> 
> And mkfs.xfs seems to go through on them, ie, its not rejected

Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not
on block devices. What's the actual limit of block device size on
Linux?

> mkfs.xfs -f /dev/mapper/joined
> meta-data=/dev/mapper/joined     isize=512    agcount=14551916, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=3906250000000000, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> Discarding blocks...
> 
> I didn't wait, should we be rejecting that?

Probably. mkfs.xfs uses uint64_t for the block counts and
arithmetic, so all the size and geometry calcs should work. The
problem is when we translate those sizes to byte counts, and then the
Linux kernel side has all sorts of problems because many things
described in bytes (like off_t and loff_t) are signed. Hence while
you might be able to make block devices larger than 8EiB, I'm pretty
sure you can't actually do things like pread()/pwrite() at offsets
above 8EiB on block devices....
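
As a sketch of why the block counts are fine but the byte counts are not, the following Python compares the reported geometry against unsigned and signed 64-bit ranges. The constant names are illustrative, not taken from any header.

```python
UINT64_MAX = 2**64 - 1   # range of the uint64_t block counts
INT64_MAX = 2**63 - 1    # largest value a signed 64-bit byte offset can hold

blocks = 3906250000000000   # "blocks=" from the mkfs.xfs output
bsize = 4096                # "bsize=" from the same output

size_bytes = blocks * bsize

print(size_bytes)                 # 16000000000000000000
print(blocks <= UINT64_MAX)       # True: the block count fits easily
print(size_bytes <= UINT64_MAX)   # True: the byte size still fits unsigned
print(size_bytes <= INT64_MAX)    # False: it does not fit a signed offset
```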

> Using -K does hit some failures on the bno number though:
> 
> mkfs.xfs -K -f /dev/mapper/joined
> meta-data=/dev/mapper/joined     isize=512    agcount=14551916, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=3906250000000000, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> mkfs.xfs: pwrite failed: Invalid argument
> libxfs_bwrite: write failed on (unknown) bno 0x6f05b59d3b1f00/0x100, err=22

daddr is 0x6f05b59d3b1f00. So let's convert that to a byte based
offset from a buffer daddr:

$ printf "0x%llx\n" $(( 0x6f05b59d3b1f00 << 9  ))
0xde0b6b3a763e0000
$

It's hard to see, but if I write it as 16 bit couplets:

	0xde0b 6b3a 763e 0000

You can see the high bit in the file offset is set, and so that's a
write beyond 8EiB that returned -EINVAL. That's exactly what
rw_verify_area() returns when loff_t *pos < 0 and the file does not
assert FMODE_UNSIGNED_OFFSET. Neither block based filesystems nor block
devices assert FMODE_UNSIGNED_OFFSET, so this write should always
fail with -EINVAL.
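
The sign problem can be shown with a short Python sketch that reinterprets the byte offset as a signed 64-bit value. The helper as_int64() is hypothetical and only mimics two's-complement wraparound. It is not kernel code.

```python
BBSHIFT = 9   # 512-byte basic blocks, so daddr << 9 gives a byte offset

daddr = 0x6f05b59d3b1f00
offset = daddr << BBSHIFT


def as_int64(value):
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    value &= 2**64 - 1
    return value - 2**64 if value >= 2**63 else value


print(hex(offset))             # 0xde0b6b3a763e0000
print(offset >> 63)            # 1: the high bit is set
print(as_int64(offset) < 0)    # True: negative when stored in a signed loff_t
```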

And where did it fail? You used "-f" which set force_overwrite,
which means we do a bit of zeroing of potential locations for old
XFS structures (secondary superblocks) and that silently swallows IO
failures, so it wasn't that. The next thing it does is whack
potential MD and GPT records at the end of the filesystem and that's
done in IO sizes of:

/*
 * amount (in bytes) we zero at the beginning and end of the device to
 * remove traces of other filesystems, raid superblocks, etc.
 */
#define WHACK_SIZE (128 * 1024)

128kB IOs. The above IO that failed with -EINVAL is a 128kB IO
(0x100 basic blocks). This will emit a warning message that the IO
failed (as per above), but it also swallows IO errors and lets mkfs
continue.
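
A small Python sketch ties the failed IO back to WHACK_SIZE. It assumes the tail whack starts exactly WHACK_SIZE bytes before the end of the device, and that the joined device is exactly 16EB (two 8EB loop devices), which agrees with the reported blocks= and bsize= values.

```python
WHACK_SIZE = 128 * 1024   # from the mkfs.xfs #define above
BASIC_BLOCK = 512         # bytes per basic block

# Length of one whack IO in basic blocks.
whack_bbs = WHACK_SIZE // BASIC_BLOCK

# Assumed device size: two 8EB (decimal) loop devices joined linearly.
device_bytes = 2 * 8 * 10**18

# Assumed start of the tail whack, as a byte offset and as a daddr.
tail_offset = device_bytes - WHACK_SIZE
tail_daddr = tail_offset // BASIC_BLOCK

print(hex(whack_bbs))     # 0x100
print(hex(tail_daddr))    # 0x6f05b59d3b1f00
```

Both values line up with the "bno 0x6f05b59d3b1f00/0x100" reported by libxfs_bwrite.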

> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: No space left on device
> libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=28

Yup, that's the next write to zap the first blocks of the device to
get rid of primary superblocks and other signatures from other types
of filesystems and partition tables. That failed with -ENOSPC, which
implies something went wrong in the dm/loop device IO/backing
file IO stage. Likely an 8EiB overflow problem somewhere.

> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: No space left on device
> libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=28

And that's the initial write of the superblock (single 512 byte
sector write) that failed with ENOSPC. Same error as the previous
write, same likely cause.

> mkfs.xfs: Releasing dirty buffer to free list!
> mkfs.xfs: libxfs_device_zero seek to offset 8000000394407514112 failed: Invalid argument

And yeah, there's the smoking gun: mkfs.xfs is attempting to seek to
an offset beyond 8EiB on the block device and that is failing.

IOWs, max supported block device size on Linux is 8EiB. mkfs.xfs
should really capture some of these errors, but largely the problem
here is that dm is allowing an unsupported block device mapping
to be created...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



