On Fri, Mar 15, 2024 at 10:52:43AM -0700, Luis Chamberlain wrote:
> On Fri, Mar 15, 2024 at 02:48:27AM +0000, Matthew Wilcox wrote:
> > On Fri, Mar 15, 2024 at 12:14:05PM +1100, Dave Chinner wrote:
> > > On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> > > > Joining two 8 EB files with device-mapper seems allowed:
> > > >
> > > > truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> > > > truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
> > > >
> > > > cat /home/mcgrof/dm-join-multiple.sh
> > > > #!/bin/sh
> > > > # Join multiple devices with the same size in a linear form
> > > > # We assume the same size for simplicity
> > > > set -e
> > > > size=`blockdev --getsz $1`
> > > > FILE=$(mktemp)
> > > > for i in $(seq 1 $#) ; do
> > > > 	offset=$(( ($i - 1) * $size ))
> > > > 	echo "$offset $size linear $1 0" >> $FILE
> > > > 	shift
> > > > done
> > > > cat $FILE | dmsetup create joined
> > > > rm -f $FILE
> > > >
> > > > /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
> > > >
> > > > And mkfs.xfs seems to go through on them, i.e. it's not rejected
> > >
> > > Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not
> > > on block devices. What's the actual limit of block device size on
> > > Linux?
> >
> > We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
> > or whatever we're calling it these days. If we're missing a check
> > somewhere, that's a bug.
>
> Thanks, I can send fixes, just wanted to review some of these things
> with the community to explore what a big fat linux block device or
> filesystem might be constrained to, if any. The fact that through this
> discussion we're uncovering perhaps some missing checks is already
> useful. I'll try to document some of it.

I don't really care about some random documentation on some random
website about some weird corner case issue. Just fix the problems you
find and get the patches to mkfs.xfs merged.

Realistically, though, we just haven't cared about mkfs.xfs behaviour
at that scale because of one main issue: have you ever waited for
mkfs.xfs to create and then mount an ~8EiB XFS filesystem?

You have to wait through the hundreds of millions of synchronous writes
(as in "waits for each submitted write to complete", not O_SYNC) that
mkfs.xfs needs to do to create the filesystem, and then wait through
the hundreds of millions of synchronous reads that mount does in the
kernel to allow the filesystem to mount.

Hence we have not done any real validation of behaviour at that scale
because of the time and resource cost involved in just creating and
mounting filesystems at that scale. Unless you have many, many hours to
burn every time you want to mkfs and mount an XFS filesystem, it's just
not practical to even do basic functional testing at this scale.

And, really, mkfs.xfs is the least of the problems that need addressing
before we can test filesystems that large. We do full filesystem AG
walks at mount that need to be avoided, we need tens of GB of RAM to
hold all the AG information in kernel memory (we can't free per-AG
information on demand yet - that's part of the problem that makes
shrink so complex), we have algorithms that do linear AG walks that
depend on AG information being held in memory, etc. When you're talking
about an algorithm that can iterate all AGs in the filesystem 3 times
before failing, with 8.4 million AGs indexed, this is a serious
scalability problem.
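(As a rough back-of-the-envelope sketch of where that AG count comes
from, assuming the usual 1 TiB maximum AG size:

	# An 8 EiB filesystem is 2^63 bytes; with the maximum 1 TiB
	# (2^40 byte) AG size that works out to 2^(63 - 40) = 2^23 AGs.
	echo $(( 1 << (63 - 40) ))	# prints 8388608, ~8.4 million AGs

so every mount-time or allocator AG walk is touching millions of
in-memory per-AG structures.)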
IOWs, we've got years of development ahead of us to scale the
filesystem implementation out to handle filesystems larger than a few
PiB efficiently - mkfs.xfs limits are the most trivial of things
compared to the deep surgery that is needed to make 64-bit capacity
support a production-quality reality....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx