On Fri, Mar 15, 2024 at 10:52:43AM -0700, Luis Chamberlain wrote:
> On Fri, Mar 15, 2024 at 02:48:27AM +0000, Matthew Wilcox wrote:
> > On Fri, Mar 15, 2024 at 12:14:05PM +1100, Dave Chinner wrote:
> > > On Thu, Mar 14, 2024 at 05:12:22PM -0700, Luis Chamberlain wrote:
> > > > Joining two 8 EB files with device-mapper seems allowed:
> > > >
> > > > truncate -s 8EB /mnt-pmem/sparse-8eb.1; losetup /dev/loop1 /mnt-pmem/sparse-8eb.1
> > > > truncate -s 8EB /mnt-pmem/sparse-8eb.2; losetup /dev/loop2 /mnt-pmem/sparse-8eb.2
> > > >
> > > > cat /home/mcgrof/dm-join-multiple.sh
> > > > #!/bin/sh
> > > > # Join multiple devices with the same size in a linear form
> > > > # We assume the same size for simplicity
> > > > set -e
> > > > size=`blockdev --getsz $1`
> > > > FILE=$(mktemp)
> > > > for i in $(seq 1 $#) ; do
> > > > 	offset=$(( ($i - 1) * $size ))
> > > > 	echo "$offset $size linear $1 0" >> $FILE
> > > > 	shift
> > > > done
> > > > cat $FILE | dmsetup create joined
> > > > rm -f $FILE
> > > >
> > > > /home/mcgrof/dm-join-multiple.sh /dev/loop1 /dev/loop2
> > > >
> > > > And mkfs.xfs seems to go through on them, i.e. it's not rejected
> > >
> > > Ah, I think mkfs.xfs has a limit of 8EiB on image files, maybe not
> > > on block devices. What's the actual limit of block device size on
> > > Linux?
> >
> > We can't seek past 2^63-1. That's the limit on lseek, llseek, lseek64
> > or whatever we're calling it these days. If we're missing a check
> > somewhere, that's a bug.
>
> Thanks, I can send fixes, just wanted to review some of these things
> with the community to explore what a big fat linux block device or
> filesystem might be constrained to, if any. The fact that through this
> discussion we're uncovering perhaps some missing checks is already
> useful. I'll try to document some of it.

I don't really care about some random documentation on some random
website about some weird corner case issue. Just fix the problems you
find and get the patches to mkfs.xfs merged.

Realistically, though, we just haven't cared about mkfs.xfs behaviour
at that scale because of one main issue: have you ever waited for
mkfs.xfs to create and then mount an ~8EiB XFS filesystem?

You have to wait through the hundreds of millions of synchronous writes
(as in "waits for each submitted write to complete", not O_SYNC) that
mkfs.xfs needs to do to create the filesystem, and then wait through
the hundreds of millions of synchronous reads that mount does in the
kernel to allow the filesystem to mount.

Hence we have not done any real validation of behaviour at that scale
because of the time and resource cost involved in just creating and
mounting filesystems at that scale. Unless you have many, many hours to
burn every time you want to mkfs and mount an XFS filesystem, it's just
not practical to even do basic functional testing at this scale.

And, really, mkfs.xfs is the least of the problems that need addressing
before we can test filesystems that large. We do full filesystem AG
walks at mount that need to be avoided, we need tens of GB of RAM to
hold all the AG information in kernel memory (we can't free per-AG
information on demand yet - that's part of the problem that makes
shrink so complex), we have algorithms that do linear AG walks that
depend on AG information being held in memory, etc. When you're talking
about an algorithm that can iterate all AGs in the filesystem 3 times
before failing, with 8.4 million AGs indexed, this is a serious
scalability problem.
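(As a rough back-of-the-envelope sketch of where that AG count comes
from, assuming the usual 1 TiB maximum AG size:

	# An 8 EiB filesystem is 2^63 bytes; with the maximum 1 TiB
	# (2^40 byte) AG size that works out to 2^(63 - 40) = 2^23 AGs.
	echo $(( 1 << (63 - 40) ))	# prints 8388608, ~8.4 million AGs

so every mount-time or allocator AG walk is touching millions of
in-memory per-AG structures.)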
IOWs, we've got years of development ahead of us to scale the
filesystem implementation out to handle filesystems larger than a few
PiB efficiently - mkfs.xfs limits are the most trivial of things
compared to the deep surgery that is needed to make 64-bit capacity
support a production-quality reality....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx