On Thu, Apr 14, 2022 at 02:00:24PM +1000, Chris Dunlop wrote:
> Hi,
>
> I have a nearly full 30T xfs filesystem that I need to grow significantly,
> e.g. to, say, 256T, and potentially further in future, e.g. up to, say, 1PB.

That'll be fun. :)

> Alternatively at some point I'll need to copy a LOT of data from the
> existing fs to a newly-provisioned much larger fs. If I'm going to need to
> copy data around I guess it's better to do it now, before there's a whole
> lot more data to copy.
>
> According to Dave Chinner:
>
> https://www.spinics.net/lists/linux-xfs/msg20084.html
>   Rule of thumb we've stated every time it's been asked in the past 10-15
>   years is "try not to grow by more than 10x the original size".
>
> It's also explained the issue is the number of AGs.
>
> Is it ONLY the number of AGs that's a concern when growing a fs?

No.

> E.g. for a fs starting in the 10s of TB that may need to grow substantially
> (e.g. >=10x), is it advisable to simply create it with the maximum available
> agsize, and you can then grow to whatever multiple without worrying about
> XFS getting ornery?

If you start with anything greater than 4-32TB, there's a good chance
you've already got maximally sized AGs....

> Looking at my fs and just considering the number of AGs (agcount)...
>
> My original fs has:
>
> meta-data=xxxx           isize=512    agcount=32, agsize=244184192 blks

Which is just short of maximally sized AGs. There's nothing to be
gained by reformatting to larger AGs here.

>          =               sectsz=4096  attr=2, projid32bit=1
>          =               crc=1        finobt=1, sparse=1, rmapbt=1
>          =               reflink=1    bigtime=0 inobtcount=0
> data     =               bsize=4096   blocks=7813893120, imaxpct=5
>          =               sunit=128    swidth=512 blks
> naming   =version 2      bsize=4096   ascii-ci=0, ftype=1
> log      =internal log   bsize=4096   blocks=521728, version=2
>          =               sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none           extsz=4096   blocks=0, rtextents=0
>
> If I do a test xfs_grow to 256T, it shows:
>
> metadata=xxxxx           isize=512    agcount=282, agsize=244184192 blks
>
> Creating a new fs on 256T, I get:
>
> metadata=xxxxx           isize=512    agcount=257, agsize=268435328 blks

Yup.

> So growing the fs from 30T to 256T I end up with an agcount ~10% larger (and
> agsize ~10% smaller) than creating a 256T fs from scratch.

Yup.

> Just for the exercise, creating a new FS on 1P (i.e. 33x the current fs)
> gives:
>
> metadata=xxxxx           isize=512    agcount=1025, agsize=268435328 blks

Yup.

> I.e. it looks like for this case the max agsize is 268435328 blocks.

Yup.

> So even if the current fs were to grow to a 1P or more, e.g. 30x - 60x
> original, I'm still only going to be ~10% worse off in terms of agcount
> than creating a large fs from scratch and copying all the data over.

Yup.

> Is that really going to make a significant difference?

No. But there will be significant differences.

e.g. think of the data layout and free space distribution of a 1PB
filesystem that is 90% full and had its data evenly distributed
throughout its capacity.

Now consider the free space distribution of a 100TB filesystem that has
been filled to 99% and then grown by 100TB nine times to a capacity of
90% @ 1PB. Where is all the free space? That's right - the free space is
only in the region that was appended in the last 100TB grow operation.
IOWs, 90% of the AGs are completely full, and the newly added 10% are
completely empty.

However, the allocation algorithms do linear target increments and
linear scans over *all AGs* trying to distribute the allocation across
the entire filesystem and to find the best available free space for
allocations.
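To illustrate the shape of the problem, here's a toy userspace sketch -
not the real XFS allocator code, and the AG numbers are just taken from
the grown 30T->256T example above with the assumption that only the
newly added AGs have free space - of what a wrap-around linear AG scan
ends up doing:

/*
 * Illustrative toy model only -- NOT the actual XFS allocator.
 * Assume AGs 0..256 are completely full (the original capacity) and
 * AGs 257..281 are empty (the recently grown region), then count how
 * many AGs a wrap-around linear scan examines per allocation.
 */
#include <stdio.h>
#include <stdbool.h>

#define AG_COUNT	282	/* the 30T fs after growing to 256T */
#define FIRST_EMPTY_AG	257	/* hypothetical: space only in grown AGs */

/* does this AG have a free extent big enough for the allocation? */
static bool ag_has_space(int agno)
{
	return agno >= FIRST_EMPTY_AG;
}

int main(void)
{
	int target = 0;			/* rotoring start point */
	long total_scanned = 0;
	int nallocs = 1000;		/* e.g. untarring lots of small files */

	for (int i = 0; i < nallocs; i++) {
		int agno = target;
		int scanned = 0;

		/* linear scan, wrapping around the whole AG space */
		do {
			scanned++;
			if (ag_has_space(agno))
				break;
			agno = (agno + 1) % AG_COUNT;
		} while (agno != target);

		total_scanned += scanned;
		target = (target + 1) % AG_COUNT;	/* linear target increment */
	}

	printf("average AGs examined per allocation: %ld\n",
	       total_scanned / nallocs);
	return 0;
}

With those assumed numbers the sketch averages well over a hundred AGs
examined per allocation, versus one or two when free space is spread
across all AGs.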
When you have hundreds of AGs and only 10% of them have usable free
space, this becomes a problem. e.g. if the locality algorithm targets
low numbered AGs that are full (and it will, because the target
increments and wraps in a linear fashion), then it might be scanning
hundreds of AGs before it finds one of the recently added high numbered
AGs with a big enough free space to allocate from.

Then consider that it is not unreasonable for the filesystem to hit this
case for thousands of consecutive allocations at a time (e.g. untarring
a tarball full of small files such as a kernel source tree will trigger
this), maybe even for every single allocation over a time span of
minutes or even hours.

IOWs, the scanning algorithms don't really scale to large numbers of AGs
when most of the AGs are full and cannot be allocated from, and
repeatedly growing full filesystems pushes the algorithms into highly
undesirable corner cases much, much faster than filesystems that started
off with that capacity...

IOWs, growing by more than 10x really starts to push the limits of the
algorithms regardless of the AG count it results in. It's not a capacity
thing - it's a reflection of the number of AGs with usable free space in
them and the algorithms used to find free space in those AGs.

The algorithms can be fixed, but it's not been an important issue to
solve because so few people are using grow[*] in this manner - growing
once or twice is generally as much as occurs over the life of a typical
production filesystem...

Cheers,

Dave.

[*] Now, if you have a 2GB filesystem and you grow it to several TB
(that's a nasty antipattern we see quite frequently in cloud
deployments), then having 10,000+ tiny AGs has these linear scan
problems as well as all sorts of other scalability issues related to the
sheer number of AGs, but that's a different set of large ag count
problems....

-- 
Dave Chinner
david@xxxxxxxxxxxxx