On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > [This is silly and has no real purpose except to explore the limits.
> > If that offends you, don't read the rest of this email.]
>
> We do this quite frequently ourselves, even if it is just to remind
> ourselves how long it takes to wait for millions of IOs to be done.
>
> > I am trying to create an XFS filesystem in a partition of approx
> > 2^63 - 1 bytes to see what happens.
>
> Should just work. You might find problems with the underlying
> storage, but the XFS side of things should just work.
>
> I'm trying to reproduce it here:
>
> $ grep vdd /proc/partitions
>  253       48 9007199254739968 vdd
> $ sudo mkfs.xfs -f -s size=1024 -d size=2251799813684887b -N /dev/vdd
> meta-data=/dev/vdd               isize=512    agcount=8388609, agsize=268435455 blks
>          =                       sectsz=1024  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2251799813684887, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=1024  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> And it is running now without the "-N" and I have to wait for tens
> of millions of IOs to be issued. The write rate is currently about
> 13,000 IOPS, so I'm guessing it'll take at least an hour to do
> this. Next time I'll run it on the machine with faster SSDs.
>
> I haven't seen any error after 20 minutes, though.

I killed it after two and a half hours, and started looking at why it
was taking that long. That's the above. But it's not fast.

This is the first time I've looked at whether we perturbed the IO
patterns in the recent mkfs.xfs refactoring. I'm not sure we made them
any worse (the algorithms are the same), but it's now much more
obvious how we can improve them drastically with a few small mods.

Firstly, there's the force overwrite algorithm that zeros the old
filesystem signature. On an 8EB device with an existing 8EB
filesystem, that's 8+ million single sector IOs right there. So for
the moment, zero the first 1MB of the device to whack the old
superblock and you can avoid this step. I've got a fix for that now:

Time to mkfs a 1TB filesystem on a big device after it held another,
larger filesystem:

previous FS size    10PB     100PB    1EB
old mkfs time       1.95s    8.9s     81.3s
patched             0.95s    1.2s     1.2s
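In the meantime, zeroing the first megabyte by hand before running
mkfs gets you the same effect. Something like this - purely
illustrative, not the mkfs.xfs change itself, and with only minimal
error handling:

/*
 * Zero the first 1MB of a device so the old primary superblock is
 * gone and the force overwrite walk never triggers.  Illustration
 * only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ZERO_LEN	(1024 * 1024)

int main(int argc, char **argv)
{
	char	*buf;
	int	fd;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <device>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 1MB of zeros, written at offset 0 */
	buf = calloc(1, ZERO_LEN);
	if (!buf || pwrite(fd, buf, ZERO_LEN, 0) != ZERO_LEN) {
		perror("pwrite");
		return 1;
	}

	fsync(fd);
	close(fd);
	return 0;
}

(A dd of a single 1MB block of zeros over the start of the device does
the same job, of course.)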
Second, use -K to avoid discard (which you already know).

Third, we do two passes over the AG headers to initialise them.
Unfortunately, with a large number of AGs, they don't stay in the
buffer cache and so the second pass involves RMW cycles. This means we
do at least 5 more read and 5 more write IOs per AG than we need to.
I've got a fix for this, too:

Time to make a filesystem from scratch, using a zeroed device so the
force overwrite algorithms are not triggered and -K to avoid discards:

FS size             10PB     100PB    1EB
current mkfs        26.9s    214.8s   2484s
patched             11.3s    70.3s    709s

From that projection, the 8EB mkfs would have taken somewhere around
7-8 hours to complete. The new code should only take a couple of
hours. Still not all that good....

.... and I think that's because we are using direct IO. That means the
IO we issue is effectively synchronous, even though we're sort of
doing delayed writeback. The problem is that mkfs is not threaded, so
writeback happens when the cache fills up and we run out of buffers on
the free list. Basically it's "direct delayed writeback" at that
point.

Worse, because it's synchronous, we don't drive more than one IO at a
time and so we don't get adjacent sector merging, even though most of
the AG header writes are to adjacent sectors. That would cut the
number of IOs from ~10 per AG down to 2 for sectorsize < blocksize
filesystems and 1 for sectorsize = blocksize filesystems.

This isn't so easy to fix. I either need to:

1) thread the libxfs buffer cache so we can do this writeback in the
   background;
2) thread mkfs so it can process multiple AGs at once; or
3) make libxfs use AIO via delayed write infrastructure similar to
   what we have in the kernel (buffer lists).

Approach 1) does not solve the queue depth = 1 issue, so it's of
limited value. It might be quick, but it doesn't really get us much
improvement.

Approach 2) drives deeper queues, but it doesn't solve the adjacent
sector IO merging problem because each thread only has a queue depth
of one. So we'll be able to do more IO, but IO efficiency won't
improve. And, realistically, this isn't a good idea because
out-of-order AG processing doesn't work on spinning rust - it just
causes seek storms and things go slower. To make things faster on
spinning rust, we need single threaded, in-order dispatch,
asynchronous writeback. Which is almost what 1) is, except it's not
asynchronous.

That's what 3) solves - single threaded, in-order, async writeback,
controlled by the context creating the dirty buffers in a limited AIO
context (a very rough sketch of the shape of it is appended below my
sig).

I'll have to think about this a bit more....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
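Appendix: the sketch mentioned above. Purely an illustration of the
shape of 3) - this is not libxfs code, all the structure and helper
names are made up, and it just pushes batches of pwrites through a
small libaio context rather than implementing real delwri buffer
lists:

/*
 * Single threaded, in-order, bounded-depth async writeback sketch
 * using libaio.  Build with: cc -O2 -o delwri-sketch delwri-sketch.c -laio
 */
#define _GNU_SOURCE
#include <libaio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH	64		/* limit on in-flight IOs */
#define BUF_SIZE	4096
#define NR_WRITES	1024		/* pretend AG header buffers */

static void fail(const char *what, int err)
{
	fprintf(stderr, "%s: %s\n", what, strerror(err));
	exit(1);
}

int main(int argc, char **argv)
{
	io_context_t	ctx = 0;
	struct iocb	iocbs[QUEUE_DEPTH];
	struct iocb	*iocbp[QUEUE_DEPTH];
	struct io_event	events[QUEUE_DEPTH];
	void		*bufs[QUEUE_DEPTH];
	int		fd, i, ret, queued = 0;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <device or file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY | O_DIRECT);
	if (fd < 0)
		fail("open", errno);

	ret = io_setup(QUEUE_DEPTH, &ctx);
	if (ret)
		fail("io_setup", -ret);

	/* O_DIRECT wants aligned memory */
	for (i = 0; i < QUEUE_DEPTH; i++) {
		ret = posix_memalign(&bufs[i], 4096, BUF_SIZE);
		if (ret)
			fail("posix_memalign", ret);
		memset(bufs[i], 0, BUF_SIZE);
	}

	/*
	 * Dirty buffers go out in ascending offset order in batches of
	 * up to QUEUE_DEPTH, so adjacent writes can be merged below us
	 * while submission stays single threaded and in-order.
	 */
	for (i = 0; i < NR_WRITES; i++) {
		io_prep_pwrite(&iocbs[queued], fd, bufs[queued], BUF_SIZE,
			       (long long)i * BUF_SIZE);
		iocbp[queued] = &iocbs[queued];
		queued++;

		if (queued == QUEUE_DEPTH || i == NR_WRITES - 1) {
			ret = io_submit(ctx, queued, iocbp);
			if (ret != queued)
				fail("io_submit", ret < 0 ? -ret : EIO);
			ret = io_getevents(ctx, queued, queued, events, NULL);
			if (ret != queued)
				fail("io_getevents", ret < 0 ? -ret : EIO);
			queued = 0;
		}
	}

	io_destroy(ctx);
	close(fd);
	return 0;
}

The point is only that the dispatching context stays single threaded
and in-order while enough adjacent IOs are in flight at once for the
block layer to merge them.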