On Tue, Sep 04, 2018 at 06:23:32PM +1000, Dave Chinner wrote:
> On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > [This is silly and has no real purpose except to explore the limits.
> > > If that offends you, don't read the rest of this email.]
> > 
> > We do this quite frequently ourselves, even if it is just to remind
> > ourselves how long it takes to wait for millions of IOs to be done.
> > 
> > > I am trying to create an XFS filesystem in a partition of approx
> > > 2^63 - 1 bytes to see what happens.
> > 
> > Should just work. You might find problems with the underlying
> > storage, but the XFS side of things should just work.
> > 
> > I'm trying to reproduce it here:
> > 
> > $ grep vdd /proc/partitions
> >  253       48 9007199254739968 vdd
> > $ sudo mkfs.xfs -f -s size=1024 -d size=2251799813684887b -N /dev/vdd
> > meta-data=/dev/vdd               isize=512    agcount=8388609, agsize=268435455 blks
> >          =                       sectsz=1024  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=1, rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=2251799813684887, imaxpct=1
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >          =                       sectsz=1024  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > And it is running now without the "-N" and I have to wait for tens
> > of millions of IOs to be issued. The write rate is currently about
> > 13,000 IOPS, so I'm guessing it'll take at least an hour to do
> > this. Next time I'll run it on the machine with faster SSDs.
> > 
> > I haven't seen any error after 20 minutes, though.
> 
> I killed it after two and a half hours, and started looking at why it
> was taking that long.

That's the above. Or the below. Stand on your head if you're confused.

-Dave.

> But it's not fast. This is the first time I've looked at whether we
> perturbed the IO patterns in the recent mkfs.xfs refactoring. I'm
> not sure we made them any worse (the algorithms are the same), but
> it's now much more obvious how we can improve them drastically with
> a few small mods.
> 
> Firstly, there's the force-overwrite algorithm that zeros the old
> filesystem signature. On an 8EB device with an existing 8EB
> filesystem, that's 8+ million single-sector IOs right there. So for
> the moment, zero the first 1MB of the device to whack the old
> superblock and you can avoid this step. I've got a fix for that now:
> 
> Time to mkfs a 1TB filesystem on a big device after it held another,
> larger filesystem:
> 
> previous FS size    10PB     100PB    1EB
> old mkfs time       1.95s    8.9s     81.3s
> patched             0.95s    1.2s     1.2s
> 
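If anyone wants to apply that 1MB zeroing workaround by hand while waiting
for the fix, a plain dd over the start of the device does it (e.g.
dd if=/dev/zero of=/dev/vdd bs=1M count=1, with /dev/vdd standing in for
the target device). Spelled out as a minimal C sketch - purely
illustrative, not the mkfs.xfs change itself, and the device path is only
an example:

/*
 * Illustrative sketch: zero the first 1MB of a device so a stale
 * filesystem signature doesn't trigger mkfs.xfs's force-overwrite
 * zeroing pass.  The device path is an example only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        static char zeros[1024 * 1024];         /* 1MB, zero-initialised */
        int fd = open("/dev/vdd", O_WRONLY);    /* example device */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* overwrite the region holding the old primary superblock */
        if (pwrite(fd, zeros, sizeof(zeros), 0) != (ssize_t)sizeof(zeros)) {
                perror("pwrite");
                return 1;
        }
        /* make sure it's on stable storage before running mkfs */
        if (fsync(fd) < 0) {
                perror("fsync");
                return 1;
        }
        close(fd);
        return 0;
}

Either way, the point is just what the paragraph above says: get rid of
the old superblock so mkfs never starts the sector-by-sector
signature-zeroing pass.
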
> Second, use -K to avoid discard (which you already know).
> 
> Third, we do two passes over the AG headers to initialise them.
> Unfortunately, with a large number of AGs they don't stay in the
> buffer cache, so the second pass involves RMW cycles. This means we
> do at least 5 more read and 5 more write IOs per AG than we need to.
> I've got a fix for this, too:
> 
> Time to make a filesystem from scratch, using a zeroed device so the
> force-overwrite algorithms are not triggered, and -K to avoid
> discards:
> 
> FS size             10PB     100PB    1EB
> current mkfs        26.9s    214.8s   2484s
> patched             11.3s    70.3s    709s
> 
> From that projection, the 8EB mkfs would have taken somewhere around
> 7-8 hours to complete. The new code should only take a couple of
> hours. Still not all that good....
> 
> .... and I think that's because we are using direct IO. That means
> the IO we issue is effectively synchronous, even though we're sort
> of doing delayed writeback. The problem is that mkfs is not
> threaded, so writeback happens when the cache fills up and we run
> out of buffers on the free list. Basically it's "direct delayed
> writeback" at that point.
> 
> Worse, because it's synchronous we don't drive more than one IO at a
> time, and so we don't get adjacent sector merging even though most
> of the AG header writes are to adjacent sectors. That merging would
> cut the number of IOs from ~10 per AG down to 2 for sector size <
> block size filesystems and 1 for sector size = block size
> filesystems.
> 
> This isn't so easy to fix. I either need to:
> 
> 1) thread the libxfs buffer cache so we can do this writeback in the
>    background;
> 2) thread mkfs so it can process multiple AGs at once; or
> 3) make libxfs use AIO via delayed-write infrastructure similar to
>    what we have in the kernel (buffer lists).
> 
> Approach 1) does not solve the queue depth = 1 issue, so it's of
> limited value. It might be quick to do, but it doesn't really get us
> much improvement.
> 
> Approach 2) drives deeper queues, but it doesn't solve the adjacent
> sector IO merging problem because each thread only has a queue depth
> of one. So we'll be able to do more IO, but IO efficiency won't
> improve. And, realistically, this isn't a good idea because
> out-of-order AG processing doesn't work on spinning rust - it just
> causes seek storms and makes things go slower. To make things faster
> on spinning rust, we need single-threaded, in-order dispatch,
> asynchronous writeback - which is almost what 1) is, except it's not
> asynchronous.
> 
> That's what 3) solves: single-threaded, in-order, async writeback,
> controlled by the context creating the dirty buffers in a limited
> AIO context. I'll have to think about this a bit more....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 

-- 
Dave Chinner
david@xxxxxxxxxxxxx
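To make option 3) a bit more concrete, here is a rough sketch of what
single-threaded, in-order, bounded-depth async writeback can look like
with Linux AIO (libaio). Everything in it - the names, the fixed 4k
buffer size, the queue depth of 16 - is invented for illustration; it is
not the libxfs buffer cache or the kernel delwri code, just the shape of
the idea: dirty buffers are submitted in AG order, and completions are
reaped only when the in-flight limit is hit.

/*
 * Sketch only: single-threaded, in-order, async writeback of a list
 * of dirty buffers with a bounded number of IOs in flight, using
 * Linux libaio (link with -laio).  All names and sizes here are
 * illustrative, not taken from xfsprogs.
 */
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_INFLIGHT    16      /* bounded AIO queue depth */
#define BUF_SIZE        4096    /* illustrative buffer size */

struct dirty_buf {
        struct iocb iocb;       /* one iocb per buffer */
        void *data;             /* BUF_SIZE bytes, aligned for O_DIRECT */
        long long offset;       /* device offset of this buffer */
};

/* reap at least @min_nr completions to make room for more IO */
static int reap(io_context_t ctx, int min_nr)
{
        struct io_event events[MAX_INFLIGHT];
        int ret = io_getevents(ctx, min_nr, MAX_INFLIGHT, events, NULL);

        if (ret < 0) {
                fprintf(stderr, "io_getevents: %d\n", ret);
                exit(1);
        }
        return ret;
}

/*
 * Write @nbufs dirty buffers to @fd in the order they sit in @bufs
 * (i.e. ascending AG order), never allowing more than MAX_INFLIGHT
 * IOs outstanding.  @ctx comes from io_setup(MAX_INFLIGHT, &ctx).
 */
static void delwri_submit(io_context_t ctx, int fd,
                          struct dirty_buf *bufs, int nbufs)
{
        int inflight = 0;

        for (int i = 0; i < nbufs; i++) {
                struct iocb *iocb = &bufs[i].iocb;

                io_prep_pwrite(iocb, fd, bufs[i].data, BUF_SIZE,
                               bufs[i].offset);

                /* throttle: wait for a completion if the queue is full */
                if (inflight == MAX_INFLIGHT)
                        inflight -= reap(ctx, 1);

                if (io_submit(ctx, 1, &iocb) != 1) {
                        fprintf(stderr, "io_submit failed\n");
                        exit(1);
                }
                inflight++;
        }

        /* drain the rest before returning */
        while (inflight > 0)
                inflight -= reap(ctx, 1);
}

Because dispatch stays in order while multiple IOs are in flight,
adjacent sector writes are exposed to the block layer together and can
be merged - exactly the merging that the current synchronous,
one-IO-at-a-time direct IO pattern never allows.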