Re: Debunking myths about metadata CRC overhead

On Tue, Jun 04, 2013 at 12:43:29PM +1000, Dave Chinner wrote:
> On Mon, Jun 03, 2013 at 03:00:53PM -0500, Geoffrey Wehrman wrote:
> > On Mon, Jun 03, 2013 at 05:44:52PM +1000, Dave Chinner wrote:
> > This will have significant impact
> > on SGI's DMF managed filesystems.
> 
> You're concerned about bulkstat performance, then? Bulkstat will CRC
> every inode it reads, so the increase in inode size is the least of
> your worries....
> 
> But bulkstat scalability is an unrelated issue to the CRC work,
> especially as bulkstat already needs application provided
> parallelism to scale effectively.

So, I just added a single threaded bulkstat pass to the fsmark
workload by passing xfs_fsr across the filesystem to test out what
impact it has. So, 50 million inodes in the directory structure:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
wall time	13m34.203s		14m15.266s
sys CPU		7m7.930s		8m52.050s
rate		61,425 inodes/s		58,479 inodes/s
efficiency	116,800 inodes/CPU/s	93,984 inodes/CPU/s
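
(For reference, the rate and efficiency numbers are just the inode
count divided by wall time and by system CPU time: ~50 million inodes
over 814s of wall time ≈ 61,400 inodes/s, and over 428s of system CPU
≈ 116,800 inodes/CPU/s.)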

So, really, it's not a particularly significant performance
differential. There certainly isn't any significant problem caused by
the larger inodes.  For comparison, the 8-way find workloads:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
wall time	5m33.165s		8m18.256s
sys CPU		18m36.731s		22m2.277s
rate		150,055 inodes/s	100,400 inodes/s
efficiency	44,800 inodes/CPU/s	37,800 inodes/CPU/s

Which makes me think something is not right with the bulkstat pass
I've just done. It's way too slow if a find+stat is 2-2.5x faster.

Ah, xfs_fsr only bulkstats 64 inodes at a time. That's right,
last time I did this I used bstat out of xfstests. On a CRC enabled
fs:

ninodes		runtime		sys time	read bw(IOPS)
64		14m01s		8m37s
128		11m20s		7m58s		35MB/s(5000)
256		 8m53s		7m24s		45MB/s(6000)
512		 7m24s		6m28s		55MB/s(7000)
1024		 6m37s		5m40s		65MB/s(8000)
2048		10m50s		6m51s		35MB/s(5000)
4096(default)	26m23s		8m38s

Ask bulkstat for too few or too many inodes at a time, and it all goes
to hell.  So if we get the bulkstat batch size right, a single threaded
bulkstat is faster than the 8-way find, and a whole lot more efficient
at it. But there is still effectively no performance differential worth
talking about between 256 byte and 512 byte inodes.
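
For anyone who wants to reproduce these numbers: the guts of bstat are
nothing more than a loop around the XFS_IOC_FSBULKSTAT ioctl with a
caller-chosen batch size. A minimal sketch of that loop (my own
cut-down version - the NINODES constant and the missing error handling
are mine, not what bstat actually ships):

/* minimal single threaded bulkstat walk - a sketch, not bstat itself */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* xfsprogs headers: XFS_IOC_FSBULKSTAT et al */

#define NINODES	1024	/* the batch size knob tuned in the table above */

int main(int argc, char **argv)
{
	struct xfs_fsop_bulkreq	req;
	struct xfs_bstat	*buf;
	__u64			lastino = 0;
	__s32			count;
	unsigned long long	total = 0;
	int			fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	buf = calloc(NINODES, sizeof(*buf));
	if (!buf)
		return 1;

	req.lastip = &lastino;	/* resume cookie, start from zero */
	req.icount = NINODES;	/* inodes requested per call */
	req.ubuffer = buf;
	req.ocount = &count;

	/* each call fills buf[] with stat data for 'count' inodes */
	while (ioctl(fd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0)
		total += count;

	printf("bulkstat'd %llu inodes\n", total);
	free(buf);
	close(fd);
	return 0;
}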

And, FWIW, I just hacked threading into bstat to run one thread per
AG, each scanning only its own AG (sketch below). It's not perfect -
it counts some inodes twice (threads * ninodes at most) before a
thread detects it has run into the next AG. This is on a 100TB
filesystem, so it runs 100 threads. CRC enabled fs:

ninodes		runtime		sys time	read bw(IOPS)
64		 1m53s		10m25s		220MB/s (27000)
256		 1m52s		10m03s		220MB/s (27000)
1024		 1m55s		10m08s		210MB/s (26000)

So when it's threaded, the small request size just doesn't matter -
there's enough IO to drive the system to being IOPS bound and that
limits performance.
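
For completeness, the shape of that per-AG threading looks something
like the sketch below. This is the idea only, not the actual hacked-up
bstat - the geometry handling, the clean per-inode AG boundary check
(instead of the double counting mentioned above) and all the helper
names are mine:

/* one bulkstat thread per AG - illustrative sketch only */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* XFS_IOC_FSBULKSTAT, XFS_IOC_FSGEOMETRY */

#define NINODES	256

static int	fd;
static int	aginoshift;	/* inode number bits below the AG number */

/* round-up log2, same idea as the superblock's sb_agblklog */
static int log2_roundup(unsigned int v)
{
	int	log = 0;

	while ((1ULL << log) < v)
		log++;
	return log;
}

static void *scan_ag(void *arg)
{
	unsigned long		agno = (unsigned long)arg;
	struct xfs_bstat	buf[NINODES];
	struct xfs_fsop_bulkreq	req;
	__u64			lastino = (__u64)agno << aginoshift;
	__u64			agend = (__u64)(agno + 1) << aginoshift;
	__s32			count;
	unsigned long		nscanned = 0;
	int			i;

	req.lastip = &lastino;
	req.icount = NINODES;
	req.ubuffer = buf;
	req.ocount = &count;

	while (ioctl(fd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0) {
		for (i = 0; i < count; i++) {
			if (buf[i].bs_ino >= agend)	/* next AG - stop */
				return (void *)nscanned;
			nscanned++;	/* "stat" buf[i] here */
		}
	}
	return (void *)nscanned;
}

int main(int argc, char **argv)
{
	struct xfs_fsop_geom	geo;
	pthread_t		*tid;
	unsigned long long	total = 0;
	unsigned long		agno;
	void			*ret;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
		perror(argv[1]);
		return 1;
	}

	/* AG number lives above log2(agblocks) + log2(inodes/block) bits */
	aginoshift = log2_roundup(geo.agblocks) +
		     log2_roundup(geo.blocksize / geo.inodesize);

	tid = calloc(geo.agcount, sizeof(*tid));
	for (agno = 0; agno < geo.agcount; agno++)
		pthread_create(&tid[agno], NULL, scan_ag, (void *)agno);
	for (agno = 0; agno < geo.agcount; agno++) {
		pthread_join(tid[agno], &ret);
		total += (unsigned long)ret;
	}

	printf("bulkstat'd %llu inodes from %u AGs\n", total, geo.agcount);
	free(tid);
	close(fd);
	return 0;
}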

Just to come full circle, the difference between the 256 byte inode,
no-CRC filesystem and the CRC enabled filesystem for a single threaded
bulkstat:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
ninodes		   1024			     1024
wall time	   5m22s		     6m37s
sys CPU		   4m46s		     5m40s
bw(IOPS)	     40MB/s(5000)	     65MB/s(8000)
rate		155,300 inodes/s	126,000 inodes/s
efficiency	174,800 inodes/CPU/s	147,000 inodes/CPU/s

Both follow the same ninodes profile, but less IO is done for the 256
byte inode filesystem and the inode throughput is higher. There's no
big surprise there; what does surprise me is that the difference isn't
larger. Let's drive it to being I/O bound with threading:

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
ninodes		    256			    256
wall time	   1m02s		   1m52s
sys CPU		   7m04s		  10m03s
bw/IOPS		    210MB/s (27000)	    220MB/s (27000)
rate		806,500 inodes/s	446,500 inodes/s
efficiency	117,900 inodes/CPU/s	 82,900 inodes/CPU/s

The 256 byte inode test is completely CPU bound - it can't go any
faster than that, and it just so happens to be pretty close to IO
bound as well. So, while there's double the throughput for 256 byte
inodes, it raises an interesting question: why are all the IOs only
8k in size?

That means the inode readahead that bulkstat is doing is not being
combined in the elevator - it is either being cancelled because there
is too much of it, or it is being dispatched immediately and so we are
IOPS limited long before we should be. i.e. there's still 500MB/s of
bandwidth available on this filesystem and we're issuing sequential,
adjacent 8k IOs.  Either way, it's not functioning as it should.

<blktrace>

Yup, immediate, explicit unplug and dispatch. No readahead batching
and the unplug is coming from _xfs_buf_ioapply().  Well, that is
easy to fix.
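
(The shape of the fix, for the record, is just the stock explicit
plugging pattern - roughly the three added lines below wrapped around
the existing inode readahead loop. The loop body and helper name here
are placeholders, not the real diff:

	/* struct blk_plug and blk_{start,finish}_plug() come from
	 * <linux/blkdev.h>; the readahead helper is a placeholder */
	struct blk_plug	plug;

	blk_start_plug(&plug);
	for (i = 0; i < nr_chunks; i++)
		bulkstat_chunk_readahead(i);	/* existing per-chunk RA */
	blk_finish_plug(&plug);

With the readahead batched under the plug, adjacent requests get merged
before dispatch instead of going out one 8k IO at a time.)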

		256 byte inodes,	512 byte inodes
		CRCs disabled		CRCs enabled
		---------------------------------------
ninodes		    256			    256
wall time	   1m02s		   1m08s
sys CPU		   7m07s		   8m09s
bw/IOPS		    210MB/s (13500)	    360MB/s (14000)
rate		806,500 inodes/s	735,300 inodes/s
efficiency	117,100 inodes/CPU/s	102,200 inodes/CPU/s

So, the difference in performance pretty much goes away. We burn
more bandwidth, but now the multithreaded bulkstat is CPU limited
for both non-crc, 256 byte inodes and CRC enabled 512 byte inodes.

What this says to me is that there isn't a bulkstat performance
problem that we need to fix apart from the 3 lines of code for the
readahead IO plugging that I just added.  It's only limited by
storage IOPS and available CPU power, yet the bandwidth is
sufficiently low that any storage system that SGI installs for DMF
is not going to be stressed by it. IOPS, yes. Bandwidth, no.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
