[Please fix your mail program to correctly quote replies - I've done it
manually here so I could work out what you wrote.]

On Tue, Jul 05, 2016 at 01:43:33AM +0000, Wang Shilong wrote:
> From: Dave Chinner [david@xxxxxxxxxxxxx]
> On Tue, Jul 05, 2016 at 08:52:26AM +1000, Dave Chinner wrote:
> > On Mon, Jul 04, 2016 at 05:32:40AM +0000, Wang Shilong wrote:
> > > dd 16GB to /dev/shm/data to use memory backend storage to
> > > benchmark metadata performance.
> >
> > I've never seen anyone create a ramdisk like that before.
> > What's the backing device type? i.e. what block device driver does
> > this use?
>
> I guess you mean the loop device here? It is a common file and set up
> as the loop0 device here.

For me, the "common" way to test a filesystem with RAM backing it is
to use the brd driver because it can do DAX, is just as light weight
and scalable, and doesn't have any of the quirks that the loop device
has. This is why I ask people to fully describe their hardware,
software and config - assumptions only lead to misunderstandings.

> > > Benchmark tool is mdtest, you can download it from
> > > https://sourceforge.net/projects/mdtest/
> >
> > What version? The sourceforge version, or the github fork that the
> > sourceforge page points to? Or the forked branch of recent
> > development in the github fork?
>
> I don't think the sourceforge version or the github version makes any
> difference here, you could use either of them. (I used the sourceforge
> version)

They are different, and there's evidence of many nasty hacks in the
github version. It appears that some of them come from the sourceforge
version. Not particularly confidence inspiring.

> > > Steps to run benchmark
> > > #mkfs.xfs /dev/shm/data
> >
> > Output of this command so we can recreate the same filesystem
> > structure?
>
> [root@localhost shm]# mkfs.xfs data
> meta-data=data               isize=512    agcount=4, agsize=1025710 blks
>          =                   sectsz=512   attr=2, projid32bit=1
>          =                   crc=1        finobt=1, sparse=0
> data     =                   bsize=4096   blocks=4102840, imaxpct=25
>          =                   sunit=0      swidth=0 blks
> naming   =version 2          bsize=4096   ascii-ci=0 ftype=1
> log      =internal log       bsize=4096   blocks=2560, version=2
>          =                   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none               extsz=4096   blocks=0, rtextents=0

As I suspected, mkfs optimised the layout for the small size, not
performance. Performance will likely improve if you increase the log
size to something more reasonably sized for heavy metadata workloads.
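Just so we're talking about the same setup, something like the
following is what I mean by using brd with a bigger log. This is only
an untested sketch - the 512MB log size is an illustrative value, not
a tuned recommendation, and brd's rd_size parameter is in KiB:

# modprobe brd rd_nr=1 rd_size=16777216    (one 16GB ramdisk at /dev/ram0)
# mkfs.xfs -f -l size=512m /dev/ram0       (much larger log than the 2560 block default above)
# mount /dev/ram0 /mnt/test

You can then point mdtest at /mnt/test exactly as you did below.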
> > > #mount /dev/shm/data /mnt/test
> > > #mdtest -d /mnt/test -n 2000000
> > >
> > > 1 tasks, 2000000 files/directories
> > >
> > > SUMMARY: (of 1 iterations)
> > >    Operation              Max          Min         Mean    Std Dev
> > >    ---------              ---          ---         ----    -------
> > >    Directory creation:  24724.717    24724.717    24724.717    0.000
> > >    Directory stat    : 1156009.290  1156009.290  1156009.290   0.000
> > >    Directory removal :  103496.353   103496.353   103496.353   0.000
> > >    File creation     :   23094.444    23094.444    23094.444   0.000
> > >    File stat         : 1158704.969  1158704.969  1158704.969   0.000
> > >    File read         :  752731.595   752731.595   752731.595   0.000
> > >    File removal      :  105481.766   105481.766   105481.766   0.000
> > >    Tree creation     :    2229.827     2229.827     2229.827   0.000
> > >    Tree removal      :       1.275        1.275        1.275   0.000
> > >
> > > -- finished at 07/04/2016 12:54:26 --
> >
> > A table of numbers with no units or explanation as to what they
> > mean. Let me guess - I have to read the benchmark source code to
> > understand what the numbers mean?
>
> You could look at File Creation; the unit means the number of files
> created per second. (Here it is 23094.444)

Great. What about all the others? How is the directory creation number
different to file creation? What about "tree creation"? What is the
difference between them - a tree implies multiple things are being
indexed, so that's got to be different in some way from file and
directory creation?

Indeed, if these are all measuring operations per second, then why is
tree creation 2000x faster than tree removal when file and directory
removal are 4x faster than creation? They can't all be measuring
single operations, and so the numbers are essentially meaningless
without being able to understand how they are different.

> > > IOPS for file creation is only 2.3W, however compared to Ext4
> > > with the same testing.
> >
> > Ummm - what unit of measurement is "W"? Watts?
>
> Sorry, same as above..

So you made it up?

> > IOWs: Being CPU bound at 25,000 file creates/s is in line with
> > what I'd expect on XFS for a single threaded, single directory
> > create over 2 million directory entries with the default 4k
> > directory block size....
> ----------
>
> I understand that this is the single thread limit, but I guess there
> is some other limit here, because even with a single thread the speed
> of creating 50W files is twice that of 200W files.

What does this W unit mean now? It's not 10,000 ops/s, like above,
because that just makes no sense at all.

Again: please stop using shorthand or abbreviations that other people
will not understand. If you meant "the file create speed is different
when creating 50,000 files versus creating 200,000 files", then write
it out in full, because then everyone understands exactly what you
mean.

/Assuming/ this is what you meant, then it's pretty obvious why they
are different - it's basic CS algorithms and math. Answer these two
questions, and you have your answer as to what is going on:

1. How does the CPU overhead of btree operations scale with increasing
   numbers of items in the btree?

2. What does that do to the *average* insert rate for N insertions
   into an empty tree for increasing values of N?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs