> [ ... ]
> [ ... Extracting a kernel 'tar' with GNU tar on 'ext3': ]
>>> real 0m21.769s
> [ ... Extracting a kernel 'tar' with GNU tar on XFS: ]
>>> real 2m20.522s

>> [ ... ] in most cases the wrong number is the one for 'ext3'
>> on RAID1 (way too small). Even the number for XFS and RAID0
>> 'delaylog' is a wrong number (somewhat small) in many cases.
>> There are 38000 files in 440MB in 'linux-2.6.38.tar', ~40% of
>> them are smaller than 4KiB and ~60% smaller than 8KiB. Also you
>> didn't flush caches, and you don't say whether the filesystems
>> are empty or full or at the same position on the disk.
>>
>> Can 'ext3' really commit 1900 small files per second (including
>> directory updates) to a filesystem on a RAID1 that probably can
>> do around 100 IOPS? That would be amazing news.

In the real world 'ext3', as reported in my previous message, can
"really commit" around 50 "small files per second (including
directory updates)" in near-optimal conditions to a storage device
that can probably do around 100 IOPS; copying here the actual
numbers:

  %  mount -t ext3 -o relatime /dev/sdb /mnt/sdb
  %  time sh -c 'cd /mnt/sdb; star -x -b 2048 -f /tmp/linux-2.6.38.tar; cd /; umount /mnt/sdb'
  star: 420 blocks + 81920 bytes (total of 440483840 bytes = 430160.00k).

  real    12m49.610s
  user    0m0.990s
  sys     0m8.610s
  ....
  %  df -BM /mnt/sdb
  Filesystem     1M-blocks  Used  Available  Use%  Mounted on
  /dev/sdb         469455M  687M    444922M    1%  /mnt/sdb
  %  df -i /mnt/sdb
  Filesystem       Inodes  IUsed     IFree  IUse%  Mounted on
  /dev/sdb       30531584  38100  30493484     1%  /mnt/sdb

As a side note, even 12m49.610s is probably a bit optimistic
because of the 1s timestamp resolution of 'ext3':

  http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg272253.html

> Of course it can.

And a pony! Or rather 'O_PONIES' :-).

> Why? Because the allocator is optimised to pack small files
> written at the same time together on disk, and the elevator
> will merge them into one large IO when they are finally
> written to disk.
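For anyone who wants to check the arithmetic, the "around 50"
figure falls straight out of the transcript above, using the two
numbers it reports (38100 inodes used per 'df -i', 12m49.610s
elapsed per 'time'); a trivial sketch:

```python
# Sanity check of the "around 50 small files per second" figure,
# using only the numbers from the transcript above.
files = 38100                       # inodes used, per 'df -i'
elapsed = 12 * 60 + 49.610          # elapsed seconds, per 'time'
rate = files / elapsed
print(f"{rate:.1f} files/s committed")   # ~49.5, i.e. around 50
```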
> With a typical 512k max IO size, that's 128
> <=4k files packed into each IO,

This is an argument based on a cunning or distracted or ignorant
shift of the goalposts: it is an argument about purely *writing*
the *data* in those small files, while the bigger issue is
*committing* the *metadata*, all of it "(including directory
updates)".

This argument is also based on the assumption that it is
permissible to commit 128 small files when the last one gets
closed, not when each gets committed.

In this discussion it is rather comical to make an argument based
on the speed of IO using what is in effect EatMyData as described
here:

  http://talk.maemo.org/showthread.php?t=67901

but here it is:

> In a perfect world, we're talking about ~13000 4k files a
> second being written to disk @ 100 IOPS. In the real world,
> writing an order of magnitude less files per second is quite
> obtainable.

But in the real world the "quite obtainable" number with 'ext3'
for "really commit [ ... ] small files per second (including
directory updates)" on storage that "probably can do around 100
IOPS" is around *50* (fifty), not 1,300, never mind 13,000.

Sure, if one wants to look instead at whatever numbers clever
"benchmarks" can deliver, one can get:

  %  mount -t ext3 -o relatime /dev/sdb /mnt/sdb
  %  time sh -c 'cd /mnt/sdb; star -x -b 2048 -f /tmp/linux-2.6.38.tar -no-fsync; cd /; umount /mnt/sdb'
  star: 420 blocks + 81920 bytes (total of 440483840 bytes = 430160.00k).

  real    0m27.414s
  user    0m0.270s
  sys     0m2.430s

That's a fantastic result, somewhat over 1,300 small files per
second (14 commits per nominal IOPS), but "fantastic" (as in
fantasy) is the keyword, because it is for completely different
and broken semantics, a point that should not be lost on anybody
who can "understand IOPS and metadata and commits and caching".
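The semantic difference being glossed over fits in a few lines.
This is a minimal sketch, not star's actual code (the helper name
and the 4KiB payload are illustrative): committing each extracted
file with fsync() before moving on, versus leaving the data in
the page cache, which is what '-no-fsync' (and GNU tar by
default) effectively does.

```python
# Hedged sketch of the two extraction disciplines being compared.
import os
import tempfile

def extract_one(path, data, commit):
    """Write one extracted file; optionally force it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        if commit:
            # fsync() turns this into one or more real IOs per file:
            # the data plus, on a journalled filesystem, the metadata
            # transaction. This is what bounds the rate at ~IOPS.
            os.fsync(fd)
    finally:
        os.close(fd)
    # A fully careful extractor would also fsync the parent directory
    # so that the directory entry itself is committed.

with tempfile.TemporaryDirectory() as d:
    extract_one(os.path.join(d, "a"), b"x" * 4096, commit=True)
    extract_one(os.path.join(d, "b"), b"x" * 4096, commit=False)
    sizes = [os.path.getsize(os.path.join(d, n)) for n in ("a", "b")]
print(sizes)
```

Both calls return at the same apparent "speed" to a naive
benchmark, but only the first file is on stable storage when the
call returns; the second exists only as dirty pages until the
kernel gets around to writing them.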
It is not as if the difference isn't widely known:

  http://cdrecord.berlios.de/private/man/star/star.1.html

    Star is a very fast tar(1) like tape archiver with improved
    functionality.
    On operating systems with slow file I/O (such as Linux), it
    may help to use -no-fsync in addition, but then star is
    unable to detect all error conditions; so use with care.

That GNU 'tar' does not commit files when extracting is pretty
old news, and therefore as I wrote in a previous message on a
similar detail:

  There is something completely different: a tradeoff between
  levels of safety (whether you want committed transactions or
  not and how finely grained) and time to completion.

But when one sees comical "performance" comparisons without even
cache flushing, explaining the difference between a performance
problem and different safety/speed tradeoffs seems a bit wasted.

Again, the fundamental problem is how many committed IOPS the
storage system can do given a metadata (and thus journal)
intensive load (the answer is "not many" per spinning medium).

Plus of course:

>> Despite decades of seeing it happen, I keep being astonished by
>> how many people (some with decades of "experience") just don't
>> understand IOPS and metadata and commits and caching and who

> Oh, the irony.... :)

Indeed :-).

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs