Re: ext4 and extremely slow filesystem traversal

> I have trouble with the daily backup of a modest filesystem
> which tends to take more than 10 hours. [ ... ] with 196 GB
> (9.3M inodes) used.

That is roughly 1M inodes/hour and 20GB/hour, or nearly 300
inodes/s and nearly 6MB/s. These are actually very good numbers
for a high random IOPS load, and as seen later, that is what you
have.
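
A quick back-of-envelope check of those rates, taking the 9.3M
inodes and 196GB quoted above over a 10 hour run (just the
arithmetic, via 'bc'):

  # 9.3M inodes over 10 hours (36000s), in inodes/s
  $ echo 'scale=0; 9300000 / 36000' | bc
  258
  # 196GB over 10 hours, in MB/s
  $ echo 'scale=1; 196 * 1024 / 36000' | bc
  5.5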

> It's mounted 'defaults,noatime'.

That helps.

> It sits on a hardware RAID array thru plain LVM slices.

That's the usual pointless default... but it does not
particularly slow things down here.

> The RAID array is a RAID5 running on 5x SATA 500G disks, with a
> battery-backed (RAM) cache and write-back cache policy. To be
> precise, it's an Areca 1231. The hardware RAID array use 64kB
> stripes and I've configured the filesystem with 4kB blocks and
> stride=16.

Striping and alignment are not very relevant on reads, but the
stride matters a great deal for metadata parallelism, and here it
is set to 64KiB (stride=16 with 4KiB blocks). The array stride,
however, is 16KiB (a 4-wide data stripe of 64KiB), so the two do
not match exactly; but since the former is an integral multiple
of the latter it should be about as good. And since the backup
performance is pretty good, that seems to be the case.
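
For the record, if the per-disk chunk really is 16KiB, the
exactly matching 'ext4' geometry could have been set at mkfs
time roughly as below (the device name is made up for
illustration; 'tune2fs -E' can set the same values on an
existing filesystem):

  # 16KiB chunk / 4KiB blocks = stride of 4 blocks;
  # 4 data disks => stripe_width of 4*4 = 16 blocks
  mkfs.ext4 -b 4096 -E stride=4,stripe_width=16 /dev/VG0/backup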

> It also has 0 reserved blocks.

That's usually a truly terrible setting (around 20% reserved is a
much better value, as it gives the block allocator enough free
space to avoid fragmentation), but your filesystem is not very
full anyhow.
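
If you ever want reserved space back, that is a quick online
change with 'tune2fs' (the device name again made up):

  # reserve 20% of blocks for root (the mke2fs default is 5%)
  tune2fs -m 20 /dev/VG0/backup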

> When I try to back up the problematic filesystem with tar,
> rsync or whatever tool traversing the whole filesystem, things
> are awful.

Rather, they are pretty good. Each 500GB SATA disk can usually do
somewhat less than 100 random IOPS, there are 4 data disks in
each stripe when reading, and you are getting nearly 300 inodes/s
and over 5MB/s, quite close to the maximum. On random loads with
smallish records typical rotating disks deliver transfer rates of
0.5MB/s to 1.5MB/s, and you are getting rather more than that
(mostly thanks to the roughly 21KiB of data per inode: 196GB over
9.3M inodes).
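
As a rough cross-check, assuming around 75 random read IOPS per
SATA disk (an assumption, not a measurement) and the 4 data disks
per stripe mentioned above:

  # aggregate random read IOPS for the array
  $ echo '75 * 4' | bc
  300
  # times ~21KiB of data per inode, in MB/s
  $ echo 'scale=1; 300 * 21 / 1024' | bc
  6.1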

You are getting pretty good delivery from 'ext4' and a very low
random IOPS storage system on a highly randomized workload:

> I know that this filesystem has *lots* of directories, most
> with few or no files in them.

That's a really bad idea.

> Tonight I ran a simple 'find /path/to/vol -type d | pv -bl'
> (which counts directories as they are found); I stopped it more
> than 2 hours later: it was not done, and it had already counted
> more than 2M directories.

That's the usual 1M inodes/hour: 2M+ directories in a bit over 2
hours, or just under 300 inodes/s.

> [ ... ] I'm in search of any advice or direction to improve
> this situation. While still using ext4, of course :).

Well, any system administrator would tell you the same: your
backup workload and your storage system are mismatched, and the
best solution is probably to use 146GB SAS 15K RPM disks for the
same capacity (or more). Or perhaps recent enterprise-level SSDs.

The "small file" problem is ancient, and I call it the
"mailstore" problems from its typical incarnation:

  http://www.sabi.co.uk/blog/12-thr.html#120429

> PS: I did ask the developers to not abuse the filesystem
> that way,

The "I use the filesystem as a DBMS" attitude is really very
common among developers. It is cost-free to them, and backup (and
system) administrators bear the cost when the filesystem fills
up.  Because at the beginning everything looks fine. Designing
stuff that seems cheap and fast at the beginning even if it
becomes very bad after some time is a good way to look like a
winner in most organizations.

> and that in 2013 it's okay to have 10k+ files per directory...

It's not; it is a very bad idea. In 2013, just as in 1973 or in
1993, it is a much better idea to use simple indexed files to
keep a collection of smallish records.
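
As a minimal sketch of the "indexed file" alternative, here using
the stock 'sqlite3' command line tool (database, table and key
names are made up for illustration):

  # one indexed file instead of millions of small ones
  sqlite3 records.db \
    'CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value BLOB);'
  sqlite3 records.db \
    "INSERT OR REPLACE INTO records VALUES ('msg-001', 'smallish record contents');"
  sqlite3 records.db \
    "SELECT value FROM records WHERE key = 'msg-001';"

A whole-tree backup then becomes one largely sequential read of a
single file, instead of millions of scattered directory and inode
lookups.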

Directories are a classification system, not a database indexing
system. Here is an amusing report of the difference between the
two:

  http://www.sabi.co.uk/blog/anno05-4th.html#051016

> No success, so I guess I'll have to work around it.

As a backup administrator you can't get much better out of this
situation. You are already getting nearly the best possible
performance for whole-tree scans of very random small records on
a low random IOPS storage layer.
