Re: Many small files, best practise.

Ric Wheeler <rwheeler@xxxxxxxxxx> · Mon, 14 Sep 2009 07:34:39 -0400

On 09/14/2009 05:40 AM, Peter Grandi wrote:

RHEL 5.3
~1,000,000,000 files (1-30 KB each)
~7TB in total
//

I'm looking for a best practice when implementing this using
EXT3 (or some other FS if it shouldn't do the job.).

"best practice" would be a rather radical solution.

On average the reads dominate (99%); writes are only used for
updating and aren't part of the service provided.  The data
is divided into 200k directories, each with some 5k files.
This ratio (dir/files) can be altered to optimize FS
performance.

If you are writing to a local SATA disk, ext3/4 can write a
few thousand files/sec without doing any fsync() operations.
With fsync(), the rate drops considerably.

Unfortunately using 'fsync' is a good idea for production
systems.

Also note that writing 10^9 files at a rate of 10^3/s takes
10^6 seconds; roughly 11.6 days to populate the filesystem (or at
least that long to restore it from backups).
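
As a quick check of the arithmetic above, here is a minimal Python sketch. The function name is made up for illustration, and the 10^3 files/s rate is just the figure quoted in this thread, not a measurement.

```python
SECONDS_PER_DAY = 86400

def populate_time_days(n_files, files_per_second):
    """Days needed to create n_files at a steady files_per_second rate."""
    seconds = n_files / files_per_second
    return seconds / SECONDS_PER_DAY

# 10^9 files at 10^3 files/s is 10^6 seconds.
print(f"{populate_time_days(10**9, 10**3):.1f} days")  # prints "11.6 days"
```

So the population (or restore) time is a little over eleven days at that rate, and it scales linearly: halving the rate doubles the time.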

One thing that you can do when doing bulk loads of files (say, during a
restore or migration) is to use a two-phase write. First, write each of
a batch of files (say 1000 files at a time), then go back and
reopen/fsync/close them.

This will give you performance levels closer to not using fsync() and 
still give you good data integrity. Note that this usually is a good fit 
for this class of operations since you can always restart the bulk load 
if you have a crash/error/etc.
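
The two-phase write described above could be sketched roughly as follows in Python. This is only an illustration under some assumptions: the function names, the (name, data) input format and the flat target directory are all made up here, and a real loader would add error handling and would probably fsync the parent directory as well.

```python
import os
from itertools import islice

def write_batch_two_phase(directory, batch):
    """Write every (name, data) pair first, then fsync each file in a second pass."""
    paths = []
    # Phase 1: write all files in the batch without forcing them to disk.
    for name, data in batch:
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    # Phase 2: go back and reopen/fsync/close each file.
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    return paths

def bulk_load(directory, files, batch_size=1000):
    """Feed an iterable of (name, data) pairs through the two-phase writer in batches."""
    it = iter(files)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        total += len(write_batch_two_phase(directory, batch))
    return total
```

Something like `bulk_load("/mnt/restore/somedir", pairs)` (hypothetical path) would then write 1000 files, sync them, and move on to the next 1000. If the load is interrupted, the simplest recovery is to rerun it from the last batch known to be complete.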

To give this a try, you can use "fs_mark" to write, say, 100k files with
fsync called one file at a time (-S 1, its default) or use one of the batch
fsync modes (-S 3, for example).

One layout for directories that works well with this kind of
thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
MIN might be 0, 5, 10, ..., 55 for example).

As to the problem above and this kind of solution, I reckon that
it is utterly absurd (and I could have used much stronger words).

When you deal with systems that store millions of files, you are pretty much
always going to use some kind of made-up directory layout. The above
scheme works pretty well in that it correlates well to normal usage 
patterns and queries (and tends to have those subdirectories laid out 
contiguously).

You can always try to write 1 million files in a single subdirectory, 
but if you are writing your own application, using this kind of scheme 
is pretty trivial.
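
For what it is worth, a time-based layout of the YEAR/MONTH/DAY/HOUR/MIN kind takes only a few lines. The sketch below is illustrative: the function name, the zero-padded components and the root path are assumptions, not something prescribed in this thread.

```python
import os
from datetime import datetime

def time_bucket_path(root, ts, bucket_minutes=5):
    """Map a timestamp to root/YEAR/MONTH/DAY/HOUR/MIN, with MIN rounded down to the bucket."""
    minute = ts.minute - ts.minute % bucket_minutes  # 0, 5, 10, ..., 55 for 5-minute buckets
    return os.path.join(
        root,
        f"{ts.year:04d}",
        f"{ts.month:02d}",
        f"{ts.day:02d}",
        f"{ts.hour:02d}",
        f"{minute:02d}",
    )

# A file stamped 2009-09-14 07:34:39 lands in the 07:30 bucket.
print(time_bucket_path("data", datetime(2009, 9, 14, 7, 34, 39)))
```

The application would create the directory on first use (for example with os.makedirs(path, exist_ok=True)) and write the file into it. Files created close together in time end up in the same or neighbouring directories.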

   BTW, the sort of people who consider seriously such utter
   absurdities try to do a thorough job, and I don't want to
   know how the underlying storage system is structured :-).

If anything, consider the obvious (obvious except to those who
want to use a filesystem as a small record database), which is
'fsck' time, in particular given the structure of 'ext3' (or
'ext4') metadata.

fsck time has improved quite a lot recently with ext4 (and with xfs).

So: just don't use a filesystem as a database, spare us the
horror; use a database, even a simple one, which is not utterly
absurd.

Compare these two:

   http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html

In this case, doing the bulk load I described above (reading in sorted 
order, writing out in the same), would significantly reduce the time of 
the restore.

   http://lists.gllug.org.uk/pipermail/gllug/2005-October/055488.html

Anyhow I do see a lot of inane questions and "solutions" like
the above in various lists (usually the XFS one, which attracts
a lot of utter absurdities).

When reading files in ext3 (and ext4) or doing other bulk
operations like a large deletion, it is important to sort the
files by inode (do the readdir, get say all of the 5k files in
your subdir and then sort by inode before doing your bulk
operation).

Good idea, but it is best to avoid the cases where this matters.
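
The sort-by-inode approach mentioned above might look roughly like this in Python. It is a sketch only: the function names are made up, and it assumes a single flat subdirectory of regular files that fits comfortably in memory (about 5k entries, as in the original question).

```python
import os

def files_sorted_by_inode(directory):
    """Do the readdir first, then return regular-file paths ordered by inode number."""
    with os.scandir(directory) as it:
        entries = [
            (entry.inode(), entry.path)
            for entry in it
            if entry.is_file(follow_symlinks=False)
        ]
    entries.sort()  # tuples sort by inode number first
    return [path for _, path in entries]

def read_all(directory):
    """Read every file in inode order and return the total number of bytes read."""
    total = 0
    for path in files_sorted_by_inode(directory):
        with open(path, "rb") as f:
            total += len(f.read())
    return total
```

The same ordering can be used for a bulk deletion by calling os.unlink on each path instead of reading it.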

_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users