On Mon, 2011-03-14 at 13:10 -0700, Dr. Ed Morbius wrote:
> on 13:10 Sat 12 Mar, Alain Spineux (aspineux@xxxxxxxxx) wrote:
> > Hi
> >
> > I need to store about 250,000,000 files, each less than 4k.
> >
> > On ext4 (Fedora 14) the system crawls at 10,000,000 files in the
> > same directory.
> >
> > I tried to create hash directories, two levels of 4096 dirs =
> > 16,000,000, but I had to stop the script creating them after
> > hours, and "rm -rf" would have taken days! mkfs was my friend.
> >
> > I then tried two levels, the first of 4096 dirs, the second of 64.
> > Creating the hash dirs took "only" a few minutes, but copying
> > 10,000 files made my HD scream for 120s, versus only 10s when
> > working in a single directory.
> >
> > The filenames are all 27 chars, and the first chars can be used
> > to hash the files.

Exactly. {XY}/{XY}/{ABCDEFGHIJKLMNOPQRSTUVW} will probably work just
fine. Two characters is 676 combinations, hardly a large directory,
and two such levels give 676 x 676 = 456,976 leaf directories, which
puts fewer than 1,000 entries (about 550) in any one folder for
250,000,000 files.

> > My question is: which filesystem, and how should I store these
> > files?

> I'd also question the architecture and suggest an alternate
> approach: a hierarchical directory tree, a database, "nosql"
> hashing lookup, or something else. See squid for an example of
> using directory trees to handle very large numbers of objects.

Exactly. Squid and Cyrus IMAPd both manage to store massive numbers
of objects in a filesystem using hashing. It is simple and reliable.

I'd wonder, though, whether this is really a filesystem issue at all,
or whether you simply have an I/O throughput problem. Certainly,
trying any solution on a single disk is going to be awful.
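For illustration, a minimal Python sketch of the two-level
{XY}/{XY}/{rest} layout described above. The path split and the
676 x 676 fan-out come from the thread; the function names, the
`base` argument, and the lazy directory creation are assumptions for
the sake of the example, not something the posters specified:

    import os

    def hashed_path(base, filename):
        """Map a 27-char name to base/XY/XY/<remaining 23 chars>."""
        assert len(filename) == 27
        # First two chars pick the top-level dir, next two the second
        # level: 676 x 676 = 456,976 leaves, roughly 550 files each
        # for 250,000,000 files total.
        return os.path.join(base, filename[:2], filename[2:4], filename[4:])

    def store(base, filename, data):
        path = hashed_path(base, filename)
        # Create the two directory levels on first use rather than
        # pre-creating the whole tree up front (pre-creating 4096 x
        # 4096 dirs is what took the original poster hours).
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

Creating directories lazily like this keeps setup instant, and at
most 456,976 mkdir calls ever happen over the life of the store.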