Re: On the many files problem

On Sat, 29 Dec 2007, Yannick Gingras wrote:
> 
> I'm coding an application that will potentially store quite a bunch of
> files in the same directory so I wondered how I should do it.  I tried
> a few different file systems and I tried path hashing, that is,
> storing the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As
> far as I can tell, that's what Git does.  It turned out to be slower
> than anything except ext3 without dir_index.

Note that git doesn't hash paths because it *wants* to, but because it 
*needs* to. 

Basically, what git fundamentally wants as its basic operation is to 
look up a SHA1 hashed object. 

> Quick like that, I would be tempted to say that hashing paths always
> makes things slower but the Git development team includes people with
> really intimate knowledge of several file system implementations so
> I'm tempted to say that you guys know something that I don't.

No, no. The fastest way to store a bunch of files is to basically:

 - make the filenames as small as possible

   This improves density, and thus performance. You'll get more files to 
   fit in a smaller directory, and filename compares etc will be faster 
   too.

   If you don't need hashes, but can do with smaller names (for example, 
   your names are really just sequential, and you can use some base-64 
   encoding to make them smaller than numbers), you'll always be better 
   off.

 - store them in a good final order. This is debatable, but depending on 
   your load and the filesystem, it can be better to make sure that you 
   create all files in order, because performance can plummet if you start 
   removing files later.

   I suspect your benchmark *only* tested this case, but if you want to 
   check odder cases, try creating a huge directory, and then deleting 
   most files, and then adding a few new ones. Some filesystems will take 
   a huge hit because they'll still scan the whole directory, even though 
   it's mostly empty!

   (Also, a "readdir() + stat()" loop will often get *much* worse access 
   patterns if you've mixed deletions and creations)

 - it's generally *much* more efficient to have one large file that you 
   read and seek in than having many small ones, so if you can change your 
   load so that you don't have tons of files at all, you'll probably be 
   better off.
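
To illustrate the first point above (small names), here is a minimal sketch, 
assuming your names really are just sequence numbers. The alphabet and the 
short_name() helper are made up for illustration; '-' and '_' stand in for 
the usual '+' and '/' because '/' cannot appear in a filename.

```python
import string

# 64 filename-safe characters (assumed alphabet, not any standard encoding).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "-_"


def short_name(n: int) -> str:
    """Encode a non-negative sequence number in base 64, most significant digit first."""
    if n < 0:
        raise ValueError("sequence numbers must be non-negative")
    digits = []
    while True:
        n, rem = divmod(n, 64)
        digits.append(ALPHABET[rem])
        if n == 0:
            break
    return "".join(reversed(digits))
```

With this, file number 1000000 gets a 4-character name instead of 7 decimal 
digits. Note that the sketch relies on a case-sensitive filesystem, and that 
a name starting with '-' can be awkward for command-line tools.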

Anyway, git is not a good thing to look at for "lots of files" performance 
if you have any options. The reasons git does what it does are:

 - as mentioned, the SHA1 hash is the fundamental unit of lookup, so the 
   name *must* be the hash, since that's all we ever have!

 - If you don't have indexing, you really *really* don't want to have a 
   flat file structure (as your benchmark found out). And git didn't want 
   to assume you have indexing, since a lot of filesystems don't (I 
   suspect a lot of ext3 installations _still_ don't enable it, for 
   example, even if the capability is there!)

So the first point explains the use of hashes for filenames, and the 
second point explains the use of a single level of indexing. But neither 
of them is about absolute performance per se; both are about basic 
requirements.
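
The two points combine into something like the sketch below: the hash is the 
name, and its first two hex characters pick one of 256 subdirectories. The 
loose_object_path() helper and the "objects" directory are illustrative names 
only, and the digest is a plain SHA1 of some bytes, not a real git object id 
(git hashes a header along with the content).

```python
import hashlib
import os


def loose_object_path(objects_dir: str, hex_digest: str) -> str:
    """Return objects_dir/<first two hex chars>/<remaining hex chars>."""
    if len(hex_digest) < 3:
        raise ValueError("digest too short to split into directory and file name")
    return os.path.join(objects_dir, hex_digest[:2], hex_digest[2:])


# Plain SHA1 of some example bytes, used only to get a realistic 40-char name.
digest = hashlib.sha1(b"example content\n").hexdigest()
path = loose_object_path("objects", digest)
```

Even without directory indexing, a lookup then scans one small subdirectory 
rather than one huge flat directory.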

For good performance, look at the git pack-files. Those are designed to be 
data-dense, and to be only a few files that can be opened and mapped just once. 
There, performance was one of the guiding goals (along with the ability to 
also use them for streaming, and still being *reasonably* simple, of 
course).
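
As a rough sketch of the general idea only (this is NOT the git pack format, 
which also compresses, stores deltas and keeps its index in a separate file): 
put many small blobs into one file, remember where each one starts, and map 
the file once. write_pack(), read_blob() and the file name are hypothetical.

```python
import mmap
import os
import tempfile


def write_pack(path, blobs):
    """Write every blob back to back into one file; return {name: (offset, length)}."""
    index = {}
    with open(path, "wb") as f:
        for name, data in blobs.items():
            index[name] = (f.tell(), len(data))
            f.write(data)
    return index


def read_blob(mapped, index, name):
    """Slice one blob out of the already-mapped file."""
    offset, length = index[name]
    return mapped[offset:offset + length]


blobs = {"a": b"hello", "b": b"many small files", "c": b"one big file"}
pack_path = os.path.join(tempfile.mkdtemp(), "demo.pack")
index = write_pack(pack_path, blobs)

# One open() and one mmap() serve every lookup afterwards.
with open(pack_path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    results = {name: read_blob(mapped, index, name) for name in blobs}
```

Each lookup is then a dictionary hit plus a slice of the mapping, with no 
per-object open() or directory scan.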

			Linus