Re: On the many files problem

Thanks to you and to Junio for your replies, 

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:
> No, no. The fastest way to store a bunch of files is to basically:
>
>  - make the filenames as small as possible
>
>    This improves density, and thus performance. You'll get more files to 
>    fit in a smaller directory, and filename compares etc will be faster 
>    too.
>
>    If you don't need hashes, but can do with smaller names (for example, 
>    your names are really just sequential, and you can use some base-64 
>    encoding to make them smaller than numbers), you'll always be better 
>    off.

This is really interesting and I would not have suspected it.  But it
raises the question: why does Git store loose objects under base-16
(hex) names instead of a base-64 encoding of the hash?
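
(Purely as an illustration, not something from your message: a rough
sketch of the density difference, assuming 20-byte SHA-1 object names.
The base-64 alphabet and helper names below are made up.)

#include <stdio.h>
#include <string.h>

/* Filesystem-safe base-64 alphabet (hypothetical choice: '-' and '_'
 * instead of '+' and '/'). */
static const char b64[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/* 20-byte SHA-1 -> 40 hex characters. */
static void encode_hex(const unsigned char *sha1, char *out)
{
	for (int i = 0; i < 20; i++)
		sprintf(out + 2 * i, "%02x", sha1[i]);
}

/* 20-byte SHA-1 -> 27 base-64 characters. */
static void encode_b64(const unsigned char *sha1, char *out)
{
	unsigned int bits = 0, nbits = 0, j = 0;

	for (int i = 0; i < 20; i++) {
		bits = (bits << 8) | sha1[i];
		nbits += 8;
		while (nbits >= 6) {
			nbits -= 6;
			out[j++] = b64[(bits >> nbits) & 0x3f];
		}
	}
	if (nbits)
		out[j++] = b64[(bits << (6 - nbits)) & 0x3f];
	out[j] = '\0';
}

int main(void)
{
	unsigned char sha1[20] = { 0xde, 0xad, 0xbe, 0xef };	/* dummy hash */
	char hex[41], name[28];

	encode_hex(sha1, hex);
	encode_b64(sha1, name);
	printf("hex:     %zu chars  %s\n", strlen(hex), hex);
	printf("base-64: %zu chars  %s\n", strlen(name), name);
	return 0;
}

So each loose-object filename would drop from 40 to 27 characters,
which is where the directory-density win would come from.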

After your replies I took a serious look at Git's storage and there
are indeed not that many loose objects in a typical repo: most of them
are kept in packs.  So I guess Git doesn't need that much density, and
keeping the filenames in the same format the UI uses probably helps
power users.

>  - store them in a good final order. This is debatable, but depending on 
>    your load and the filesystem, it can be better to make sure that you 
>    create all files in order, because performance can plummet if you start 
>    removing files later.
>
>    I suspect your benchmark *only* tested this case, 

That is true.

>    but if you want to check odder cases, try creating a huge
>    directory, and then deleting most files, and then adding a few
>    new ones. Some filesystems will take a huge hit because they'll
>    still scan the whole directory, even though it's mostly empty!
>
>    (Also, a "readdir() + stat()" loop will often get *much* worse access 
>    patterns if you've mixed deletions and creations)

This will be interesting to benchmark later on.  So an application
with a lot of churn, say a mail server, should delete and re-create
its directories from time to time?  I assume this is specific to some
file system types.
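
(As a sketch of what such a benchmark would time, assuming the plain
readdir() + stat() loop mentioned above; the function is made up.)

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Walk one directory and stat() every entry, in readdir() order.  After
 * heavy create/delete churn, the order readdir() hands back may no
 * longer match the on-disk inode layout, which is the slowdown to
 * measure. */
static long stat_all(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	struct stat st;
	char path[4096];
	long n = 0;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (stat(path, &st) == 0)
			n++;
	}
	closedir(d);
	return n;
}

Timing stat_all() on a freshly populated directory, and again after
deleting most entries and adding a few new ones, should show the effect
you describe on the filesystems that have it.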

>  - it's generally *much* more efficient to have one large file that you 
>    read and seek in than having many small ones, so if you can change your 
>    load so that you don't have tons of files at all, you'll probably be 
>    better off. 

That makes a lot of sense.  
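
(For completeness, a tiny sketch of the single-large-file approach,
assuming the record offsets come from some index kept elsewhere; the
function name is made up.)

#include <unistd.h>

/* Read one record from a single packed file by offset, over an
 * already-open descriptor: no per-record open()/close() and no
 * directory lookup, unlike a one-file-per-record layout. */
static ssize_t read_record(int fd, off_t offset, void *buf, size_t len)
{
	return pread(fd, buf, len, offset);
}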

Thanks again for those clarifications.

-- 
Yannick Gingras
