Re: On the many files problem

Yannick Gingras <ygingras@xxxxxxxxxxxx> writes:

> Greetings Git hackers,
>
> No doubt, you guys must have discussed this problem before but I will
> pretend that I can't find the relevant threads in the archive because
> Marc's search is kind of crude.
>
> I'm coding an application that will potentially store quite a bunch of
> files in the same directory so I wondered how I should do it.  I tried
> a few different files systems and I tried path hashing, that is,
> storing the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As
> far as I can tell, that's what Git does.  It turned out to be slower
> than anything except ext3 without dir_index.

We hash like d3/b07384d113 (the first two hex digits of the
object name become the directory), but your understanding of
what we do is more or less right.
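For illustration, here is a minimal Python sketch of that fan-out
scheme.  Git names a loose object by the SHA-1 of a small header
("blob <size>\0") followed by the content, and stores it under
.git/objects/ with the first two hex digits split off as a
subdirectory; the function name here is my own, not anything in
Git itself.

```python
import hashlib

def loose_object_path(data: bytes, objtype: str = "blob") -> str:
    """Compute the object name Git would give `data` and the
    fan-out path of the corresponding loose object: the first
    two hex digits become a directory, the remaining 38 the
    file name."""
    header = f"{objtype} {len(data)}\0".encode()
    sha = hashlib.sha1(header + data).hexdigest()
    return f".git/objects/{sha[:2]}/{sha[2:]}"

# "hello world\n" is the classic example from the Git book:
print(loose_object_path(b"hello world\n"))
# .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

(The on-disk file is additionally zlib-deflated; the hash and the
path do not depend on that.)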

If we had never introduced packed object storage, this issue
might have mattered, and we might have looked into it further to
improve loose object access performance.  But in reality, no
sane git user keeps millions of loose objects unpacked.
Changing the layout would also be a backward-incompatible change
for dumb-transport clients.  There is practically no upside, and
there are downsides, to changing it now.

Traditionally, avoiding large directories by hashing paths when
dealing with a large number of files was tried-and-proven wisdom
in many applications (e.g. web proxies, news servers).  Newer
filesystems do have tricks to let you quickly access a large
number of files in a single directory, and that lessens the need
for applications to play path-hashing games.

That is a good thing, but if such a trick makes the traditional
way of dealing with a large number of files _too costly_, it may
be striking the balance at the wrong point: it favors newly
written applications that assume large directories are Ok (or
ones written by people who do not know the historical behaviour
of filesystems) while punishing existing practice too heavily.

The person who is guilty of introducing the hashed loose object
store is intimately familiar with Linux.  I do not speak for
him, but if I have to guess, the reason he originally chose the
path hashing was because he just followed the tradition, and he
did not want to make the system too dependent on Linux or a
particular feature of underlying filesystems.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
