Re: On the many files problem

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 31 Dec 2007 12:45:04 -0800 (PST)

On Mon, 31 Dec 2007, Yannick Gingras wrote:
> 
> This is really interesting and I would not have suspected it.  But it
> begs the question: why does Git use the base-16 hash instead of the
> base-64 hash?  

Because I was stupid, and I'm a lot more used to hex numbers than to 
base-64.

Also, the way the original encoding worked, the SHA1 of the object was 
actually the SHA1 of the *compressed* object, so you could verify the 
integrity of the object writing by just doing a

	sha1sum .git/objects/4f/a7491b073032b57c7fcf28c9222a5fa7b3a6b9

and it would return 4fa7491b073032b57c7fcf28c9222a5fa7b3a6b9 if everything 
was good. That was a nice bonus in the first few days of git development, 
when it all was a set of very low-level object routines hung together with 
prayers and duct-tape.

So consider it historical. It wasn't worth fixing, since it became obvious 
that the real fix would never be to try to make the individual files or 
filenames smaller.
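
For illustration only, the check described above can be sketched in Python 
instead of sha1sum. The function name matches_object_name is made up here, 
and the sketch assumes the original layout from this mail, where the file 
on disk hashes to its own name. Current git hashes the uncompressed object 
(header plus contents) instead, so this is expected to fail on a modern 
repository.

```python
import hashlib
import os


def matches_object_name(object_path):
    """Return True if the SHA-1 of the raw file bytes equals the object name.

    The name is rebuilt from the path: the two-hex-digit fan-out directory
    followed by the file name inside it.
    """
    fanout_dir, rest = os.path.split(object_path)
    expected = os.path.basename(fanout_dir) + rest

    # Hash the bytes exactly as stored on disk, like sha1sum would.
    with open(object_path, "rb") as f:
        actual = hashlib.sha1(f.read()).hexdigest()

    return actual == expected
```

Calling it on a path such as the one in the sha1sum example returns True 
only when the stored bytes hash to the name taken from the path.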

> >    (Also, a "readdir() + stat()" loop will often get *much* worse access 
> >    patterns if you've mixed deletions and creations)
> 
> This is something that will be interesting to benchmark later on.  So,
> an application with a lot of turnaround, say a mail server, should
> delete and re-create the directories from time to time?  I assume this
> is specific to some file system types.

This is an issue only for certain filesystems, and it's also an issue only 
for certain access patterns.

A mail server, for example, will seldom *scan* the directory. It will just 
open individual files by name. So it won't be hit by the "readdir+stat" 
issue, unless you actually do a "ls -l".

(There are exceptions. Some mailbox formats use a file per email in a 
directory. And yes, they tend to suck from a performance angle).

And you can avoid it. For example, on most unixish filesystems, you can 
get better IO access patterns by doing the readdir() into an array, then 
sorting it by inode number, and then doing the stat() in that order: that 
*often* (but not always - there's no guarantee what the inode number 
actually means) gives you better disk access patterns.
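
A rough sketch of that ordering trick, in Python rather than C to keep it 
short. The function name stat_in_inode_order is hypothetical, and using 
lstat() (what "ls -l" does) rather than stat() is an assumption. Whether 
this helps at all depends on the filesystem, as noted above.

```python
import os


def stat_in_inode_order(path):
    """Return [(name, stat_result)] for the entries of `path`,
    with the stat calls issued in ascending inode-number order."""
    # Pass 1: read the whole directory into an array. On Unix,
    # DirEntry.inode() comes from the directory entry itself,
    # so no per-file stat is needed yet.
    with os.scandir(path) as it:
        entries = [(entry.inode(), entry.name) for entry in it]

    # Pass 2: sort by inode number (ties broken by name).
    entries.sort()

    # Pass 3: stat each entry in that order.
    results = []
    for _inode, name in entries:
        try:
            st = os.lstat(os.path.join(path, name))
        except FileNotFoundError:
            # The entry was deleted between the scan and the stat.
            continue
        results.append((name, st))
    return results
```

The result comes back in inode order, not name order, so a caller that 
wants "ls -l" style output would sort the list by name afterwards.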

			Linus