Thanks to you and to Junio for your replies,

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:
> No, no. The fastest way to store a bunch of files is to basically:
>
> - make the filenames as small as possible
>
>   This improves density, and thus performance. You'll get more files to
>   fit in a smaller directory, and filename compares etc will be faster
>   too.
>
>   If you don't need hashes, but can do with smaller names (for example,
>   your names are really just sequential, and you can use some base-64
>   encoding to make them smaller than numbers), you'll always be better
>   off.

This is really interesting and I would not have suspected it. But it
raises the question: why does Git name its objects with the base-16
form of the hash instead of a base-64 form? After your replies I took a
serious look at Git's storage and there are indeed not that many loose
objects in a typical repo: most are kept in packs. So I guess Git
doesn't need that much density, and keeping the filename in the format
that is used in the UI probably helps power users.

> - store them in a good final order. This is debatable, but depending on
>   your load and the filesystem, it can be better to make sure that you
>   create all files in order, because performance can plummet if you start
>   removing files later.
>
>   I suspect your benchmark *only* tested this case,

That is true.

>   but if you want to check odder cases, try creating a huge
>   directory, and then deleting most files, and then adding a few
>   new ones. Some filesystems will take a huge hit because they'll
>   still scan the whole directory, even though it's mostly empty!
>
>   (Also, a "readdir() + stat()" loop will often get *much* worse access
>   patterns if you've mixed deletions and creations)

This is something that will be interesting to benchmark later on. So,
should an application with a lot of file turnover, say a mail server,
delete and re-create its directories from time to time? I assume this
is specific to some file system types.
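
For what it is worth, here is a rough sketch of how that "odder case"
could be measured. It is only a starting point under my own assumptions:
it is written in Python, it works in a scratch temporary directory, and
the function names (create_files, delete_most, scan, churn_benchmark)
and default sizes are made up for illustration. It also runs with a warm
page cache, so it will likely understate the effect on a cold disk.

```python
import os
import tempfile
import time


def create_files(directory, start, count):
    """Create `count` empty files with sequential names, beginning at `start`."""
    for i in range(start, start + count):
        with open(os.path.join(directory, f"{i:08d}"), "w"):
            pass


def delete_most(directory, keep_every):
    """Delete every file except each `keep_every`-th one (in sorted name order)."""
    for index, name in enumerate(sorted(os.listdir(directory))):
        if index % keep_every != 0:
            os.unlink(os.path.join(directory, name))


def scan(directory):
    """A readdir() + stat() loop; returns (number of entries, elapsed seconds)."""
    started = time.perf_counter()
    entries = 0
    for name in os.listdir(directory):
        os.stat(os.path.join(directory, name))
        entries += 1
    return entries, time.perf_counter() - started


def churn_benchmark(total=20000, keep_every=100, added=50):
    """Scan a freshly filled directory, then scan it again after churn.

    Churn here means: delete most files, then add a few new ones.
    """
    with tempfile.TemporaryDirectory() as directory:
        create_files(directory, 0, total)
        fresh = scan(directory)
        delete_most(directory, keep_every)
        create_files(directory, total, added)
        churned = scan(directory)
    return fresh, churned


if __name__ == "__main__":
    (n_fresh, t_fresh), (n_churned, t_churned) = churn_benchmark()
    print(f"fresh:   {n_fresh} entries scanned in {t_fresh:.4f}s")
    print(f"churned: {n_churned} entries scanned in {t_churned:.4f}s")
```

Comparing the time per entry of the two scans, on several file system
types, should show whether a mostly emptied directory still costs as
much to walk as a full one.
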
> - it's generally *much* more efficient to have one large file that you
>   read and seek in than having many small ones, so if you can change your
>   load so that you don't have tons of files at all, you'll probably be
>   better off.

That makes a lot of sense. Thanks again for those clarifications.

-- 
Yannick Gingras