Re: Performance impact of a large number of commits

david@xxxxxxx · Fri, 24 Oct 2008 22:29:58 -0700 (PDT)

On Sat, 25 Oct 2008, Samuel Abels wrote:

On Fri, 2008-10-24 at 13:11 -0700, david@xxxxxxx wrote:
git commit explicitly (i.e., walking the tree to stat files for finding
changes is not necessary).

I suspect that your limits would be filesystem/OS limits more than git
limits

at 5-10 files/commit you are going to be creating .5-1m files/day, even
spread across 256 directories this is going to be a _lot_ of files.

The files are organized in a way that places no more than ~1.000 files
into each directory. Will Git create a directory containing a larger
number of object files? I can see that this would be a problem in our
use case.

when git stores the copies of the files it does a sha1 hash of the file
contents (plus a short header) and then stores the file in the directory
.git/objects/<first two digits of the hash>/<remaining digits of the hash>
this means that if you have files that have the same content they all fold
together, but with lots of files changing rapidly the result is a lot of
files in these object directories.
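
as a rough sketch of that naming scheme (assuming a repository using the default sha1 object format; loose_object_path is a made-up helper name for illustration, not something git ships):

```python
import hashlib

def loose_object_path(content: bytes) -> str:
    """Return the path, relative to the repository root, where git would
    store `content` as a loose blob object."""
    # git hashes a short header ("blob <size>\0") followed by the raw bytes
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + content).hexdigest()
    # the first two hex digits name the directory, the other 38 name the file
    return ".git/objects/" + digest[:2] + "/" + digest[2:]

print(loose_object_path(b"hello world\n"))
# identical content always maps to the same path, so duplicates fold together
print(loose_object_path(b"hello world\n") == loose_object_path(b"hello world\n"))
```

since the directory is picked by the hash, you have no control over which of the 256 directories a given file lands in; the objects just spread out roughly evenly.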

it would be a pretty minor change to git to have it use more directories 
(in fact, there's another thread going on today where people are looking 
at making this configurable, in that case to reduce the number of 
directories)
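
to put rough numbers on this, here is a back-of-the-envelope sketch. it assumes the .5-1m new files/day figure from earlier in the thread, an even spread of hashes, and nothing ever being packed. it only counts file objects (commit and tree objects add more), and a fan-out of anything other than two hex digits is hypothetical, not an existing git setting.

```python
def objects_per_directory(total_objects: int, fanout_hex_digits: int) -> float:
    """Average number of loose objects per directory if the first
    `fanout_hex_digits` hex digits of the hash name the directory."""
    directories = 16 ** fanout_hex_digits
    return total_objects / directories

for new_objects in (500_000, 1_000_000):
    for digits in (2, 3, 4):
        avg = objects_per_directory(new_objects, digits)
        print(f"{new_objects:>9} objects, {16 ** digits:>5} dirs: ~{avg:,.0f} per dir")
```

with today's two-digit layout that is roughly 2,000-4,000 new object files per directory per day.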

the other storage format that git has is the pack file. it takes a bunch
of the objects, does some comparisons between them (to find duplicate bits
of files), and then stores the result (base files plus deltas to re-create
other files). the resulting compression is _extremely_ efficient, and it
collapses many file objects into one pack file (addressing the issue of
many files in one directory)
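
a toy illustration of why storing similar objects together is cheaper than storing each one by itself. this is not git's delta algorithm, just plain zlib from the python standard library run over made-up data, so treat the numbers as a rough intuition only:

```python
import hashlib
import zlib

# build ~4 KB of data that does not compress well on its own
base = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(200))
# a second "revision" of the same file with a little data appended
revised = base + b"one more record\n"

# loose-object style: every revision is compressed by itself
separate = len(zlib.compress(base)) + len(zlib.compress(revised))
# pack style (very roughly): similar data is stored together, so the
# second revision costs little more than its difference from the first
together = len(zlib.compress(base + revised))

print("compressed separately:", separate, "bytes")
print("compressed together:  ", together, "bytes")
```

how much you actually gain depends on how similar consecutive versions of your files are.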

packing this may help (depending on how much the files change), but with
this many files the work of doing the packing would be expensive.

We can probably do that even if it takes several hours.

my concern is that spending time creating the pack files will mean that 
you don't have time to insert the new files.

that being said, there may be other ways of dealing with this data rather 
than putting it into files and then adding it to the git repository.

Git has a fast-import streaming format that is designed for programs to 
use that are converting repositories from other SCM systems. if you can 
tell more about what you are doing (how the data is being gathered, are 
the files re-created for each commit, or are they being modified? if they 
are being modified is it appending data, changing some data, or randomly 
writing throughout the file? etc) there may be some other options 
available.
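
to give an idea of what feeding data straight to fast-import could look like, here is a minimal sketch that builds one commit in the stream format without writing any files to a working tree first. fast_import_commit, the committer name and the e-mail address are placeholders for illustration, and it assumes simple paths that need no quoting:

```python
def fast_import_commit(branch: str, message: str, files: dict, when: int) -> bytes:
    """Build one commit in git fast-import stream format.

    `files` maps a path to the full new content (bytes) of that file;
    `when` is the commit time in seconds since the epoch."""
    def data(payload: bytes) -> bytes:
        # "data <byte count>" followed by exactly that many raw bytes
        return b"data %d\n" % len(payload) + payload + b"\n"

    out = [b"commit refs/heads/%s\n" % branch.encode()]
    out.append(b"committer Importer <importer@example.com> %d +0000\n" % when)
    out.append(data(message.encode()))
    for path, content in sorted(files.items()):
        # "inline" means the file content follows directly as a data block
        out.append(b"M 100644 inline %s\n" % path.encode())
        out.append(data(content))
    out.append(b"\n")
    return b"".join(out)

stream = fast_import_commit("master", "add sample", {"dir/a.txt": b"hello\n"}, 1224912598)
print(stream.decode())
```

the resulting bytes would be written to the standard input of `git fast-import` run inside the repository. many commits can go down the same pipe in one run, which avoids starting a new git process for every commit.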

at this point I don't know if git can work for you or not, but I'm pretty 
sure nothing else will have a chance with your size.

David Lang