Re: Performance impact of a large number of commits

On Sat, 25 Oct 2008, Samuel Abels wrote:

> On Fri, 2008-10-24 at 13:11 -0700, david@xxxxxxx wrote:
>>> git commit explicitly (i.e., walking the tree to stat files for finding
>>> changes is not necessary).

>> I suspect that your limits would be filesystem/OS limits more than git
>> limits

>> at 5-10 files/commit you are going to be creating .5-1m files/day, even
>> spread across 256 directories this is going to be a _lot_ of files.

> The files are organized in a way that places no more than ~1,000 files
> into each directory. Will Git create a directory containing a larger
> number of object files? I can see that this would be a problem in our
> use case.

when git stores the copies of the files it takes a sha1 hash of the file contents and then stores the object in the directory
.git/objects/<first two hex digits of the hash>/<remaining 38 hex digits>
this means that files with identical content all fold together into a single object, but with lots of files changing rapidly the result is a lot of files in these object directories.
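
to make that concrete, here is a rough python sketch. it is not anything from git itself, just an illustration of the loose-object layout described above, assuming a plain blob that hasn't been packed yet:

    import hashlib

    def loose_object_path(content: bytes) -> str:
        """where git would store a blob with this content as a loose object"""
        # git hashes the header "blob <size>\0" followed by the raw file content
        header = b"blob %d\x00" % len(content)
        sha1 = hashlib.sha1(header + content).hexdigest()
        # the first two hex digits become the fan-out directory, the rest the file name
        return ".git/objects/%s/%s" % (sha1[:2], sha1[2:])

    print(loose_object_path(b"hello world\n"))
    # .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad

every distinct file content you commit produces one such loose object until it gets packed, so the 256 fan-out directories fill up quickly at your rates.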

it would be a pretty minor change to git to have it use more directories (in fact, there's another thread going on today where people are looking at making this configurable, in that case to reduce the number of directories)

the other storage format that git has is the pack file. it takes a bunch of the objects, does some comparisons between them (to find duplicate bits of files), and then stores the result (base files plus deltas to re-create the other files). the resulting compression is _extremely_ efficient, and it collapses many file objects into one pack file (addressing the issue of many files in one directory).
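
if you want to see what packing buys you on a sample of your data, something like this shows the before/after object counts. it is just a sketch that shells out to the stock commands (git count-objects -v and git repack -a -d are the relevant ones), run from inside the repository:

    import subprocess

    def git(*args) -> str:
        # run a git command in the current repository and return its output
        return subprocess.run(["git", *args], capture_output=True,
                              text=True, check=True).stdout

    print("before packing:\n" + git("count-objects", "-v"))
    # fold every loose object into a pack, then drop the now-redundant loose files
    git("repack", "-a", "-d")
    print("after packing:\n" + git("count-objects", "-v"))

on a repository with lots of similar files the drop in loose-object count and disk usage should be dramatic; the cost is the time the repack itself takes, which is the tradeoff below.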

>> packing this may help (depending on how much the files change), but with
>> this many files the work of doing the packing would be expensive.

> We can probably do that even if it takes several hours.

my concern is that spending time creating the pack files will mean that you don't have time to insert the new files.

that being said, there may be other ways of dealing with this data rather than putting it into files and then adding it to the git repository.

Git has a fast-import streaming format that is designed for programs that convert repositories from other SCM systems. if you can tell us more about what you are doing (how the data is being gathered; whether the files are re-created for each commit or modified in place; if they are modified, whether that means appending data, changing some data, or writing randomly throughout the file; etc.) there may be some other options available.
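
as a very rough sketch of what that could look like (the names, paths and file contents here are invented, and it assumes a fresh repository so the branch can be created from scratch; see the git-fast-import documentation for the full stream format), a collector could write commits straight into git fast-import without ever touching a working tree:

    import subprocess, time

    def commit_stream(files: dict) -> bytes:
        """build a minimal fast-import stream committing {path: content} in one commit"""
        now = int(time.time())
        msg = b"automated snapshot"
        out = [b"commit refs/heads/master\n",
               b"committer Data Collector <collector@example.com> %d +0000\n" % now,
               b"data %d\n%s\n" % (len(msg), msg)]
        for path, content in files.items():
            # 'inline' puts the blob content right in the stream,
            # instead of needing a separate blob command and mark
            out.append(b"M 100644 inline %s\n" % path.encode())
            out.append(b"data %d\n%s\n" % (len(content), content))
        out.append(b"\n")
        return b"".join(out)

    stream = commit_stream({"sensor/a.txt": b"reading 1\n",
                            "sensor/b.txt": b"reading 2\n"})
    subprocess.run(["git", "fast-import", "--quiet"], input=stream, check=True)

repeated commits are just more commit blocks appended to the same stream, and fast-import writes pack files directly, so you skip both the working tree and the flood of loose objects.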

at this point I don't know if git can work for you or not, but I'm pretty sure nothing else would have a chance at this scale.

David Lang
