On Sat, 25 Oct 2008, Samuel Abels wrote:
On Fri, 2008-10-24 at 13:11 -0700, david@xxxxxxx wrote:
git commit explicitly (i.e., walking the tree to stat files for finding
changes is not necessary).
I suspect that your limits would be filesystem/OS limits more than git
limits
at 5-10 files/commit you are going to be creating .5-1m files/day, even
spread across 256 directories this is going to be a _lot_ of files.
The files are organized in a way that places no more than ~1,000 files
into each directory. Will Git create a directory containing a larger
number of object files? I can see that this would be a problem in our
use case.
when git stores the copies of the files it does a sha1 hash of the file
contents (plus a short header) and then stores the file in the directory
.git/objects/<first two hex digits of the hash>/<remaining 38 hex digits>
this means that if you have files that have the same content they all fold
together, but with lots of files changing rapidly the result is a lot of
files in these object directories.
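A minimal sketch of that layout, in Python rather than git's own C code. The function names are made up for illustration. It assumes the usual loose-object scheme, where the sha1 is taken over a "blob <size>\0" header followed by the file contents:

```python
import hashlib


def blob_object_id(content: bytes) -> str:
    """Return the sha1 object id git would assign to a blob with this content."""
    # git hashes a small header ("blob <size in bytes>\0") followed by the data
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Return the loose-object path: first two hex digits are the directory."""
    return ".git/objects/%s/%s" % (object_id[:2], object_id[2:])


if __name__ == "__main__":
    oid = blob_object_id(b"")
    print(oid)
    print(loose_object_path(oid))
```

Two files with identical contents get the same id, so they end up as one object file. Any change to the contents gives a new id and therefore a new file.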
it would be a pretty minor change to git to have it use more directories
(in fact, there's another thread going on today where people are looking
at making this configurable, in that case to reduce the number of
directories)
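To put rough numbers on this, here is a back-of-the-envelope sketch. It takes the .5-1m files/day estimate above at face value and assumes nothing gets packed or pruned during the day. It also ignores tree and commit objects, which would only add to the count. The four-digit fan-out is purely hypothetical, not an existing git option:

```python
def files_per_directory(files_per_day: int, hex_digits: int) -> float:
    """Average new object files per fan-out directory per day.

    hex_digits is the number of leading hash digits used as the directory
    name: 2 gives the 256 directories git uses today.
    """
    directories = 16 ** hex_digits
    return files_per_day / directories


if __name__ == "__main__":
    for files_per_day in (500_000, 1_000_000):
        for hex_digits in (2, 4):  # 4 is a hypothetical wider fan-out
            avg = files_per_directory(files_per_day, hex_digits)
            print(files_per_day, 16 ** hex_digits, round(avg, 1))
```

With the current 256 directories that is roughly 2,000-4,000 new files per directory per day, already past the ~1,000 files/directory limit after a single day.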
the other storage format that git has is the pack file. it takes a bunch
of the objects, does some comparisons between them (to find duplicate bits
of files), and then stores the result (base files plus deltas to re-create
other files). the resulting compression is _extremely_ efficient, and it
collapses many file objects into one pack file (addressing the issues of
many files in one directory)
packing this may help (depending on how much the files change), but with
this many files the work of doing the packing would be expensive.
We can probably do that even if it takes several hours.
my concern is that spending time creating the pack files will mean that
you don't have time to insert the new files.
that being said, there may be other ways of dealing with this data rather
than putting it into files and then adding it to the git repository.
Git has a fast-import streaming format that is designed for programs to
use that are converting repositories from other SCM systems. if you can
tell more about what you are doing (how the data is being gathered, are
the files re-created for each commit, or are they being modified? if they
are being modified is it appending data, changing some data, or randomly
writing throughout the file? etc) there may be some other options
available.
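As a rough sketch of what feeding fast-import could look like: the Python below builds a stream with one blob per file and a single commit that references them. The function name, the committer identity, the branch name and the file names are all placeholders, and it assumes simple paths that need no quoting. It only builds the stream; it does not run git:

```python
def build_fast_import_stream(files, message, ref="refs/heads/master",
                             committer="Example <example@example.invalid>",
                             timestamp=0, parent=None):
    """Build a git fast-import stream: one blob per file, then one commit.

    files maps a path (str) to its contents (bytes). parent, if given, is
    emitted as a 'from' line so the commit is added on top of it.
    """
    out = []
    marks = {}
    for mark, (path, content) in enumerate(sorted(files.items()), start=1):
        # 'data <n>' is followed by exactly n raw bytes, then an optional LF
        out.append(b"blob\nmark :%d\ndata %d\n" % (mark, len(content)))
        out.append(content + b"\n")
        marks[path] = mark

    msg = message.encode("utf-8")
    out.append(b"commit %s\n" % ref.encode("utf-8"))
    out.append(b"mark :%d\n" % (len(files) + 1))
    out.append(b"committer %s %d +0000\n"
               % (committer.encode("utf-8"), timestamp))
    out.append(b"data %d\n" % len(msg))
    out.append(msg + b"\n")
    if parent is not None:
        out.append(b"from %s\n" % parent.encode("utf-8"))
    for path, mark in marks.items():
        # M <mode> <dataref> <path>: add or replace the file with that blob
        out.append(b"M 100644 :%d %s\n" % (mark, path.encode("utf-8")))
    out.append(b"\n")
    return b"".join(out)


if __name__ == "__main__":
    import sys
    stream = build_fast_import_stream({"data/a.txt": b"hi\n"}, "sample commit")
    sys.stdout.buffer.write(stream)
```

The output would be piped into "git fast-import" inside a repository. The data goes straight into a pack file this way, without ever being written out as individual files and without creating loose objects.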
at this point I don't know if git can work for you or not, but I'm pretty
sure nothing else will have a chance with your size.
David Lang