Junio,

This message came to me at exactly the right time. Yesterday I was exploring using git as the content storage back-end for some binary files. Until now I've only used it for software projects.

I found the largest RCS file that we had in our current back-end. It contained twelve versions of a binary file, each averaging about 20 MB; the ,v file from RCS was about 250 MB. I ran some experiments on these binary files. First, gzip consistently compresses them to about 10% of their original size, so they are quite inflated. Second, xdelta produces a delta between two neighboring revisions of about 2.5 MB, which compresses down to about 2 MB (about the same size as the next revision compressed without deltification, so packing is ineffective here).

I added these twelve revisions to several version control back-ends, including Subversion and git. Git produced a much smaller repository than the others, simply due to the compression it applies to objects, and it was at least as fast as the others. The problem came when I tried to clone this repository: git-pack-objects chewed on these twelve revisions for over an hour before I finally interrupted it, and as far as I could tell it hadn't made much progress.

My other complaint is that git prune ran slowly (~8 seconds on my very fast machine with fast disk access) on a repository with only these twelve revisions in it (37 total objects in the object store). This is because 'git prune' actually ends up running fsck on all of the objects, which verifies the sha1 of each object. That seems like a lot of work just to prune unwanted objects. What would you say to a --fast option to git-prune that would avoid most of what fsck does, including verifying the sha1 of each object?

Anyway, that was a tangent. I looked into overriding the --depth option to git-pack-objects and setting it to 0. However, this isn't trivial: git-pack-objects is never called directly by the user, only through things like 'git clone', 'git push' and 'git repack'.

What do you think about this? Could we add a configuration option that could be set for the repository? Something smarter, like what you suggest, where git would pack small text files but give up on large binaries, would be optimal. I've already determined that packing a repository with this type of largish binary file doesn't do any good, but there doesn't seem to be a way to avoid packing when doing network operations.

Thoughts?

Carl
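P.S. In case anyone wants to poke at the same numbers, here is roughly the shape of what I ran. The file names are hypothetical, the delta step is shown with xdelta3's encoder syntax (adjust for whichever xdelta you have), and the repack line at the end is the manual, one-shot version of the "don't bother deltifying" behaviour I'd like to be able to configure per repository -- it assumes git-repack passes --window and --depth straight through to git-pack-objects.

    # Two neighboring ~20 MB revisions pulled out of the old RCS
    # back-end (hypothetical file names).
    ls -l rev11.bin rev12.bin

    # Plain gzip gets each revision to roughly 10% of its original size.
    gzip -c rev12.bin | wc -c

    # A binary delta between neighboring revisions is ~2.5 MB and
    # compresses to ~2 MB (xdelta3 encoder syntax shown here).
    xdelta3 -e -s rev11.bin rev12.bin rev.vcdiff
    wc -c rev.vcdiff
    gzip -c rev.vcdiff | wc -c

    # The "no deltas" repack I keep wanting to ask for via configuration;
    # assumes git-repack forwards --window/--depth to git-pack-objects.
    git repack -a -d --window=0 --depth=0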
On Fri, Feb 24, 2006 at 12:49:13AM -0800, Junio C Hamano wrote:
> Nicolas Pitre <nico@xxxxxxx> writes:
>
> > Indexing based on adler32 has a match precision based on the block size
> > (currently 16).  Lowering the block size would produce smaller deltas
> > but the indexing memory and computing cost increases significantly.
>
> Indeed.
>
> I had this patch in my personal tree for a while.  I was
> wondering why the progress indication during the "Deltifying"
> stage sometimes stops for literally several seconds, or more.
>
> In the Linux 2.6 repository, these object pairs take forever to
> delta.
>
>   blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f ->
>   blob dfc9cd58dc065d17030d875d3fea6e7862ede143
>   size (491102 -> 496045)
>   58 seconds
>
>   blob 4917ec509720a42846d513addc11cbd25e0e3c4f ->
>   blob dfc9cd58dc065d17030d875d3fea6e7862ede143
>   size (495831 -> 496045)
>   64 seconds
>
> Admittedly, these are *BAD* input samples (a binary firmware
> blob with many similar looking ", 0x" sequences).  I can see
> that trying really hard to reuse source material would take
> significant computation.
>
> However, this is simply unacceptable.
>
> The new algorithm takes 58 seconds to produce 136000 bytes of
> delta, while the old one takes 0.25 seconds to produce 248899 bytes
> (I am using the test-delta program in the git.git distribution).
> The compression ratio is significantly better, but this is unusable
> even for offline archival use (remember, pack delta selection needs
> to do window=10 such deltification trials to come up with the best
> delta, so you are spending 10 minutes to save 100k from one oddball
> blob), let alone for on-the-fly pack generation for network transfer.
>
> Maybe we would want the two implementations next to each other, and
> internally check whether one is taking too many cycles compared to
> the input size, then switch to the cheaper version?

--
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
   Carl Baldwin                        RADCAD (R&D CAD)
   Hewlett Packard Company
   MS 88                               work: 970 898-1523
   3404 E. Harmony Rd.                 work: Carl.N.Baldwin@xxxxxx
   Fort Collins, CO 80525              home: Carl@xxxxxxxxxxxxx
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -