Re: [PATCH] Add --no-reuse-delta option to git-gc

Steven Grimm <koreth@xxxxxxxxxxxxx> wrote:
> On that note, has any thought been given to looking at other compression 
> algorithms? Gzip is a great high-speed compressor, but there are others 
> out there (some a bit slower, some much slower at both compression and 
> decompression) that produce substantially smaller output.

It's been discussed once before on the list, quite recently, but
not in much depth.  As Junio pointed out, I don't think there was
ever really any discussion of whether gzip is the best way to deflate
the objects.  I think gzip was chosen simply because it was readily
available in libz, stable, and has a pretty decent speed/size tradeoff.
 
> I think it'd be kind of neat to have my .git directory shrink by another 
> 20+%. That's conservative; on maximumcompression.com's test of a mix of 
> different file types including images, gzip compresses 64% and the 
> best-scoring one does 80%. On English text gzip does 71% and the top 
> scorer does 89%. Most of the top-tier compressors are proprietary, but 

Yes.  But in many cases we might actually be able to do even better
by going with a pack-wide dictionary.  Why?

Think about source code structure.  E.g.

  $ git grep --cached 'struct object'| cut -d: -f1|wc -l
     402

So 402 files in git.git use the term 'struct object', and that's just
the current revision I had in my index.  With our current packfile
organization we are likely to store this string at least 402 times.
We'll store it once in each file's delta chain, assuming each
file's blobs largely fall into a single delta chain for that file
(reasonable assumption, but certainly not always true).

That's just one string, and it appears fairly frequently in any
file it's used in.  Now try 'unsigned char' (it's in 944 files, with
an even higher frequency per file).
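
To make the dictionary idea concrete, here is a minimal sketch using
zlib's preset-dictionary support through Python's zlib module.  It is
not the pack v4 format and not something git does today; the
dictionary contents, the sample blobs, and the deflate()/inflate()
helper names are all made up for illustration.  The point is only
that strings shared across many small objects can be stored once,
outside the individually deflated objects.

```python
import zlib

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate data, optionally priming zlib with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inflate blob; the same dictionary used to deflate must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical pack-wide dictionary: strings expected to recur in many blobs.
SHARED_DICT = b"unsigned char *sha1struct object *obj"

# Toy stand-ins for blobs that each start their own delta chain.
blobs = [
    b"struct object *obj = parse_object(sha1);\n",
    b"static int show(struct object *obj, unsigned char *sha1)\n",
    b"unsigned char *sha1 = lookup(struct object *obj);\n",
]

plain_total = sum(len(deflate(b)) for b in blobs)
dict_total = sum(len(deflate(b, SHARED_DICT)) for b in blobs)

print("without dictionary:", plain_total, "bytes")
print("with shared dictionary:", dict_total, "bytes")
```

The dictionary itself has to be stored somewhere once per pack, so
this only pays off when the shared strings recur across enough
objects, which is what the counts above suggest for source code.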

So anyway, for the past year I've been thinking about trying to
implement a blob-level dictionary prototype to see if it helps on a
project like linux-2.6.git, but I haven't gotten to it.  The pack v4
work was about applying that basic dictionary principle to trees
and commits, and I think it pays off nicely there.  Just need to
get it cleaned up, rebased onto current master, and submitted to
the list for wider testing.  ;-)

-- 
Shawn.