Re: Packfile can't be mapped

Linus Torvalds <torvalds@xxxxxxxx> wrote:
> 
> On Mon, 28 Aug 2006, Nicolas Pitre wrote:
> > 
> > Good job indeed.  Oh and you probably should not bother trying to 
> > deltify commit objects at all since that would be a waste of time.
> 
> It might not necessarily always be a waste of time. Especially if you have 
> multiple branches tracking a "maintenance" branch, you often end up having 
> the same commit message repeated several times in "unrelated" commits 
> (they're really the same commit, applied to another branch).
> 
> Also, I could imagine that some automated system generates very verbose 
> (and possibly very regular) commit messages, so under certain 
> circumstances it may well make sense to see if the commits might delta 
> against each other.
> 
> But I'll agree that in normal use it's not likely to be a huge saving, 
> though. It's probably not worth doing for the fast importer unless it just 
> happens to fall out of the code very easily.

Does git-pack-objects attempt to delta commits against each other?


I've been thinking about applying a pack-local but zlib-stream
global dictionary.  If we added three global dictionaries to the
front of the pack file, one for commits, one for trees and one
for blobs, and used those as the preset dictionaries for the zlib
streams stored within that pack, we could probably get good space
savings for trees and commits.

I'd suspect that for many projects the commit global dictionary
would contain the common required strings such as:

  'tree ', 'parent ', 'committer ', 'author ', 'Signed-off-by: '

plus the top author/committer name/email combination strings.
For GIT I'd expect 'Junio C Hamano <junkio@xxxxxxx>' to be way up
there in terms of frequency within commit objects.  Finding the most
common authors and committer strings would be trivial, as would
finding the most common 'footer' strings such as 'Signed-off-by: '
and 'Acked-by: '.

I think the same is true of trees, with '100644 ', '100755 ', '40000 '
being way up there, but also file names that commonly appear within
trees, e.g. "Makefile.in\0".

Blobs would be more difficult to generate a reasonable global
dictionary for.  But for some projects a crude estimated dictionary
can shave off at least 4% of pack size (true in both GIT and Mozilla
sources it seems).
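
A minimal sketch of the per-type dictionary idea, using the preset
dictionary support that zlib already has (exposed in Python as the
zdict argument).  The dictionary contents, the sample commit, the
"A U Thor" identity and the helper names are all assumptions made up
for illustration; nothing here is an existing pack format feature.
zlib recommends putting the most common strings at the end of the
dictionary, so the sketch orders them that way.

```python
import zlib

# Hypothetical commit dictionary. zlib favours strings near the END of a
# preset dictionary, so the most frequent strings are placed last.
COMMIT_DICT = (
    b"Acked-by: "
    b"Signed-off-by: "
    b"A U Thor <author@example.com>"   # placeholder for the top author
    b"parent "
    b"committer "
    b"author "
    b"tree "
)

# Hypothetical commit object body (object IDs are dummy hex strings).
SAMPLE_COMMIT = (
    b"tree " + b"a" * 40 + b"\n"
    b"parent " + b"b" * 40 + b"\n"
    b"author A U Thor <author@example.com> 1156790000 -0400\n"
    b"committer A U Thor <author@example.com> 1156790000 -0400\n"
    b"\n"
    b"Fix a bug\n"
    b"\n"
    b"Signed-off-by: A U Thor <author@example.com>\n"
)


def deflate_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress one object using a preset dictionary shared by the pack."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def inflate_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Decompress an object; the SAME dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


if __name__ == "__main__":
    plain = zlib.compress(SAMPLE_COMMIT, 9)
    packed = deflate_with_dict(SAMPLE_COMMIT, COMMIT_DICT)
    print("without dictionary:", len(plain), "bytes")
    print("with dictionary:   ", len(packed), "bytes")
```

Note that a stream written this way cannot be inflated without the
dictionary it was written against, so the dictionary has to travel
with any copy of the stream.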


Of course the major problem with pack-local, stream-global
dictionaries is that they void the ability to reuse that zlib'd
content from that pack in another pack without wholesale copying the
dictionary as well.  This is an issue for servers which want to
copy out the pack entry without recompressing it but also want the
storage savings from the global dictionaries.

But then again, if we just delta against a commit which uses the
same author and committer, or against a different version of the
same tree, then there should be a lot of delta copying from the base...
which easily allows entry reuse and should provide similar space
savings, provided the delta depth is deep enough (or the delta graph
is wide enough) to minimize the number of base objects containing
repeated occurrences of the common strings.
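
To illustrate the "delta copying from the base" point, here is a rough
sketch that expresses one commit as copy/insert operations against
another.  It uses Python's difflib to find the matching runs, which is
an assumption for illustration only and is not the delta algorithm or
encoding that git-pack-objects uses; make_delta and apply_delta are
hypothetical helper names.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as ('copy', offset, length) / ('insert', data) ops."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base cost only an offset and a length.
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:
            # 'replace' or 'insert': new bytes must be stored literally.
            ops.append(("insert", target[j1:j2]))
        # 'delete' contributes nothing to the target, so it is skipped.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base object and the delta ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    ident = b"A U Thor <author@example.com>"  # placeholder identity
    base = b"author " + ident + b" 1156790000 -0400\n\nFix a bug\n"
    target = b"author " + ident + b" 1156793600 -0400\n\nFix another bug\n"
    delta = make_delta(base, target)
    copied = sum(op[2] for op in delta if op[0] == "copy")
    print("bytes copied from base:", copied, "of", len(target))
```

The header keywords and the author string come out as copy operations,
so they are stored once in the base rather than once per commit.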

-- 
Shawn.