Re: pack v4 status

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Tue, 27 Feb 2007 22:45:55 -0500

Junio C Hamano <junkio@xxxxxxx> wrote:
> Nicolas Pitre <nico@xxxxxxx> writes:
> > The idea is to deal with only tree objects containing the 64K most 
> > frequently used base names and fall back to the current tree object 
> > encoding for objects that couldn't be represented that way.
> 
> Ah, I was wondering the same thing as Linus after seeing Shawn
> talk about the 2-byte prefix on #git. Falling back to an
> alternate encoding for rarer cases makes sense.

Right.  Git is already fast and already compresses object data
very well.  But I think we can make things faster without violating
the basic assumption of "whole project history", and it turns out
that the same encodings also make the data smaller for the common
case of human-maintained source code.  That is of course one of
the primary uses for Git, though obviously not the only one.

In the worst case scenario we'll be doing exactly what we are
doing today with regard to encoding.  That performance and disk
space usage are already known and considered "very, very fast" and
"very small".  ;-)
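
To make the fallback concrete, here is a minimal sketch of the idea,
not the actual pack v4 layout.  The function names, the big-endian
2-byte index, and the "legacy" name + NUL layout are all assumptions
made for illustration: a tree whose basenames are all in the table
gets the compact form, anything else is encoded the way it would be
without the table.

```python
import struct

# A 2-byte index can address at most 65,536 distinct basenames.
MAX_TABLE = 1 << 16


def build_table(names_by_frequency):
    """Map the most frequent basenames to 16-bit indices.

    Hypothetical helper: the input is assumed to be sorted from most
    to least frequently used; anything past 64K entries is dropped.
    """
    return {name: i for i, name in enumerate(names_by_frequency[:MAX_TABLE])}


def encode_tree(entries, table):
    """Encode a list of (basename, payload_bytes) pairs.

    Returns ("indexed", data) when every basename is in the table,
    otherwise ("legacy", data) with each name spelled out in full.
    Both layouts are illustrative assumptions, not Git's formats.
    """
    if all(name in table for name, _ in entries):
        data = b"".join(
            struct.pack(">H", table[name]) + payload for name, payload in entries
        )
        return "indexed", data
    # Fallback: at least one basename is not among the frequent ones.
    data = b"".join(name.encode() + b"\0" + payload for name, payload in entries)
    return "legacy", data
```

The point of the sketch is only that the choice is made per tree
object, so a rare basename costs us nothing beyond what we pay today.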

In the best case scenario (human-managed source like linux.git,
git.git) we'll scream with pack v4.  In the rev-list stats I posted,
the tree encoding switch alone not only saved 3 MiB of disk space
but also improved total running time by 12.5%.  Nico and I know
we can still do better.

With 15k basenames in linux.git we're filling only 23.6% of the
available namespace within a single packfile.  I think that by the
time we have enough basenames to break 64K we'll be several years
out and be talking about historical packs vs. active packs.
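
For anyone who wants to check the namespace arithmetic, a small
sketch follows.  The helper names are mine, and it assumes "64K"
means 2^16 = 65,536 slots, i.e. what a 2-byte index can address.

```python
# Number of basenames a 2-byte index can address (assumed meaning of "64K").
NAMESPACE = 1 << 16


def fill_fraction(basenames):
    """Fraction of the 2-byte namespace used by this many basenames."""
    return basenames / NAMESPACE


def names_for_fraction(fraction):
    """Basename count that corresponds to a given fill fraction."""
    return round(fraction * NAMESPACE)


def headroom(basenames):
    """How many more distinct basenames fit before the table is full."""
    return NAMESPACE - basenames
```

By this arithmetic a round 15,000 names would be a bit under 23%,
and 23.6% corresponds to roughly 15,466 names, so the "15k" above
should be read as a rounded count.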

-- 
Shawn.