Re: Achieving efficient storage of weirdly structured repos

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Fri, 4 Apr 2008 23:24:45 -0400

Nicolas Pitre <nico@xxxxxxx> wrote:
> On Thu, 3 Apr 2008, Jakub Narebski wrote:
> 
> > One of bigger hindrances, as I understand it, in developing pack v4
> > was the fact that it didn't offer that much of improvement in typical
> > cases for the work needed... but perhaps "your" repository would be
> > good showcase for pack v4.
> 
> The biggest hindrance for pack v4 is actually the lack of a native 
> runtime tree walking, and having both tree object formats properly and 
> optimally abstracted has not been looked at yet.
> 
> Speed is the primary goal for pack v4.  The fact that it also provides a 
> 10% pack reduction is only consequential.  But without native tree 
> walking we must recreate the legacy tree format on the fly each time a 
> tree object is loaded which dwarfs any improvements pack v4 is aiming 
> for (yes it is still a little bit faster than pack v3 nevertheless, but 
> not yet significantly enough to overcome the incompatibility costs).

Even though we don't have native tree walking, I think the right
way to do this is to put in pack v4 with "canonical tree, canonical
commit" mode, where it inflates its native tree/commit encoding
into the canonical forms, then come back later with native walking.
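
To make "canonical mode" concrete, here is a rough sketch of the
expansion step.  Only the output format is real: a canonical tree
entry is "<octal mode> <name>\0" followed by 20 raw SHA-1 bytes.
Everything on the input side (struct v4_entry, the name and SHA-1
tables, v4_to_canonical) is made up for illustration and is not the
actual pack v4 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SHA1_LEN 20

/* Hypothetical v4-style tree entry: table indexes instead of inline data.
 * This is NOT the real pack v4 layout, just an illustration. */
struct v4_entry {
	unsigned mode;     /* e.g. 0100644 for a file, 040000 for a tree */
	unsigned name_id;  /* index into an assumed pack-wide name table */
	unsigned sha1_id;  /* index into an assumed pack-wide SHA-1 table */
};

/*
 * Expand n entries into the canonical tree format, where each entry is
 * "<octal mode> <name>\0" followed by the 20 raw SHA-1 bytes.
 * Returns the number of bytes written, or 0 if out is too small.
 */
size_t v4_to_canonical(const struct v4_entry *e, size_t n,
		       const char *const *names,
		       const unsigned char *sha1s, /* SHA1_LEN bytes per id */
		       unsigned char *out, size_t cap)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		const char *name = names[e[i].name_id];
		size_t nlen = strlen(name) + 1; /* include the NUL */
		char mode[16];
		int mlen = snprintf(mode, sizeof(mode), "%o ", e[i].mode);

		assert(mlen > 0 && (size_t)mlen < sizeof(mode));
		if (used + (size_t)mlen + nlen + SHA1_LEN > cap)
			return 0;
		memcpy(out + used, mode, (size_t)mlen);
		used += (size_t)mlen;
		memcpy(out + used, name, nlen);
		used += nlen;
		memcpy(out + used, sha1s + (size_t)e[i].sha1_id * SHA1_LEN,
		       SHA1_LEN);
		used += SHA1_LEN;
	}
	return used;
}
```

The caller hands the resulting buffer to the existing tree parser
unchanged, which is the whole point: nothing above this layer has to
know the object came out of a different encoding.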

Canonical mode is still faster than pack v2 inflate is for these
types, so it does (slightly) boost rev-list performance.  It might
chop a solid 30% off the CPU time jgit spends in its equivalent of
revision.c, and that's without teaching jgit to use the native pack
v4 encoding directly.

Once we have it in we can experiment with the necessary abstractions
to handle the two available encodings, and allow higher level code
to switch back and forth between them as objects come from loose
or pack v2, and from pack v4.  One of the things we
wanted to do was boost path limiter performance by matching on tree
name ids when walking a pack v4 native tree, but fall back to the
string based memcmp when walking a canonical tree.  That won't be
easy to design without the two different encodings being available
at the lower level in sha1_file.c.
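
A rough sketch of the kind of dual-mode match I have in mind, under
loud assumptions: struct tree_entry, struct path_limit, resolve_limit
and entry_matches are all hypothetical names, and the idea that a
pack carries a name table you can resolve a path component against
is the only thing taken from the pack v4 design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum tree_encoding { TREE_CANONICAL, TREE_NATIVE_V4 };

/* Hypothetical abstracted tree entry handed up by the lower layer. */
struct tree_entry {
	enum tree_encoding enc;
	const char *name;  /* valid when enc == TREE_CANONICAL */
	size_t name_len;
	unsigned name_id;  /* valid when enc == TREE_NATIVE_V4 */
};

/* One path limiter component, resolved once against a pack's names. */
struct path_limit {
	const char *name;
	size_t name_len;
	int has_id;        /* nonzero if the name exists in the name table */
	unsigned name_id;
};

/* Look the name up once (linear scan here, purely for illustration). */
struct path_limit resolve_limit(const char *name,
				const char *const *table, size_t n)
{
	struct path_limit p = { name, strlen(name), 0, 0 };

	for (size_t i = 0; i < n; i++) {
		if (!strcmp(table[i], name)) {
			p.has_id = 1;
			p.name_id = (unsigned)i;
			break;
		}
	}
	return p;
}

/* Integer compare for native entries, memcmp for canonical ones. */
int entry_matches(const struct tree_entry *e, const struct path_limit *p)
{
	if (e->enc == TREE_NATIVE_V4)
		return p->has_id && e->name_id == p->name_id;
	return e->name_len == p->name_len &&
	       !memcmp(e->name, p->name, p->name_len);
}
```

If the name is not in the pack's table at all, no native tree in
that pack can match, so the limiter can reject those entries without
looking at a single string.  Canonical trees (loose objects, pack v2)
still go through the memcmp path.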

Just my rapidly declining .02 bush peso.

> Nicolas (who wishes he was still a student with plenty of hacking time)

Don't we all.  :-)

-- 
Shawn.