Re: Calculating tree nodes

Andreas Ericsson <ae@xxxxxx> · Tue, 04 Sep 2007 16:41:14 +0200

Jon Smirl wrote:
On 9/4/07, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
Andreas Ericsson <ae@xxxxxx> wrote:
Jon Smirl wrote:
On 9/4/07, David Tweed <david.tweed@xxxxxxxxx> wrote:
On 9/4/07, Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
Git has picked up the hierarchical storage scheme since it was built
on a hierarchical file system.
...
One of the nice things about tree nodes is that for doing a diff
between versions you can, to overwhelming probability, decide
equality/inequality of two arbitrarily deep and complicated subtrees
by comparing 40 characters, regardless of how remote and convoluted
their common ancestry. With delta chains don't you end up having to
trace back to a common "entry" in the history? (Of course, I don't
know how packs affect this - presumably there's some delta chasing to
get to the bare objects as well.)
While it is a 40 character compare, how many disk accesses were needed
to get those two SHAs into memory?
One more than there would have been to read only the commit, and one more
per level of recursion, assuming you never ever pack your repository.

If you *do* pack it, the tree(s) needed to compare are likely already
inside the sliding packfile window. In that case, there are no extra
disk accesses.
Even better, let's do some back-of-the-napkin math on the Linux
kernel tree.  My local (out of date but close enough) copy has
22,730 files in the tip revision.  Values shown are uncompressed
and compressed (gzip -9 | wc -c), and exclude deltification.

                 Current Scheme       Jon's Flat Scheme
                 -----------------    -----------------
commit raw       932                  932 + 22,730*20 = 455,532
(compressed)     521                  456,338

root tree raw    876                  0
(compressed)     805                  0

This is not a fair comparison. The current scheme is effectively
diffed against the previous version. You aren't showing an equivalent
diff for the flat scheme. Both schemes are dealing with the same
22,000 SHAs.

How, with your scheme, would you solve

	git diff -M master pu

in the git repo?

You'd have to build both trees completely, utilizing the last known
complete tree-listing (the root commit, since you propose to do away
with trees altogether) and then applying diffs on top of that to
finally generate an in-memory tree-structure in which you will have
to compare every single file against every single other file to find
out which code has been moved/copied/renamed/whatever.

That's (n*(n+1))/2 operations for file-level diffs alone. For the
kernel's 22,730 files, you're looking at 258,337,815 file comparisons
without the tree objects.
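
As a quick sanity check of that figure, here is a minimal sketch that
evaluates the formula for the file count quoted earlier in the thread.
The function name is made up for illustration only.

```python
def pairwise_comparisons(n: int) -> int:
    """Evaluate the n*(n+1)/2 estimate for n files."""
    return n * (n + 1) // 2


KERNEL_FILES = 22730  # file count quoted earlier in the thread

print(pairwise_comparisons(KERNEL_FILES))  # 258337815
```

Counting only distinct pairs, n*(n-1)/2, gives 258,315,085, so the
order of magnitude is the same either way.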

Sure, you can probably shave away quite a few of those comparisons
at the expense of computing the tree-hashes on the fly, but in that
case, why get rid of them in the first place?
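
To illustrate how tree hashes cut down the work, here is a rough sketch
of a hash-per-directory comparison. It is not Git's actual object
format or diff code: the serialization is simplified, and Node, build
and changed_paths are hypothetical names chosen for this example.

```python
import hashlib


class Node:
    """A blob (children is None) or a tree (children maps name -> Node)."""

    def __init__(self, digest, children=None):
        self.digest = digest
        self.children = children


def build(entry):
    """Hash a nested dict of bytes (files) and dicts (directories)."""
    if isinstance(entry, bytes):
        return Node(hashlib.sha1(b"blob\0" + entry).hexdigest())
    children = {name: build(sub) for name, sub in entry.items()}
    h = hashlib.sha1(b"tree\0")  # simplified, NOT Git's real tree encoding
    for name in sorted(children):
        h.update(name.encode() + b"\0" + children[name].digest.encode() + b"\n")
    return Node(h.hexdigest(), children)


def kids(node):
    """Return the children dict for a tree, None for a blob or missing node."""
    return node.children if node is not None else None


def changed_paths(old, new, path=""):
    """Yield paths that differ, skipping subtrees whose digests match."""
    if old is not None and new is not None and old.digest == new.digest:
        return  # one 40-character compare covers the whole subtree
    old_kids, new_kids = kids(old), kids(new)
    if old_kids is None and new_kids is None:
        yield path  # blob added, removed, or modified
        return
    if (old is not None and old_kids is None) or (new is not None and new_kids is None):
        yield path  # a blob on one side, a tree on the other
    for name in sorted(set(old_kids or {}) | set(new_kids or {})):
        child = f"{path}/{name}" if path else name
        yield from changed_paths((old_kids or {}).get(name),
                                 (new_kids or {}).get(name), child)


old = build({"Makefile": b"all:\n",
             "drivers": {"a.c": b"int a;\n", "b.c": b"int b;\n"},
             "fs": {"x.c": b"int x;\n"}})
new = build({"Makefile": b"all:\n",
             "drivers": {"a.c": b"int a = 1;\n", "b.c": b"int b;\n"},
             "fs": {"x.c": b"int x;\n"}})

print(list(changed_paths(old, new)))  # ['drivers/a.c']
```

In this toy example the unchanged "fs" directory is dismissed with a
single digest comparison, which is the property the stored tree objects
give you without recomputing anything.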

The size win is from diffing, not compressing.

It was declared in May 2006 by someone insightful that disk space
and bandwidth are cheap, while human time is priceless.

IOW, size wins had better be proportionally huge to justify slowing
git down and thereby taking more of the users' time than necessary.

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231