Re: [PATCHv5 00/14] git notes

Johan Herland <johan@xxxxxxxxxxx> · Wed, 09 Sep 2009 00:46:45 +0200

On Tuesday 08 September 2009, Johannes Schindelin wrote:
> On Tue, 8 Sep 2009, Johan Herland wrote:
> > On Tuesday 08 September 2009, Johannes Schindelin wrote:
> > > I can see that some people may think that date-based fan-out is the
> > > cat's ass, but I have to warn that we have no idea how notes will be
> > > used,
> >
> > I don't agree. Although we will certainly see many more use cases for
> > notes, I believe that the vast majority of them can be placed in one of
> > two categories:
> 
> My experience with Git is that having beliefs how my work is used was a
> constant source of surprise.

And you believe that a system that only allows SHA1-based fanout schemes is 
better equipped to tackle such surprises than a system that provides both 
date-based and SHA1-based fanout schemes?

> > > - I find the restriction to commits rather limiting.
> >
> > I see your point, but I don't agree until I see a compelling case for
> > annotating a non-commit.
> 
> My point is that it is too late by then, if you don't allow for a
>  flexible and still efficient scheme.

As I replied to Junio, we could use the epoch as a "commit date" for tree 
and blob objects, thus making them representable in a date-based fanout 
scheme (although if there are a signficant number of non-commit notes, the 
code should be smart enough to find a better (SHA1-based probably) fanout 
scheme for those notes).

> > > - most of the performance difference between the date-based and the
> > >   SHA-1 based fan-out looks to me as if the issue was the top-level
> > >   tree. Basically, this tree has to be read _every_ time _anybody_
> > >   wants to read a note.
> >
> > Not sure what you're trying to say here. The top-level notes tree is
> > read (as in fill_tree_descriptor()) exactly _once_. After that, it is
> > cached by the internal data structure (until free_commit_notes() or
> > end-of-process).
> 
> By that reasoning, we do not need any fan-out scheme.
> 
> Keep in mind: reading a large tree object takes a long time.  That's why
> we started fan-out.  Reading a large number of tree objects also takes a
> long time.  That's why I propagated flexible fan-out that is only read-in
> on demand.

Not sure where you're going with this. Of course we want to strike an 
optimal balance between the size of tree objects and the number of tree 
objects. Nobody is arguing about that. Both SHA1-based and (in the most 
common cases) date-based schemes can be used to achieve this balance. But 
using date-based fanout has the added advantage of providing better 
performance (both runtime- and memory-wise) when looking up notes in a 
chronological order.

> > > But I think that having a dynamic fan-out that can even put blobs
> > > into the top-level tree (nothing prevents us from doing that, right?)
> >
> > Well, the "flexible" code does add the new requirement that all entries
> > in a notes (sub)tree object must follow the same scheme, i.e. you
> > cannot have:
> >
> >   /12/34567890123456789012345678901234567890
> >   /2345/678901234567890123456789012345678901
> >
> > but you can have
> >
> >   /12/34567890123456789012345678901234567890
> >   /23/45/678901234567890123456789012345678901
> 
> Umm, why?  Is there any good technical reason?

In the date-based parts of notes tree, there are very good reasons for doing 
so: The code peeks at the first tree entry in order to determine what kind 
of date-based fanout is used in the current tree object. Subsequent entries 
(in that tree object) that do not follow the same format are 
skipped/ignored.

The SHA1-based fanout code has not changed since the last iteration, so this 
extra requirement is not absolutely necessary for the SHA1-based parts of 
the notes tree. However, the extra requirement does guarantee that commit 
notes have exactly one unique location in the notes tree, and thus relieves 
us of having to keep searching for alternative notes locations, and 
concatenate the notes found.

> > > The real question for me, therefore, is: what is the optimal way to
> > > strike the balance between size of the tree objects (which we want to
> > > be small, so that unpacking them is fast)  and depth of the fan-out
> > > (which we want to be shallow to avoid reading worst-case 39 tree
> > > objects to get at one single note).
> >
> > s/39/19/ (each fanout must use at least 2 chars of the 40-char SHA1)
> 
> That is another unnecessary restriction that could cost you dearly.  Just
> think what happens if it turns out that the optimal number of tree items
> is closer to 16 than to 255...

The code can easily be rewritten to allow for "odd" fanouts (1/39, 1/1/38, 
etc.). Feel free to submit a patch.

I was, however, naive enough to assume that when git.git decided on using 
2/38 fanout for its loose objects, then some performance-related thoughts
went into that decision. If there are indications that multiple-of-2-type 
fanouts are not optimal, we should probably reconsider.

Have fun! :)

...Johan

-- 
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html