Re: What happened with git-note?

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Wed, 11 Jul 2007 02:17:05 +0100 (BST)

Hi,

[Johan, I Cc'ed you, since you were the person pushing for git-notes 
originally. Junio, I Cc'ed you, because you correctly pointed out the 
flaws in my approach.]

On Tue, 10 Jul 2007, Alberto Bertogli wrote:

> There was a very long thread about a proposed feature, git-note 
> (http://thread.gmane.org/gmane.comp.version-control.git/46770/focus=48540), 
> around May this year.
> 
> I was wondering what happened to it, because some issues were brought up
> but then everything seem to have stalled.
> 
> Is someone working on it? Is there a clear list of the pending issues
> somewhere?

Okay, I seem to be at fault for the stalling.  So I better speak up about 
what my proof-of-concept did:

It creates a new ref, "refs/notes/commits", which had a tree object in it. 
The tree entries were actually fan-out directories a la .git/objects/??/. 
So if you added a not to commit abcdef01234..., it created a "file" named 
refs/notes/commits:ab/cdef01234...

The advantages of this approach, as compared to an external list, or even 
SQLite database were, amongst others:

- you can use most of the existing framework (storage, pruning, etc.),

- there is an obvious way how to fetch the notes from someone else,

- there is an obvious way how to deal with conflicts when merging notes
  from someone else,

- it is conceptually trivial.

However, the biggest disadvantage was pointed out by Junio:

The tree objects under refs/notes/commits:??/ can get pretty large, in 
theory.  Suppose you have a rate of a couple of hundred commits a day, and 
you want to use the notes framework, amongst others, to be able to add 
Acked-by's all over the place.  Pretty soon we are in the hundreds of 
entries in the tree objects.

I thought that this is not a problem, since the names all have the same 
length.  Alas, Git is general purpose, so it does not, and cannot, assume 
a certain filename format.  Git might be considered a stupid content 
tracker, but it is not quite that stupid.

Since the tree objects just contain lists of tree entries, and each tree 
entry is variable length, you have to look through half of the entries on 
average, before you find the one you look for.

Possible solutions I can come up with:

There was a lot of talk, and quite a few good ideas, in the packv4 side 
branch, which was undertaken by Nico and Shawn.  The basic idea, IIUC, is 
to store objects differently, by having nice things like a filename table 
containing the most common names, sorted, which are then not only 
addressed much more efficiently (by a 16-bit index), but reverse indexing 
is also much more efficient, since you do not have to expand and compare 
the full filenames, but instead compare 16-bin numbers.

This format would help with the approach I presented above.

Another idea which just crossed my mind, is to add indexes which are 
generated locally, but not transmitted when fetching, much like the pack 
indices, which are generated from the pack while fetching it.

There are a couple of possible technical ways to do that efficiently, such 
as mmap()ed rbtrees, or hashtables.  Of course, Git's own history suggests 
to prefer hashtables over other data structures, when your keys are 
already pretty good hashes, which they are.  However, I have not thought 
it through, what with the issue of growing the data structure, or 
thrashing the I/O caches.

Of course, one does not have to go with the approach of storing the things 
as blobs.  There might well be a genius coming along, reading this mail, 
and proposing the uber-solution.  But dammit, I am pretty sure that we 
have a real good framework, and that a proper solution is just a stone's 
throw away.

Ciao,
Dscho

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html