Re: Git Notes idea.

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Wed, 17 Dec 2008 00:48:02 +0100 (CET)

Hi,

On Tue, 16 Dec 2008, Govind Salinas wrote:

> I was thinking I would do my first implementation in pyrite and if I 
> find that it works well I will port it.

Given that there are a lot of building blocks in C already, I think it 
would be a waste of your time.

> I just read the proposal from Johannes, he seems to want to use a 
> similar layout.  However, I would like to modify my proposal slightly to 
> make it work better when a gc is run.  I would modify the tree to look 
> like this...
> 
> let 1234567890123456789012345678901234567890 be the
> id of the item that is annotated.
> 
> let abcdef7890123456789012345678901234567890 be the
> id of the note to be attached
> 
> root/
>      12/
>          34567890123456789012345678901234567890/
>              abcdef7890123456789012345678901234567890
> 
> This way all the notes are attached to a tree, so that gc won't
> think they are unreferenced objects.

In my proposal back then, your root/12/345... would be a blob, in Peff's, 
it would be a tree, and in both cases the blobs/trees would be referenced 
by refs/notes, so git gc would not kill them either.

The bigger issue is that commit objects can be gc'ed, and then their notes 
should be gc'ed, too.

And of course, there is the rebase issue (which I completely missed; I 
will read the mail Jeff referenced tomorrow).

Speaking about the blobs vs trees issue, I think it is no issue, as Peff 
and me already discussed: the notes could check if it is a tree or a 
blob, and handle both easily.

> Peff wrote:
>
> > It is easy, but it's slow because we have to do a linear search in the 
> > (potentially huge) notes tree. And that's what held up the initial 
> > implementation. I did a proof-of-concept a month or so ago that 
> > pre-seeded an in-memory hash using the tree contents and got pretty 
> > reasonable performance.
> 
> Perhaps I am missing something, how is it a linear search?

Yes, you are missing what I wrote in the original thread: tree objects 
must be read in a forward direction, one by one.

IIRC back then, Junio and/or Linus suggested that you could backward 
search with heuristics finding the beginning of a tree entry, and thus you 
could kind of bisect the tree to search for a specific tree entry (since 
tree objects have the contents sorted), but I presented a case where this 
breaks down at the GitTogether:

Tree entries consist of a mode (as 6-byte ASCII representing the octal 
value), then SPC, then a NUL-terminated path, and then a 20-byte SHA-1.  
(Just hexdump the output of "git cat-file tree HEAD:" in any repository to 
see it).

The only heuristic you could apply to find your way backward (or forward) 
to find the beginning of a tree entry would be to find the NUL character 
and verify that exactly 20 bytes after it, either the tree object ends, or 
there is a valid octal number followed by a SPC.

The only thing you would need for this heuristic to break down is a SHA-1 
which contains a \x00 (which is then mistaken for a string termination), 
and part of a filename that could be mistaken to be an octal number.

Take for example the SHA-1 of git-gui in v1.6.0.5, which has a NUL its 
14th byte, and just assume that the next tree entry has mode 100644 
and name "4040000 Some financial record.txt".

Then, the NUL could be mistaken for the end of the previous tree entry, 
and "040000 " as the mode for the next one, whose name would be assumed to 
be "Some financial record.txt".

Granted, if you find two NULs in 21 bytes, one of them must be the 
termination of the path, but which one?

Worse, even if you would find a method (complicated, and therefore 
necessarily fragile) to find the boundary of the tree entries reliably, 
you would _still_ have a linear time unpacking the darned tree object in 
the first place.

So you cannot do that for every commit you encounter and expect not to die 
of boredom in the process.

Peff's very cute idea was to decouple that process from the per-commit 
procedure, and basically make it a one-time cost (per Git call, and only 
when notes were asked for).

It will still be linear in the number of notes, but it would then be in a 
hashmap, with an expected linear cost per commit.

> Also, how large do you expect the list to be under reasonable 
> circumstances.

We did not intend Git to be used as a backup tool, did we?

One of the _worst_ design decisions is to limit yourself by expectations.

> >  Some discussion of the interaction of notes and rebase:
> >  http://thread.gmane.org/gmane.comp.version-control.git/100533
> 
> On rebasing, I have a couple thoughts.
>
> 1) I think it only really makes sense to make a public annotation to a 
>    commit that is public, and once a commit is public it should not be 
>    rebased.

Again, do not limit your design by your expectations.  People already talk 
about having cover letters for their patch series as notes, and Pasky 
seems to discuss tracking explicit renames with notes when he does not 
play Go, instead of maintaining repo.or.cz and git.or.cz.

> 2) We could also annotate the commit's tree instead of the commit.  
>    That would make it somewhat resistant to rebases, cherry-picks and 
>    amends.

To the contrary.  When I rebase, the tree _does_ change, otherwise I would 
have rebased onto something that had the same original tree as my 
rebase-base to begin with, which would make the rebase rather pointless.

> >  Some thoughts from me on naming issues: 
> >  http://article.gmane.org/gmane.comp.version-control.git/100402
> 
> On naming.  I strongly support a ref/notes/sha1/sha1 approach.

I think you meant refs/notes:<first byte in hex>/<rest of bytes>/<some 
arbitrary SHA-1>?

I am rather supporting refs/nots:<first byte in hex>/<rest of bytes> being 
either a blob, or a tree containing human readable tags, such as "bugfix" 
or "review" or some such.

> If having a type to the note is important, then perhaps the first line 
> of a note could be considered a type or a set of "tags".

That would be horrible!  Just to know if you need to unpack the blob, 
you'd have to unpack it!

Hth,
Dscho

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html