Re: Storing additional information in commit headers

martin f krafft <madduck@xxxxxxxxxxx> · Tue, 2 Aug 2011 10:28:10 +0200

also sprach Jeff King <peff@xxxxxxxx> [2011.08.02.0550 +0200]:
> >   2. It is immutable. Ideally, I would like to store extra
> >      information for a ref in ref/heads/*, but there seems to be no
> >      way of doing this. Hence, I need to store it in commits and
> >      backtrack for it. Or so I think, at least…
> 
> Wait, so you want metadata on a _ref_, not on a commit? That is a very
> different thing, I think. We usually accomplish that with data in
> .git/config. Or if you need to push data between repos, or if it's too
> big to easily fit in the config, then put it in a blob and keep a
> parallel ref structure (e.g., refs/topgit/bases/refs/heads/master).
> 
> Or maybe I'm just misunderstanding.

You nailed it perfectly well. Thank you for taking the time again to
reply to me.

TopGit does what you suggest (a parallel ref structure), but there
are three problems with this, which I am trying to address:

  1. you need to ensure that these refs are pushed and fetched,
     which requires set up and possible migration issues when things
     change, and can cause big problems for contributors who just so
     happened to forget.

  2. the additional refs confuse people a lot — and I can attest to
     that because I have also at times found myself overwhelmed by
     them when staring at gitk.

  3. once a ref updates, we need to keep a pointer to the previous
     location, since one of the goals is the ability to be able to
     return to a point in history (e.g. for security updates to
     a stable package, or backports). Additional refs enhance the
     aforementioned two problems.

Therefore I thought it would be sensible to store these data in
commit. When the data change, there will always be a new commit to
store these data, and we do *not* want to update the data in
previous commits. Finding the data then becomes backtracking the
branch history until a commit is found containing them.

> >   In addition to the standard tree and parent pointers, there could
> >   be *-ref and x-*-ref headers, which take a single ref argument,
> >   presumably to a blob containing more data.
>
> I'm not sure how well-defined that is, though. What does the ref
> mean? What does it point to, and what is the meaning with respect
> to the original commit? Or are you suggesting that "*" would be
> "topgit-base" here, and that git core would understand only that
> any header matching the pattern "x-*-ref" should be followed with
> respect to reachability/pruning. Only the owner of the "*" part
> (topgit in this case) would be able to make sense of the meaning
> of the ref.

Exactly the latter. Sorry for my unannounced use of wildcards in
this context.

> If that is the case, that does make sense to me. It's basically an
> immutable version of a note.
>
> However, implementing such a thing would mean you have an awkward
> transition period where some versions of git think the referenced
> object is relevant, and others do not. That's something we can
> overcome, but it's going to require code in git, and possibly
> a dormant introduction period.

Indeed. This could be adressed by letting a tool like TopGit require
a minimum version of Git. For a while, this will burden developers,
but ensure that it works. Over time, this will cease to be
a problem.

> I suspect you would give git people more warm fuzzies about
> implementing this by showing a system that is built on git-notes
> and saying "this works really well, except that the external note
> storage is not a good reason because { it's mutable, it's not
> efficient, whatever other reason you find}". And then we know that
> the system is proven to work, and that migrating the note-like
> structure into the object is sensible.
>
> But I get the impression you're one step back from that now. So it makes
> sense to me to at least prototype it via git-notes, which will give you
> the same semantic storage (a mapping of commits to some blobs, with
> reachability handled automatically).

I appreciate how you are developing your reasoning, and the advice
you give.

Indeed, I am already prototyping using git-notes, and I designed the
datastore to be extensible, so that I can use other ways to find the
data.

Using pseudo-headers is another (temporary) way to prove the concept
works, but I am afraid that it will become standard too quickly
(because it's so easy), essentially preventing progress into x-*-ref
domain, or forcing us to carry compatibility with us forever.

What do you think about using the idea of orphan parent commits
(OPC) for now? These are conceptually closest to the x-*-ref
pointers, do not require extra setup, pollute history only a little
bit (IMHO), and slot in with Git and fsck/gc alright.

Here's the idea again, graphically:

  o--o--o--●
       /
      #

while at HEAD, I would backtrack history until I found HEAD^, which
has a parent with a well-defined commit message and holding the data
I am looking for.

Later, when x-*-ref is mainline, instead of parent pointers, it can
be used in place.

When there is a merge and the TopGit data need updating, a new
OPC is slotted into place, on the merge commit. In
the following graph, the user then decided also at a later point to
update e.g. the TopGit patch description (.topmsg), which is also
stored in this OPC:

       o--o-o
      /      \      maint       master
  o--o--o--o--+--o--O--o--o--o--●
       /     /           /
      #     #           #

To keep things simple, every OPC copies the unchanged data from the
previous one as well (compression will reduce the overhead).

Later, I can use the maint branch just in the same way I could use
master when it was at that age.

> > > Otherwise, the usual recommendation is to use a pseudo-header
> > > within the body of the commit message (i.e., "Topgit-Base: ..." at
> > > the end of the commit message). The upside is that it's easy to
> > > create, manipulate, and examine using existing git tools. The
> > > downside is that it is something that the user is more likely to
> > > see in "git log" or when editing a rebased commit message.
> > 
> > … to see *and to accidentally mess up*. And while that may even be
> > unlikely, it does expose information that really ought to be hidden.
> 
> I'm not quite sure what the information is, so I can't really judge. Do
> you have a concrete example?
> 
> I got the impression earlier you were wanting to store a human-readable
> text string.  That makes a pseudo-header a reasonable choice. But if you
> are going to reference some blob (which it seems from what you wrote
> above), and you are interested in proper reachability analysis, then no,
> it probably isn't a good idea.

I am not yet sure what information needs storing. Right now, I am
keeping five fields:

  Depend-Refs         A list of the most recent branch points from
                      dependency branchs, so that a tool can tell
                      when the dependent branch needs an update
                      (commits following those refs that are not
                      reachable by the branch head).
  Base-Ref            The ref to the most recent merge of all
                      dependencies, used to create diffs.
  Patch-Branch        boolean to suggest whether this branch is
                      designed to develop a single patch for
                      submission or use in a quilt series.
  Patch-Message       Patch description (think git-send-email).
  Integration-Branch  boolean to suggest whether instead this branch
                      is a branch designed to collect features.

At the moment, I do now know which of those are necessary, and which
I am missing.

The flexibility of being able to store as much as I want, in
whatever format I want, without having to fear overloading the
commit message or burdening the user, is what makes me want to use
refs to blobs.

> I think extensibility is welcome. It's just that most discussions
> so far have ended up realizing that a new header would just be
> cruft. Maybe yours is different. I'm still not 100% sure
> I understand what you want to accomplish, but the idea of an
> x-*-ref header is a reasonable thing for git to have.

I think there are two questions:

  1. would x-*-ref be a suitable idea for Git core?

     I think the answer is yes, as (I think) it's well-defined and
     I cannot see any problems with it, really.

  2. can we prevent abuse?

     No, never. But just like you cannot abuse X-* headers in the
     RFC822 format due to their design, x-*-ref abuse would only
     affect those who chose it.

Thank you,

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"the question of whether computers can think
 is like the question of whether submarines can swim."
                                                 -- edsgar w. dijkstra

spamtraps: madduck.bogus@xxxxxxxxxxx
Attachment:
digital_signature_gpg.asc

Description: Digital signature (see http://martin-krafft.net/gpg/sig-policy/999bbcc4/current)