On Wed, May 15 2019, Eric S. Raymond wrote:

> The recent increase in vulnerability in SHA-1 means, I hope, that you
> are planning for the day when git needs to change to something like
> an elliptic-curve hash. This means you're going to have a major
> format break. Such is life.

Note that most users of Git (with the default build options) won't be
vulnerable to the latest attack (or SHAttered), see
https://public-inbox.org/git/875zqbx5yz.fsf@xxxxxxxxxxxxxxxxxxx/T/#u

But yes, the plan is to move to SHA-256. See
https://github.com/git/git/blob/next/Documentation/technical/hash-function-transition.txt

> Since this is going to have to happen anyway

The SHA-1 <-> SHA-256 transition is planned to happen, but there are
some strong opinions that this should be *only* for munging the
content for hashing, not adding new stuff while we're at it (even if
optional). See:
https://public-inbox.org/git/87ftyyedqd.fsf@xxxxxxxxxxxxxxxxxxx/

> let me request two
> functional changes in git. Neither will be at all difficult, but the
> first one is also a thing that cannot be done without a format break,
> which is why I have not suggested them before. They come from lots of
> (often painful) experience with repository conversions via
> reposurgeon.
>
> 1. Finer granularity on commit timestamps.

If you want milli/micro/nanosecond timestamps for commit objects, or
whatever other new info, that doesn't need to break the commit header
format. You can put key-values in the commit message and read them
back out via git-interpret-trailers.

Or even put it in the header itself, e.g.:

    author <name> <epoch> <tz>
    committer <name> <epoch> <tz>
    x-author-ns <nanosecond part of author>
    x-committer-ns <nanosecond part of committer>

Of course nobody would understand that new thing from day one, but
that's nothing compared to breaking the existing header format.
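For the trailer approach something like this should already work
today. Untested sketch, and the "Commit-Time-NS" trailer name is just
something I made up for illustration, not anything git knows about:

    # append a made-up trailer to a commit message; this only prints
    # the result, an importer (or "git commit --amend") would actually
    # store it
    $ git log -1 --format=%B |
          git interpret-trailers --trailer "Commit-Time-NS: 123456789"

    # and later, pull the trailers back out of a commit's message
    $ git log -1 --format=%B | git interpret-trailers --parse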
> 2. Timestamps unique per repository
>
> The coarse resolution of git timestamps, and the lack of uniqueness,
> are at the bottom of several problems that are persistently irritating
> when I do repository conversions and surgery.
>
> The most obvious issue, though a relatively superficial one, is that I have
> to throw away information whenever I convert a repository from a system with
> finer-grained time. Notably this is the case with Subversion, which keeps
> time to milliseconds. This is probably the only respect in which its data
> model remains superior to git's. :-)

Should be solved by putting it in the commit as noted above, just not
in the very narrow part of the object that's reserved and not going to
change.

More generally, plenty of *->git importers write some extra data in
the commits, usually in the commit message. Try e.g. cloning a SVN
repo with "git svn clone" and see what it does.

> The deeper problem is that I want something from Git that I cannot
> have with 1-second granularity. That is: a unique timestamp on each
> commit in a repository. The only way to be certain of this is for git
> to delay accepting integration of a patch until it can issue a unique
> time mark for it - obviously impractical if the quantum is one second,
> but not if it's a millisecond or microsecond.
>
> Why do I want this? There are a number of reasons, all related to a
> mathematical concept called "total ordering". At present, commits in
> a Git repository only have partial ordering. One consequence is that
> action stamps - the committer/date pairs I use as VCS-independent commit
> identifications in reposurgeon - are not unique. When a patch sequence
> is applied, it can easily happen fast enough to give several successive
> commits the same committer-ID and timestamp.
>
> Of course the commit hash remains a unique commit ID. But it can't
> easily be parsed and followed by a human, which is a UX problem when
> it's used as a commit stamp in change comments.

You cannot get a guaranteed "total order" of any sort in anything like
git's current object model without taking a global lock on all write
operations. Otherwise, how would two concurrent ref updates / object
writes be guaranteed not to get the same timestamp? Unlikely with
nanosecond accuracy, but not impossible.

Even if you solve that, take two such repositories and "git merge
--allow-unrelated-histories" them together. Now what's the order?

These issues are solved by defining ordering in terms of the graph,
and writing this information after-the-fact. That's already part of
git. See
https://github.com/git/git/blob/next/Documentation/technical/commit-graph.txt
and
https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/
(a rough example of generating one is at the end of this mail).

> More deeply, the lack of total ordering means that repository graphs
> don't have a single canonical serialized form. This sounds abstract
> but it means there are surgical operations I can't regression-test
> properly. My colleague Edward Cree has found cases where git fast-export
> can issue a stream dump for which git fast-import won't necessarily
> re-color certain interior nodes the same way when it's read back in,
> and I'm pretty sure the absence of total ordering on the branch tips
> is at the bottom of that.

Can you clarify what you mean by this? You run fast-import twice and
get different results, is that it? If so, that sounds like a bug.

> I'm willing to write patches if this direction is accepted. I've figured
> out how to make fast-import streams upward-compatible with finer-grained
> timestamps.
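(Back to the commit-graph mentioned above: here's roughly how you'd
generate and enable one today. Untested sketch, but the commands,
config key and file path below are what current git uses:)

    # write a commit-graph file covering all reachable commits
    $ git commit-graph write --reachable

    # it lands under the object store
    $ ls .git/objects/info/commit-graph
    .git/objects/info/commit-graph

    # commit walks will use it once this is turned on
    $ git config core.commitGraph true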