Re: Finer timestamps and serialization in git

Jakub Narebski <jnareb@xxxxxxxxx> · Tue, 21 May 2019 02:08:25 +0200

"Eric S. Raymond" <esr@xxxxxxxxxxx> writes:
> Jakub Narebski <jnareb@xxxxxxxxx>:

>>> What "commits that follow it?" By hypothesis, the incoming commit's
>>> timestamp is bumped (if it's bumped) when it's first added to a branch
>>> or branches, before there are following commits in the DAG.
>> 
>> Errr... the main problem is with the distributed nature of Git, i.e. when
>> two repositories create different commits with the same
>> committer+timestamp value.  You receive commits on fetch or push, and
>> you receive many commits at once.
>> 
>> Say you have two repositories, and the history looks like this:
>> 
>>  repo A:   1<---2<---a<---x<---c<---d      <- master
>> 
>>  repo B:   1<---2<---X<---3<---4           <- master
>> 
>> When you push from repo A to repo B, or fetch in repo B from repo A you
>> would get the following DAG of revisions
>> 
>>  repo B:   1<---2<---X<---3<---4           <- master
>>                  \
>>                   \--a<---x<---c<---d      <- repo_A/master
>> 
>> Now let's assume that commits X and x have the same committer and the
>> same fractional timestamp, while being different commits.  Then you
>> would need to bump timestamp of 'x', changing the commit.  This means
>> that 'c' needs to be rewritten too, and 'd' also:
>> 
>>  repo B:   1<---2<---X<---3<---4           <- master
>>                  \
>>                   \--a<---x'<--c'<--d'     <- repo_A/master
>
> Of course that's true.  But you were talking as though all those commits
> have to be modified *after they're in the DAG*, and that's not the case.
> If any timestamp has to be modified, it only has to happen *once*, at the
> time its commit enters the repo.

At the time commit 'x' was created in repo A, there was no need to bump
its timestamp.  The same goes for commit 'X' in repo B (well, unless
there is a central serialization server, which would not fly).  It is
only after a push from repo A to repo B that we have two commits, 'x'
and 'X', with the same timestamp.
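
To make the cascade concrete, here is a minimal sketch in Python.  It
assumes the usual Git commit object layout (tree, parent, author,
committer, blank line, message) hashed with SHA-1; the identity string,
the integer timestamps and the all-zero stand-in for commit '2' are
made up for illustration.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # Git's empty tree id

def commit_id(tree, parents, committer, timestamp, message):
    """Return the SHA-1 id of a commit object built from these fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    ident = f"{committer} {timestamp} +0000"
    lines += [f"author {ident}", f"committer {ident}", "", message, ""]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()

def build_chain(base, stamps):
    """Build a linear history on top of `base`; return the commit ids in order."""
    ids, parent = [], base
    for name, ts in stamps:
        parent = commit_id(EMPTY_TREE, [parent], "A U Thor <a@example.com>", ts, name)
        ids.append(parent)
    return ids

base = "0" * 40  # hypothetical stand-in for the id of commit '2'
original = build_chain(base, [("a", 100), ("x", 200), ("c", 300), ("d", 400)])
# Bump only the timestamp of 'x' by one unit.
bumped = build_chain(base, [("a", 100), ("x", 201), ("c", 300), ("d", 400)])

changed = [old != new for old, new in zip(original, bumped)]
print(changed)  # [False, True, True, True]
```

Commit 'a' keeps its id, but 'x', 'c' and 'd' all get new ones, because
each commit records the id of its parent.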

> Actually, in the normal case only x would need to be modified. The only
> way c would need to be modified is if bumping x's timestamp caused an
> actual collision with c's.
>
> I don't see any conceptual problem with this.  You appear to me to be
> confusing two issues.  Yes, bumping timestamps would mean that all
> hashes downstream in the Merkle tree would be generated differently,
> even when there's no timestamp collision, but so what?  The hash of a
> commit isn't portable to begin with - it can't be, because AFAIK
> there's no guarantee that the ancestry parts of the DAG in two
> repositories where copies of it live contain all the same commits and
> topo relationships.

Errr... how did you get that the hash of a commit is not portable???
Same contents means same hash, i.e. same object identifier.  Two
repositories can have part of their history in common (for example
different forks of the same repository, like the different "trees" of
the Linux kernel), sharing part of the DAG.  Same commits, same topo
relationships.  That's how _distributed_ version control works.
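
A small sketch of "same contents means same hash", assuming the
standard Git object id rule (SHA-1 over "<type> <length>\0" followed by
the payload).  The two dictionaries stand in for two independent
repositories, and the payloads are arbitrary placeholders, not real
commits.

```python
import hashlib

def git_object_id(obj_type, payload):
    """SHA-1 object id: hash of '<type> <len>\\0' followed by the payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

def store(repo, obj_type, payload):
    """Store an object in a dict-based toy repository and return its id."""
    oid = git_object_id(obj_type, payload)
    repo[oid] = (obj_type, payload)
    return oid

repo_a, repo_b = {}, {}

# Both repositories independently store the same two objects ...
for payload in (b"history 1\n", b"history 2\n"):
    store(repo_a, "blob", payload)
    store(repo_b, "blob", payload)

# ... and then diverge.
store(repo_a, "blob", b"only in A\n")
store(repo_b, "blob", b"only in B\n")

# The shared part is found by comparing ids alone, with no coordination.
shared = set(repo_a) & set(repo_b)
print(len(shared))  # 2
```

Neither side had to ask the other for a name: the id follows from the
contents, so the common part of the history is recognized for free.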

[I think we may have been talking past each other.]

>> And now for the final nail in the coffin of the Bazaar-esque idea of
>> changing commits on arrival.  Say that repository A created new commits,
>> and pushed them to B.  You would need to rewrite all future commits from
>> this repository too, and you would always fetch all commits starting
>> from the first "bumped"
>
> I don't see how the second clause of your last sentence follows from the
> first unless commit hashes really are supposed to be portable across
> repositories.  And I don't see how that can be so given that 'git am'
> exists and a branch can thus be rooted at a different place after
> it is transported and integrated.

'git rebase', 'git rebase --interactive' and 'git am' create different
commits; that is why their result is called "history rewriting" (it
actually creates an altered copy, with the old pre-copy, pre-change
version eventually garbage-collected).  Anyway, the recommended practice
is to not rewrite published history (where somebody could have
bookmarked it).

Note also that this copy preserves author date, not committer date; also
commits can be deleted, split and merged during "rewrite".

Fetch and push do not use 'git am', and they preserve commits and their
identities.  That is how they can be effective and performant.

>> The hash of a commit depends on the hashes of its parents (Merkle
>> tree).  That is why signing a commit (or a tag pointing to the commit)
>> signs the whole history of that commit.
>
> That's what I thought.

[...]
>> For cookie to be unique among all forks / clones of the same repository
>> you need either centralized naming server, or for the cookie to be based
>> on contents of the commit (i.e. be a hash function).
>
> I don't need uniqueness across all forks, only uniqueness *within the repo*.

Err, what?  So the proposed "action stamp" identifier is even more
useless?  If you can't use <esr@xxxxxxxxxxx!2019-05-15T20:01:15.473209800Z>
to uniquely name a revision, so that every person who has that commit
knows which commit it is, what's the use?

Is "action stamp" meant to be some local identifier, like Mercurial's
Subversion-like revision number, good only for the local repository?

> I want this for two reasons: (1) so that action stamps are unique, (2)
> so that there is a unique canonical ordering of commits in a fast export
> stream.
>
> (Without that second property there are surgical cases I can't
> regression-test.)

You can always use the object identifier (hash) for tiebreaking in the
second use case.
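
A minimal sketch of such a tiebreak, in Python.  The Commit record and
its field names are hypothetical, and a real fast-export stream would
also have to emit parents before children; this only shows how commits
that tie on committer and timestamp still get a stable order.

```python
from typing import NamedTuple

class Commit(NamedTuple):
    oid: str        # object identifier (hash), hex string
    committer: str
    timestamp: int  # committer date, seconds since the epoch

def canonical_order(commits):
    """Order by timestamp, then committer, then object id as the final tiebreak."""
    return sorted(commits, key=lambda c: (c.timestamp, c.committer, c.oid))

# Two different commits with the same committer and the same timestamp.
x_upper = Commit("b" * 40, "A U Thor <a@example.com>", 1557950475)
x_lower = Commit("a" * 40, "A U Thor <a@example.com>", 1557950475)

# The result does not depend on the order in which the commits arrive.
first = canonical_order([x_upper, x_lower])
second = canonical_order([x_lower, x_upper])
print(first == second)  # True
```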

>>>                                                          For my use cases
>>> that cookie should *not* be a hash, because hashes always break N years
>>> down.  It should be an eternally stable product of the commit metadata.
>> 
>> Well, the idea for the SHA-1 <--> NewHash == SHA-256 transition is to
>> avoid having a flag day, and to provide full interoperability between
>> repositories and Git installations using the old hash and using the new
>> hash^1.  This will be done internally by using SHA-1 <--> SHA-256
>> mapping.  So after the transition all you need is to publish this
>> mapping somewhere, be it with Internet Archive or Software Heritage.
>> Problem solved.
>
> I don't see it.  How does this prevent old clients from barfing on new
> repositories?

The SHA-1 <--> SHA-256 interoperation is at the client-server level; one
can use an old Git that uses SHA-1 with a repository that uses SHA-256,
and vice versa.
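
As a rough illustration of the mapping idea only (this is not how Git
implements it, and the class and attribute names are made up), a
bidirectional table can be built by hashing the same object under both
functions.  The sketch handles blobs only; trees and commits embed the
ids of other objects, so those embedded ids would have to be translated
as well before hashing.

```python
import hashlib

def object_ids(obj_type, payload):
    """Return (sha1_id, sha256_id) for the same object contents."""
    data = f"{obj_type} {len(payload)}".encode() + b"\0" + payload
    return hashlib.sha1(data).hexdigest(), hashlib.sha256(data).hexdigest()

class ObjectIdMap:
    """Toy bidirectional SHA-1 <--> SHA-256 table (hypothetical helper)."""

    def __init__(self):
        self.sha1_to_sha256 = {}
        self.sha256_to_sha1 = {}

    def record(self, obj_type, payload):
        old, new = object_ids(obj_type, payload)
        self.sha1_to_sha256[old] = new
        self.sha256_to_sha1[new] = old
        return old, new

mapping = ObjectIdMap()
old_id, new_id = mapping.record("blob", b"hello\n")

# Either name can be resolved to the other.
assert mapping.sha1_to_sha256[old_id] == new_id
assert mapping.sha256_to_sha1[new_id] == old_id
```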

>> P.S. Could you explain to me how one can use action stamp, e.g.
>> <esr@xxxxxxxxxxx!2019-05-15T20:01:15.473209800Z>, to quickly find the
>> commit it refers to?  With SHA-1 id you have either filesystem pathname
>> or the index file for pack to find it _fast_.
>
> For the purposes that make action stamps important I don't really care
> about performance much (though there are fairly obvious ways to
> achieve it).

What ways?

>              My goal is to ensure that revision histories (e.g. in
> their import-stream format) are forward-portable to future VCSes
> without requiring any data outside the stream itself.

In Git you can store the "action stamp" in an extra (extension) header
of the commit object (as was already proposed in this thread).
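
A sketch of what that could look like, in Python.  The header name
"action-stamp" is hypothetical (nothing in Git defines it), the identity
and stamp values are placeholders, and the parser ignores continuation
lines and repeated headers for brevity.  The index at the end is one
possible way to get from a stamp back to a commit id.

```python
import hashlib

def make_commit(tree, parents, ident, message, extra_headers=()):
    """Build the raw body of a commit object, with optional extra headers."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents]
    lines += [f"author {ident}", f"committer {ident}"]
    lines += [f"{key} {value}" for key, value in extra_headers]
    lines += ["", message, ""]
    return "\n".join(lines).encode()

def object_id(body):
    """SHA-1 id of a commit object with this body."""
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()

def parse_headers(body):
    """Return the header fields that precede the first blank line."""
    head = body.decode().split("\n\n", 1)[0]
    headers = {}
    for line in head.split("\n"):
        if line.startswith(" "):  # continuation line, skipped in this sketch
            continue
        key, _, value = line.partition(" ")
        headers.setdefault(key, value)  # keep the first value per key
    return headers

stamp = "user@example.com!2019-05-15T20:01:15.473209800Z"  # placeholder stamp
body = make_commit(
    tree="4b825dc642cb6eb9a060e54bf8d69288fbee4904",  # Git's empty tree id
    parents=[],
    ident="A U Thor <user@example.com> 1557950475 +0000",
    message="example commit",
    extra_headers=[("action-stamp", stamp)],  # hypothetical header name
)

# Index from stamp to commit id, built by scanning commit headers once.
index = {parse_headers(body)["action-stamp"]: object_id(body)}
print(index[stamp] == object_id(body))  # True
```

Note that the stamp becomes part of the hashed contents, so it travels
with the commit on fetch and push like any other header.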

Best,
--
Jakub Narębski