Re: Finer timestamps and serialization in git

"Eric S. Raymond" <esr@xxxxxxxxxxx> · Mon, 20 May 2019 10:14:17 -0400

Jakub Narebski <jnareb@xxxxxxxxx>:
> > What "commits that follow it?" By hypothesis, the incoming commit's
> > timestamp is bumped (if it's bumped) when it's first added to a branch
> > or branches, before there are following commits in the DAG.
> 
> Errr... the main problem is with distributed nature of Git, i.e. when
> two repositories create different commits with the same
> committer+timestamp value.  You receive commits on fetch or push, and
> you receive many commits at once.
> 
> Say you have two repositories, and the history looks like this:
> 
>  repo A:   1<---2<---a<---x<---c<---d      <- master
> 
>  repo B:   1<---2<---X<---3<---4           <- master
> 
> When you push from repo A to repo B, or fetch in repo B from repo A you
> would get the following DAG of revisions
> 
>  repo B:   1<---2<---X<---3<---4           <- master
>                  \
>                   \--a<---x<---c<---d      <- repo_A/master
> 
> Now let's assume that commits X and x have the came committer and the
> same fractional timestamp, while being different commits.  Then you
> would need to bump timestamp of 'x', changing the commit.  This means
> that 'c' needs to be rewritten too, and 'd' also:
> 
>  repo B:   1<---2<---X<---3<---4           <- master
>                  \
>                   \--a<---x'<--c'<--d'     <- repo_A/master

Of course that's true.  But you were talking as though all those commits
have to be modified *after they're in the DAG*, and that's not the case.
If any timestamp has to be modified, it only has to happen *once*, at the
time its commit enters the repo.

Actually, in the normal case only x would need to be modified. The only
way c would need to be modified is if bumping x's timestamp caused an
actual collision with c's.

I don't see any conceptual problem with this.  You appear to me to be
confusing two issues.  Yes, bumping timestamps would mean that all
hashes downstream in the Merkle tree would be generated differently,
even when there's no timestamp collision, but so what?  The hash of a
commit isn't portable to begin with - it can't be, because AFAIK
there's no guarantee that the ancestry parts of the DAG in two
repositories where copies of it live contain all the same commits and
topo relationships.

> And now for the final nail in the coffing of the Bazaar-esque idea of
> changing commits on arrival.  Say that repository A created new commits,
> and pushed them to B.  You would need to rewrite all future commits from
> this repository too, and you would always fetch all commits starting
> from the first "bumped"

I don't see how the second clause of your last sentence follows from the
first unless commit hashes really are supposed to be portable across
repositories.  And I don't see how that can be so given that 'git am'
exists and a branch can thus be rooted at a different place after
it is transported and integrated.

> Hash of a commit depend in hashes of its parents (Merkle tree). That is
> why signing a commit (or a tag pointing to the commit) signs a whole
> history of a commit.

That's what I thought.

> > How do they know they're not writing to the same ref?  What keeps
> > *that* operation atomic?
> 
> Because different refs are stored in different files (at least for
> "live" refs that are stores in loose ref format).  The lock is taken on
> ref (to update ref and its reflog in sync), there is no need to take
> global lock on all refs.

OK, that makes sense.

> For cookie to be unique among all forks / clones of the same repository
> you need either centralized naming server, or for the cookie to be based
> on contents of the commit (i.e. be a hash function).

I don't need uniquess across all forks, only uniqueness *within the repo*.

I want this for two reasons: (1) so that action stamps are unique, (2)
so that there is a unique canonical ordering of commits in a fast export
stream.

(Without that second property there are surgical cases I can't
regression-test.)

> >                                                          For my use cases
> > that cookie should *not* be a hash, because hashes always break N years
> > down.  It should be an eternally stable product of the commit metadata.
> 
> Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid
> having a flag day, and providing full interoperability between
> repositories and Git installations using the old hash ad using new
> hash^1.  This will be done internally by using SHA-1 <--> SHA-256
> mapping.  So after the transition all you need is to publish this
> mapping somewhere, be it with Internet Archive or Software Heritage.
> Problem solved.

I don't see it.  How does this prevent old clients from barfing on new
repositories?

> P.S. Could you explain to me how one can use action stamp, e.g.
> <esr@xxxxxxxxxxx!2019-05-15T20:01:15.473209800Z>, to quickly find the
> commit it refers to?  With SHA-1 id you have either filesystem pathname
> or the index file for pack to find it _fast_.

For the purposes that make action stamps important I don't really care
about performance much (though there are fairly obvious ways to
achieve it).  My goal is to ensure that revision histories (e.g. in
their import-stream format) are forward-portable to future VCSes
without requiring any data outside the stream itself.

Please remember that I'm accustomed to maintaining infrastructure on
decadal timescales - I wrote code in the 1980s that is still in wide use
and I expect some of the code I'm writing now to be still in use thirty
years from now.

This gives me a different perspective on the fragility of things like
SHA-1 hashes.  From a decadal-scale POV any particular crypto-hash
format is unstable garbage, and having them in change comments is a
maintainability disaster waiting to happen.

Action stamps are specifically designed so that they're pointers to commits
that don't require anything but the target commit's import/export-stream
metadata to resolve.  Your idea of an archived hash registry makes me
extremely nervous; I think it's too fragile to trust.

So let me back up a step.  I will cheerfully drop advocating bumping
timestamps if anyone can tell me how a different way to define a per-commit
reference cookie that (a) is unique within its repo, and (b) only requires
metadata visible in the fast-export representation of the commit.
-- 
		<a href="http://www.catb.org/~esr/";>Eric S. Raymond</a>