Re: Finer timestamps and serialization in git

Jakub Narebski <jnareb@xxxxxxxxx> · Mon, 20 May 2019 01:15:47 +0200

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
> On Thu, May 16 2019, Eric S. Raymond wrote:
>> Derrick Stolee <stolee@xxxxxxxxx>:
>>> On 5/15/2019 3:16 PM, Eric S. Raymond wrote:
>>>> The deeper problem is that I want something from Git that I cannot
>>>> have with 1-second granularity. That is: a unique timestamp on each
>>>> commit in a repository.
>>>
>>> This is impossible in a distributed version control system like Git
>>> (where the commits are immutable). No matter your precision, there is
>>> a chance that two machines commit at the exact same moment on two different
>>> machines and then those commits are merged into the same branch.
>>
>> It's easy to work around that problem. Each git daemon has to single-thread
>> its handling of incoming commits at some level, because you need a lock on the
>> file system to guarantee consistent updates to it.

As far as I understand it, this would slow down receiving new commits
tremendously.  Currently great care is taken to avoid parsing commit
objects during fetch or push unless it is necessary (thanks to things
such as reachability bitmaps, see e.g. [1]).

With this restriction you would need to parse each incoming commit to
get at its timestamp and committer, check whether the
committer+timestamp pair is unique, and bump the timestamp if it is not.

Also, bumping the timestamp means that the commit changed, which means
that its contents-based ID changed, which means that all commits that
follow it need to be rewritten too (each one records its parent's
ID)...  And now you need to rewrite many commits.  You also break the
assumption that the same commit has the same contents (including date)
and the same ID in different repositories (some of which may include
additional branches, some of which may be part of a network of related
repositories, etc.).
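
To make the cascade concrete, here is a minimal sketch in Python.  It
hashes hand-built commit objects the way Git does (SHA-1 over a
"commit <size>\0" header plus the body).  The identity, messages and
timestamps are made up for illustration, and build_chain() is a
hypothetical helper, not anything Git provides:

```python
import hashlib

# Git's well-known ID for the empty tree; used so every commit has a valid tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(body: str) -> str:
    """Return the SHA-1 object ID Git would assign to this commit body."""
    data = body.encode("utf-8")
    header = b"commit %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def make_commit(tree, parent, timestamp, message):
    """Build the text of a commit object (hypothetical identity)."""
    ident = "A U Thor <author@example.com>"
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines.append(f"author {ident} {timestamp} +0000")
    lines.append(f"committer {ident} {timestamp} +0000")
    return "\n".join(lines) + "\n\n" + message + "\n"


def build_chain(timestamps):
    """Build a linear history, one commit per timestamp; return the IDs."""
    ids = []
    parent = None
    for i, ts in enumerate(timestamps):
        parent = commit_id(make_commit(EMPTY_TREE, parent, ts, f"commit {i}"))
        ids.append(parent)
    return ids


original = build_chain([1000, 1000, 1001, 1002])
bumped = build_chain([1000, 1001, 1001, 1002])  # second commit bumped by 1s

for old, new in zip(original, bumped):
    print(old[:12], new[:12], "same" if old == new else "CHANGED")
```

Only the second commit's date was touched, yet it and every commit
after it gets a new ID, because each commit embeds its parent's ID.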

[1]: https://github.blog/2015-09-22-counting-objects/
     http://githubengineering.com/counting-objects/

> You don't need a daemon now to write commits to a repository. You can
> just add stuff to the object store, and then later flip the SHA-1 on a
> reference, we lock those individual references, but this sort of thing
> would require a global write lock. This would introduce huge concurrency
> caveats that are non-issues now.
>
> Dumb clients matter. Now you can e.g. have two libgit2 processes writing
> to ref A and B respectively in the same repo, and they never have to
> know about each other or care about IPC.
>
> Also, even if you have daemons accepting pushes they can now be on
> different computers sharing things over e.g. an NFS filesystem. Now you
> need some FS-based serialization protocol for commits and their
> timestamps.

Also, performance matters.  Especially for large repositories, and for
a large number of repositories.

>> So if a commit comes in that would be the same as the date of the
>> previous commit on the current branch, you bump the incoming commit timestamp.

You do realize that dates may not be monotonic (because of imperfections
in clock synchronization), thus the fact that a commit's date differs
from its parent's does not mean that it differs from the date of every
ancestor.
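
A small Python sketch of that failure mode, with a hypothetical linear
history in which one commit was made on a machine whose clock ran
behind (names and timestamps are invented):

```python
# Linear history, oldest first.  "C" was committed with a lagging clock,
# so its date equals that of its grandparent "A".
history = [("A", 100), ("B", 105), ("C", 100), ("D", 101)]


def collides_with_parent(history):
    """Commits whose date equals the date of their direct parent."""
    return [
        name
        for (_, parent_ts), (name, ts) in zip(history, history[1:])
        if ts == parent_ts
    ]


def collides_with_ancestor(history):
    """Commits whose date equals the date of any earlier commit."""
    seen = set()
    collisions = []
    for name, ts in history:
        if ts in seen:
            collisions.append(name)
        seen.add(ts)
    return collisions


print(collides_with_parent(history))    # parent-only check
print(collides_with_ancestor(history))  # check against all ancestors
```

The parent-only check reports nothing, while the check against all
ancestors flags "C".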

>> That's the simple case. The complicated case is checking for date
>> collisions on *other* branches. But there are ways to make that fast,
>> too. There's a very obvious one involving a presort that is O(log2
>> n) in the number of commits.

I don't think the performance hit you would get would be acceptable.

[...]
>>>> Why do I want this? There are a number of reasons, all related to a
>>>> mathematical concept called "total ordering".  At present, commits in
>>>> a Git repository only have partial ordering.
>>>
>>> This is true of any directed acyclic graph. If you want a total ordering
>>> that is completely unambiguous, then you should think about maintaining
>>> a linear commit history by requiring rebasing instead of merging.
>>
>> Excuse me, but your premise is incorrect.  A git DAG isn't just "any" DAG.
>> The presence of timestamps makes a total ordering possible.
>>
>> (I was a theoretical mathematician in a former life. This is all very
>> familiar ground to me.)

Maybe in theory, when all clocks are synchronized.  But not in practice.
Shit happens.  Just recently Mike Hommey wrote about a case he has to
deal with [2]:

MH> I'm hitting another corner case in some other "weird" history, where
MH> I have 500k commits all with the same date.

[2]: https://public-inbox.org/git/20190518005412.n45pj5p2rrtm2bfj@xxxxxxxxxxxx/t/#u
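
As a toy Python illustration of why equal dates defeat a date-based
total ordering (commit names and timestamps are invented, and
order_by_date() is a hypothetical helper): two repositories that
enumerate the same commits in a different order end up with different
"date orderings".

```python
# Two of the three commits share a timestamp.
commits = [("c1", 1558137252), ("c2", 1558137252), ("c3", 1558137300)]


def order_by_date(commits):
    """Order commit names by timestamp only (Python's sort is stable)."""
    return [name for name, _ in sorted(commits, key=lambda c: c[1])]


# The same set of commits, enumerated in two different orders.
first = order_by_date(commits)
second = order_by_date(list(reversed(commits)))

print(first)
print(second)
print("orders agree:", first == second)
```

With ties the date alone does not say which of c1 and c2 comes first,
so the result depends on the order in which the commits were visited.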

--
Jakub Narębski