Re: Git should preserve modification times at least on request

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 21 Feb 2018 23:44:13 +0100

On Wed, Feb 21 2018, Peter Backes jotted:

> On Wed, Feb 21, 2018 at 10:33:05PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> This sounds like a sensible job for a git import tool, i.e. import a
>> target directory into git, and instead of 'git add'-ing the whole thing
>> it would look at the mtimes, sort files by mtime, then add them in order
>> and only commit those files that had the same mtime in the same commit
>> (or within some boundary).
>
> I think that this would be The Wrong Thing to do.

I'm merely pointing out that if you have the use-case Derek Fawcus
describes you can get per-file mtimes via something similar to the the
hook method Theodore Ts'o described today with a simple import tool with
no changes to git or its object format required.

To the extent that there's a convention for this in git that's the
convention, e.g. if you use github or gitlab they'll render the
modification time of a file in the tree view, and that time is the time
of the commit that last touched it: https://github.com/git/git

> The commit time is just that: The time the commit was done. The commit
> is an atomic group of changes to a number of files that hopefully bring
> the tree from one usable state into the next.
>
> The mtime, in contrast, tells us when a file was most recently modified.
>
> It may well be that main.c was most recently modified yesterday, and
> feature.c was modified this morning, and that only both changes taken
> together make sense as a commit, despite the long time in between.
>
> Even worse, it may be that feature A took a long time to implement, so
> we have huge gaps in between the mtimes, but feature B was quickly done
> after A was finished.

...

> Such an algorithm would probably split feature A
> incorrectly into several commits, and group the more recently changed
> files of feature A with those of feature B.

Right, but that's a trade-off you can pick at import time in this
hypothetical tar-to-commits tool, you could decide to do no merging and
suffer to signal loss.

> And if Feature A and Feature B were developed in parallel, things get
> completely messy.
>
>> The advantage of doing this via such a tool is that you could tweak it
>> to commit by any criteria you wanted, e.g. not mtime but ctime or even
>> atime.
>
> Maybe, but it would be rather useless to commit by ctime or atime. You
> do one grep -r and the atime is different. You do one chmod or chown
> and the ctime is different. Those timestamps are really only useful for
> very limited purposes.
>
> That ctime exists seems reasonable, since it's only ever updated when
> the inode is written anyway.
>
> atime, in contrast, was clearly one of the rather nonsensical
> innovations of UNIX: Do one write to the disk for each read from the
> disk. C'mon, really? It would have been a lot more reasonable to simply
> provide a generic way for tracing read() system calls instead; then
> userspace could decide what to do with that information and which of it
> is useful and should be kept and perhaps stored on disk. Now we have
> this ugly hack called relatime to deal with the problem.

Yes, that [ac]time example was a stretch. A better example would be
committing the file mode, or extended attributes, or "this is on a
different FS", or whatever other per-file/dir attribute we're not
currently capturing.

>> You'd get the same thing as you'd get if git's tree format would change
>> to include mtimes (which isn't going to happen), but with a lot more
>> flexibility.
>
> Well, from basic logic, I don't see how a decision not to implement a
> feature could possibly increase flexility. The opposite seems to be the
> case.

I'm not trying to argue the usefulness of this mtime-per-file thing in
theory, just providing Derek Fawcus with a suggestion for a viable
workaround.

What I meant by this offhand comment, and which you may or may not know
(and I see no references to it from skimming the thread) is that there's
simply no space in the tree objects to add *anything* without breaking
the object format and requiring a major upgrade, although the plan to
switch to a new hash function is relevant to this.

Even if we suppose that git was being implemented today I don't think
this would make any sense as a first-level feature.

Empirical evidence suggests that people use git on a massive scale
largely without caring about this, and the users who do have a
workaround.

If it were added as a first-level feature to git it would present a lot
of UX confusion. E.g. you run "git add" and it'll be showing the mtime
somehow, or you get a formatted patch over E-Mail and it doesn't only
include the commit time but also times for individual files.

The VC systems that had this feature in the past were centralized, so
they could (in theory anyway) ensure that timestamps were monotonically
increasing. This won't be the case with git, we have plenty of timestamp
drift in e.g. linux.git and other git repos.

So if these mtimes were used by default they'd interact badly with stuff
like "make" in those cases, because you might check out a modified
version with a timestamp in the past.

Or maybe I've misunderstood how that worked in CVS/SVN/Bitkeeper, but in
any case, I just wanted to point out a workaround (but then digressed
into critiquing the idea above...).