Re: Stat cache in .git/index hinders syncing of repositories

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Mon, 20 Jan 2020 13:01:54 +0100 (CET)

Hi Christoph,

On Sat, 18 Jan 2020, Christoph Groth wrote:

> brian m. carlson wrote:
> > On 2020-01-18 at 19:06:21, Christoph Groth wrote:
> > > But if the above is not feasible for some reason, would it be
> > > possible to provide a switch for disabling stat caching
> > > optimization?
> >
> > Git is going to perform really terribly on repositories of any size if
> > you disable stat caching, so we're not very likely to implement such
> > a feature.  Even if we did implement it, you probably wouldn't want to
> > use it.
>
> OK, I see.  But please consider (one day) to split up the index file to
> separate the local stat cache from the globally valid data.

I am sure that this has been considered even before Git was publicly
announced, and I would hazard a guess that it was determined to be better
to keep all of Git's private data in one place.

Now, you are totally free to disagree, and even to work on a patch series
to separate the stat cache and offer a compelling argument why this change
should be made. If I were you, I would not expect any other person to be
interested in working on this.

> (By the way, even after 12 years of using Git intensely I am confused
> about what actually is the index.  I believed that it is the "staging
> area", like in "git-add - Add file contents to the index".  But then the
> .git/index file reflects all the tracked files, and not just staged
> ones.  This usage is also reflected by the command "git update-index".)

The concept of the Git index is slightly different from what is actually
stored inside `.git/index`. You should consider the latter to be an
implementation detail that is of concern only if you want to work on
internals. Otherwise the description of the index as a staging area is a
pretty good image.

The staging area contains, of course, more than just the files you
changed. It contains the entire tree that is staged in order to become the
next commit.

If you asked a worker at a theater to make a minor change to the stage,
you would not expect the staging area to be empty, either.

> > However, there are the core.checkStat and core.trustctime options
> > which can control which information is used in the stat caching.  You
> > can restrict it to the whole second part of mtime and the file size if
> > you want.  See git-config(1) for more details.
>
> Thanks a lot, that did the trick!  I’ve been already syncing mtimes.
> Setting both core.checkStat and core.trustctime to the "weak" values
> made the spurious modifications go away.

And of course now you have a less performant setup because files have a
much better chance of being "racily clean", i.e. their mtime could be
identical to that of the `.git/index` file, in which case Git has to
assume that the file might have changed, and the index has to be
refreshed.

Just saying that what you think of as a silver bullet comes at a price.
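
To make the trade-off concrete, here is a rough Python model of what the
"weak" settings change. It is a sketch, not Git's code: `CachedStat`,
`snapshot` and `looks_unchanged` are names made up for illustration, ctime
is left out as if core.trustctime were false, and the field list is
abbreviated from the description in git-config(1).

```python
import os
from typing import NamedTuple


class CachedStat(NamedTuple):
    """Hypothetical stand-in for the stat data cached per index entry."""
    mtime_sec: int   # whole-second part of mtime
    mtime_nsec: int  # sub-second part of mtime
    size: int
    ino: int
    uid: int
    gid: int


def snapshot(path: str) -> CachedStat:
    """Collect the fields of interest from the file system."""
    st = os.stat(path)
    return CachedStat(
        mtime_sec=st.st_mtime_ns // 1_000_000_000,
        mtime_nsec=st.st_mtime_ns % 1_000_000_000,
        size=st.st_size,
        ino=st.st_ino,
        uid=st.st_uid,
        gid=st.st_gid,
    )


def looks_unchanged(cached: CachedStat, current: CachedStat,
                    minimal: bool) -> bool:
    """Compare cached and current stat data.

    minimal=True models core.checkStat=minimal: only the whole-second
    mtime and the file size are compared. Otherwise every field in this
    (abbreviated) tuple has to match.
    """
    if minimal:
        return (cached.mtime_sec, cached.size) == (current.mtime_sec, current.size)
    return cached == current
```

A copy made by a sync tool typically ends up with a different inode (and
possibly different owner or sub-second mtime), so it fails the full
comparison but passes the minimal one as long as the whole-second mtime and
the size were preserved.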

> Still, this is a workaround, and the price is reduced robustness of file
> modification detection.

You misunderstand how Git detects whether a file is modified or not.

A file is re-hashed if its mtime is newer than, _or equal to_, the mtime
of `.git/index`.

So no, it is not the robustness that is the problem. It is no less robust.
The problem is that you force re-hashing where it would not be necessary
otherwise.
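
The rule above fits in a few lines of Python. This is only a model of the
rule as stated in this message, not Git's implementation; `must_rehash` and
its `granularity_ns` parameter are hypothetical and exist to show why
coarser timestamps lead to more re-hashing.

```python
def must_rehash(file_mtime_ns: int, index_mtime_ns: int,
                granularity_ns: int = 1) -> bool:
    """Return True if stat data alone cannot prove the file is unchanged.

    Both timestamps are truncated to `granularity_ns` before comparing.
    A file whose (truncated) mtime is newer than, or equal to, that of the
    index could have been modified after the index was written, so its
    contents have to be hashed again.
    """
    file_ticks = file_mtime_ns // granularity_ns
    index_ticks = index_mtime_ns // granularity_ns
    return file_ticks >= index_ticks


# File written at 10.2 s, index written at 10.9 s.
FILE_NS = 10_200_000_000
INDEX_NS = 10_900_000_000

full_resolution = must_rehash(FILE_NS, INDEX_NS)
whole_seconds = must_rehash(FILE_NS, INDEX_NS, granularity_ns=1_000_000_000)
```

With full resolution the file is clearly older than the index and can be
skipped; truncated to whole seconds the two timestamps tie, so the file has
to be treated as possibly modified.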

In general, I am not sure that you are using the right tool for
synchronizing. If you cannot guarantee that a snapshot of the directory is
copied, you will always run the risk of inconsistent data, which is worse
than not having a backup at all: at least without a backup you do not have
a false sense of security.

Ciao,
Johannes