Re: Stat cache in .git/index hinders syncing of repositories

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Tue, 21 Jan 2020 02:53:11 +0000

On 2020-01-20 at 23:53:22, Christoph Groth wrote:
> Johannes Schindelin wrote:
> >
> > On Sat, 18 Jan 2020, Christoph Groth wrote:
> >
> > > OK, I see.  But please consider (one day) to split up the index file
> > > to separate the local stat cache from the globally valid data.
> >
> > I am sure that this has been considered even before Git was publicly
> > announced,
> 
> I would be very interested to hear the rationale for keeping the
> information about what is staged and the stat cache together in the same
> file.  I, or someone else, might actually work on a patch one day, but
> before starting, it would be good to understand the reasoning behind the
> current design.
> 
> > and I would wager a guess that it was determined that it would be
> > better to keep all of Git's private data in one place.
> 
> My point is that it’s not just private data: When I excluded .git/index
> from synchronization, staging files for a commit was no longer
> synchronized.

To try to answer this question: Git stores all of its state about the
working tree in the index.  Bare repositories don't typically have an
index because they don't have a working tree.  Whether that state is
staged contents or stat information, all of it is in one file.
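
To make the distinction concrete, here is a rough sketch in Python of
what one index entry carries.  The Entry class and its field names are
hypothetical and much simpler than Git's real on-disk format; the point
is only that the staged blob ID and the cached stat data sit in the
same record, and that the stat data is what stops matching once the
file is copied to another machine or path.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified index entry (not Git's real layout)."""
    path: str
    blob_id: str    # what is staged; meaningful on any machine
    mtime_ns: int   # stat cache; meaningful only for this copy
    size: int
    inode: int


def blob_id(data: bytes) -> str:
    # Git hashes a blob as "blob <size>\0<content>" (SHA-1 by default).
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def make_entry(path: str) -> Entry:
    # Record both the content hash and the stat data for one file.
    st = os.stat(path)
    with open(path, "rb") as f:
        data = f.read()
    return Entry(path, blob_id(data), st.st_mtime_ns, st.st_size, st.st_ino)


def looks_unchanged(entry: Entry) -> bool:
    # Cheap check: if the stat data matches, skip re-reading the file.
    st = os.stat(entry.path)
    current = (st.st_mtime_ns, st.st_size, st.st_ino)
    return current == (entry.mtime_ns, entry.size, entry.inode)
```

A synced copy of the file has the same blob ID but usually a different
inode and mtime, so looks_unchanged() returns False for it and the
content has to be read and hashed again.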

Storing all of this data in one file means that only one file need be
mapped into memory and rewritten.  Git writes to the index by atomically
creating a lock file alongside it, writing the new contents into the
lock file, and then doing an atomic rename over the original.  This
approach wouldn't be possible with multiple files, because an update
spanning several files couldn't be made atomic as a whole.
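
The lock-then-rename pattern can be sketched in a few lines of Python.
This is an illustration of the general technique, not Git's actual
code; the function name write_index_atomically is made up, and only
the "<file>.lock" naming mirrors what Git does with .git/index.lock.

```python
import os


def write_index_atomically(index_path: str, new_contents: bytes) -> None:
    """Replace index_path with new_contents via a lock file and rename."""
    lock_path = index_path + ".lock"
    # O_EXCL makes creation fail if another writer already holds the lock.
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_contents)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old file or the new one, never a mix.
        os.replace(lock_path, index_path)
    except BaseException:
        # Give up our own lock without touching the existing index.
        try:
            os.unlink(lock_path)
        except FileNotFoundError:
            pass
        raise
```

If a second writer calls this while the lock file exists, os.open()
raises FileExistsError and the existing index is left alone.  With two
files there would be two renames, and a reader could observe the state
between them.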

There is support for a split index mode which means that the main index
need not be rewritten as often, which is helpful when making small
updates to large trees, where the cost of rewriting the index is
significant.  I don't know how locking is handled there[0], but I assume
that it is, because the people who implemented and reviewed it are
capable and thoughtful.
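
As a conceptual sketch only: the idea behind a split index is a large,
rarely rewritten base plus a small overlay of recent changes.  The
function below models that with plain dictionaries mapping paths to
object IDs.  The names effective_index, shared, replaced and deleted
are hypothetical, and Git's real format differs in its details.

```python
from typing import Dict, Set


def effective_index(shared: Dict[str, str],
                    replaced: Dict[str, str],
                    deleted: Set[str]) -> Dict[str, str]:
    """Combine a large shared base with a small per-update overlay."""
    # Start from the base, dropping paths the overlay marks as removed.
    merged = {p: oid for p, oid in shared.items() if p not in deleted}
    # Changed or newly added paths come from the overlay.
    merged.update(replaced)
    return merged
```

A small update then only has to rewrite the overlay, which is where the
savings on large trees would come from.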

Having said that, nobody has provided a compelling case for
using multiple files for storing different types of working tree state.
The existing options are available for cases like yours and others', and
they work.  Since there are clear benefits to the current model,
including simplicity and robustness, and few downsides, nobody has
decided to change it.

I should add that even if, for some reason, we did add support for
splitting this data out, I'm not sure if we'd support syncing only part
of the repository state and blowing away other state.  We don't really
support that now (other than through tools like fetch and clone) and I
don't think we'd want to encourage that behavior in the future.

[0] And I don't have the interest to look into it at the present moment.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204