Re: [PATCH 0/4] Optionally skip hashing index on write

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Thu, 8 Dec 2022 11:38:39 -0500

On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> Writing the index is a critical action that takes place in multiple Git
>> commands. The recent performance improvements available with the sparse
>> index show how often the I/O costs around the index can affect different Git
>> commands, although reading the index takes place more often than a write.
> 
> The sparse-index work is great in that it offers correctness while
> taking advantage of the knowledge of which part of the tree is
> quiescent and unused to boost performance.  I am not sure a change
> to reduce file safety can be compared with it, in that one is pure
> improvement, while the other is trade-off.

I agree that this is a trade-off, and we should be careful about
whether we make this a possibility at all for certain file formats.
The index is an interesting case for a couple of reasons:

1. Writes block users. Writing the index takes place in many
   user-blocking foreground operations, so the speed improvement
   directly reduces how long users wait. Other file formats are
   typically written in the background (commit-graph,
   multi-pack-index) or are super-critical to correctness
   (pack-files).

2. Index files are short-lived. It is rare that a user leaves an
   index for a long time with many staged changes. That's the condition
   that's required for losing an index file to cause a loss of work
   (or maybe I'm missing something). Outside of staged changes, the
   index can be completely destroyed and rewritten with minimal impact
   to the user.

> As long as we will keep the "create into a new file, write it fully
> and fsync + rename to the final" pattern, we do not need the trailing
> checksum to protect us from a truncated output due to index-writing
> process dying in the middle, so I do not mind that trade-off, though.
> 
> Protecting files from bit flipping filesystem corruption is a
> different matter.  Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I agree that we need to be careful about which files get this
treatment.

But I also want to point out that I'm not using hosting servers as
evidence that this has worked in practice, but instead many developer
machines in large monorepos that have had this enabled (via the
microsoft/git fork) for years. We've not come across an instance where
this loss of a trailing hash has been an issue.
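
To make the trade-off concrete, here is a rough sketch of the two
protections discussed above: the "write a new file, fsync, rename"
pattern and the optional trailing checksum. This is Python for brevity
and is not Git's code. The names write_file_atomically and
read_file_checked are hypothetical, and using an all-zero trailer to
mean "hash was skipped" is an assumption of the sketch.

```python
import hashlib
import os
import tempfile

HASH_LEN = 20                 # SHA-1 digest size in bytes
NULL_HASH = b"\0" * HASH_LEN  # assumption: marks "hash was skipped"


def write_file_atomically(path, payload, compute_hash=True):
    """Write payload plus a trailer via temp file + fsync + rename."""
    # Computing the digest is the cost that skipping the hash avoids.
    trailer = hashlib.sha1(payload).digest() if compute_hash else NULL_HASH
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload + trailer)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_file_checked(path):
    """Return the payload, verifying the trailer when one was written."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < HASH_LEN:
        raise ValueError("file too short to hold a trailer")
    payload, trailer = data[:-HASH_LEN], data[-HASH_LEN:]
    if trailer == NULL_HASH:
        # No checksum was stored, so corruption cannot be detected here.
        return payload
    if hashlib.sha1(payload).digest() != trailer:
        raise ValueError("trailing checksum mismatch")
    return payload
```

In this sketch the rename step is what guards against a truncated
file from a writer dying midway, with or without the hash. The trailer
only adds detection of corruption that happens after a successful
write, which is the part given up when compute_hash is False.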

Thanks,
-Stolee