Re: [PATCH 02/30] read-cache: add index.computeHash config option

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Mon, 14 Nov 2022 11:30:41 -0500

On 11/11/2022 6:31 PM, Elijah Newren wrote:
> On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@xxxxxxxxx> wrote:
>>
>> From: Derrick Stolee <derrickstolee@xxxxxxxxxx>
>>
>> The previous change allowed skipping the hashing portion of the
>> hashwrite API, using it instead as a buffered write API. Disabling the
>> hashwrite can be particularly helpful when the write operation is in a
>> critical path.
>>
>> One such critical path is the writing of the index. This operation is so
>> critical that the sparse index was created specifically to reduce the
>> size of the index to make these writes (and reads) faster.
>>
>> Following a similar approach to one used in the microsoft/git fork [1],
>> add a new config option that allows disabling this hashing during the
>> index write. The cost is that we can no longer validate the contents for
>> corruption-at-rest using the trailing hash.
>>
>> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
>>
>> While older Git versions will not recognize the null hash as a special
>> case, the file format itself is still being met in terms of its
>> structure. Using this null hash will still allow Git operations to
>> function across older versions.
>>
>> The one exception is 'git fsck' which checks the hash of the index file.
>> Here, we disable this check if the trailing hash is all zeroes. We add a
>> warning to the config option that this may cause undesirable behavior
>> with older Git versions.
>>
>> As a quick comparison, I tested 'git update-index --force-write' with
>> and without index.computHash=false on a copy of the Linux kernel
>> repository.
>>
>> Benchmark 1: with hash
>>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>>
>> Benchmark 2: without hash
>>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>>
>> Summary
>>   'without hash' ran
>>     1.78 ± 0.76 times faster than 'with hash'
>>
>> These performance benefits are substantial enough to allow users the
>> ability to opt-in to this feature, even with the potential confusion
>> with older 'git fsck' versions.
> 
> This is impressive and interesting...but an improvement unrelated to
> this series other than the fact that it builds on some of it.  Perhaps
> pull this patch out?

While patch 1 is required for the packed-refs work, this one is an easy
way to take advantage of it. I'll submit these two patches soon on their
own as the rest of the RFC is discussed.

> Also, would it make sense to integrate index.computeHash with feature.manyFiles?

It would make sense to include in feature.manyFiles and Scalar's recommended
config. I expect that it would be good to have the config available in a Git
release before updating those configs to include it. Perhaps that is too
conservative, though.

Thanks,
-Stolee