On 3/2/2017 11:35 AM, Linus Torvalds wrote:
On Thu, Mar 2, 2017 at 6:45 AM, Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
It would probably make sense to switch the index integrity check away from
SHA-1 because we really only care about detecting bit flips there, and we
have no need for the computational overhead of using a full-blown
cryptographic hash for that purpose.
Which index do you actually see as being a problem, btw? The main file
index (.git/index) or the pack-file indexes?
We definitely don't need the checking version of sha1 for either of
those, but as Jeff already did the math, at least the pack-file index
is almost negligible, because the pack-file operations that update it
end up doing SHA1 over the objects - and the object SHA1 calculations
are much bigger.
And I don't think we even check the pack-file index hashes except on fsck.
Now, if your _file_ index is 300-400MB (and I do think we check the
SHA fingerprint on that even on just reading it - verify_hdr() in
do_read_index()), then that's going to be a somewhat noticeable hit on
every normal "git diff" etc.
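
For readers following along: the index file ends with a 20-byte SHA-1 computed over every byte that precedes it, and the check discussed here recomputes that hash when the file is read. Below is a minimal Python sketch of the idea, not Git's actual C code. The function names are hypothetical and chosen only for illustration.

```python
import hashlib

SHA1_LEN = 20  # a SHA-1 digest is 20 bytes


def make_checked_blob(body: bytes) -> bytes:
    """Append the SHA-1 of `body` as a trailer (hypothetical helper)."""
    return body + hashlib.sha1(body).digest()


def verify_trailing_sha1(data: bytes) -> bool:
    """Return True if the last 20 bytes are the SHA-1 of everything before them."""
    if len(data) < SHA1_LEN:
        return False
    view = memoryview(data)  # slice without copying a potentially huge buffer
    stored = bytes(view[-SHA1_LEN:])
    return hashlib.sha1(view[:-SHA1_LEN]).digest() == stored
```

The cost is one full pass over the file per read, which is why it becomes visible once the index reaches hundreds of megabytes.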
Yes, the .git/index is 450MB with ~3.1M entries. verify_hdr() is called
each time we read it into memory.
We have been testing a patch in GfW to run the verification in a separate
thread while the main thread parses (and mallocs) the cache_entries. This
does help offset the time.
https://github.com/git-for-windows/git/pull/978/files
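
The overlap described above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the code from the linked pull request: `load_with_background_verify` and the `parse` callback are hypothetical names, and the trailing 20-byte SHA-1 layout is the only detail taken from the index format.

```python
import hashlib
import threading

SHA1_LEN = 20


def load_with_background_verify(data: bytes, parse):
    """Parse `data` on the calling thread while a worker checks its trailing SHA-1.

    `parse` is a hypothetical callback that takes the payload bytes
    (everything except the trailer) and returns the parsed result.
    """
    if len(data) < SHA1_LEN:
        raise ValueError("too short to contain a checksum")
    body, stored = data[:-SHA1_LEN], data[-SHA1_LEN:]
    outcome = {}

    def worker():
        # Hashing runs here, concurrently with parsing on the main thread.
        outcome["ok"] = hashlib.sha1(body).digest() == stored

    thread = threading.Thread(target=worker)
    thread.start()
    parsed = parse(body)  # main thread does the parsing in the meantime
    thread.join()         # do not trust the parse until the hash is confirmed
    if not outcome["ok"]:
        raise ValueError("index checksum mismatch")
    return parsed
```

The result is only returned after the join, so a corrupt file is still rejected; the hashing time is hidden behind the parse rather than removed.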
But I'd have expected the stat() calls of all the files listed by that
index to be the _much_ bigger problem in that case. Or do you just
turn those off with assume-unchanged?
Yeah, those stat calls are threaded when preloading, but even so..
Yes, the stat() calls are a more significant percentage of the time (and
having core.fscache and core.preloadindex helps that greatly), but the total
time for a command is just that -- the total -- so using the philosophy of
"every little bit helps", the faster routines help us here.
Anyway, the file index SHA1 checking could probably just be disabled
entirely (with a config flag). It's a corruption check that simply
isn't that important. So if that's your main SHA1 issue, that would be
easy to fix.
Yes, in the GVFS effort, we disabled the verification with a config setting
and haven't had any incidents.
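
A config-gated check of this kind might look like the sketch below. It is a Python illustration only: the `verify_checksum` parameter stands in for whatever config setting is used, and neither it nor `read_index_payload` is a real Git name.

```python
import hashlib

SHA1_LEN = 20


def read_index_payload(data: bytes, verify_checksum: bool = True) -> bytes:
    """Return the payload, optionally skipping the trailing SHA-1 check.

    `verify_checksum` is a hypothetical stand-in for a config flag.
    """
    if len(data) < SHA1_LEN:
        raise ValueError("too short to contain a checksum")
    body, stored = data[:-SHA1_LEN], data[-SHA1_LEN:]
    if verify_checksum and hashlib.sha1(body).digest() != stored:
        raise ValueError("index checksum mismatch")
    return body
```

With the flag off, a flipped bit in the file goes undetected at read time, which is the trade-off being accepted here.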
Everything else - like pack-file generation etc for a big clone() may
end up using a ton of SHA1 too, but the SHA1 costs all scale with the
other costs that drown them out (ie zlib, network, etc).
I'd love to see a profile if you have one.
Linus