On Thu, Mar 2, 2017 at 10:37 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:
>>
>> Now, if your _file_ index is 300-400MB (and I do think we check the
>> SHA fingerprint on that even on just reading it - verify_hdr() in
>> do_read_index()), then that's going to be a somewhat noticeable hit on
>> every normal "git diff" etc.
>
> Yes, the .git/index is 450MB with ~3.1M entries. verify_hdr() is called
> each time we read it into memory.

Ok. So that's really just a purely historical artifact.

The file index is actually the first part of git to have ever been
written. You can't even see it in the history, because the initial
revision from Apr 7, 2005, obviously depended on the actual object
hashing.

But the file index actually came first. You can _kind_ of see that in
the layout of the original git tree, and how the main header file is
still called "cache.h", and how the original ".git" directory was
actually called ".dircache".

And the two biggest files (by a fairly big margin) are "read-cache.c"
and "update-cache.c".

So that file index cache was in many ways _the_ central part of the
original git model. The sha1 file indexing and object database was
just the backing store for the file index.

But part of that history is then how much I worried about corruption
of that index (and, let's face it, general corruption resistance _was_
one of the primary design goals - performance was high up there too,
but safety in the face of filesystem corruption was and is a primary
issue).

But realistically, I don't think we've *ever* hit anything serious on
the index file, and it's obviously not a security issue. It also isn't
even a compatibility issue, so it would be trivial to just bump the
version header and say that the signature changes the meaning of
the checksum.

That said:

> We have been testing a patch in GfW to run the verification in a separate thread
> while the main thread parses (and mallocs) the cache_entries. This does help
> offset the time.

Yeah, that seems an even better solution, honestly.

The patch would be cleaner without the NO_PTHREADS things.

I wonder how meaningful that thing even is today. Looking at what
seems to select NO_PTHREADS, I suspect that's all entirely historical.
For example, you'll see it for QNX etc, which seems wrong - QNX
definitely has pthreads according to their docs, for example.

               Linus
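
For readers following the thread, here is a minimal sketch of the idea
quoted above: verify the trailing checksum in a separate thread while the
main thread parses the entries. It is not the GfW patch and not git's
code. Every name in it (toy_checksum, verify_job, toy_parse_entries,
toy_read_index) is hypothetical, the checksum is FNV-1a standing in for
SHA-1 so the example has no external dependencies, and the "index" is
assumed to be a buffer of newline-terminated records followed by a
4-byte checksum.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Stand-in checksum (32-bit FNV-1a). The real index uses SHA-1; this
 * placeholder only keeps the sketch self-contained.
 */
uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
	uint32_t h = 2166136261u;
	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/* Work handed to the verification thread (hypothetical layout). */
struct verify_job {
	const unsigned char *buf;  /* payload, then a 4-byte checksum */
	size_t payload_len;        /* length of the payload only */
	int ok;                    /* written by the worker, read after join */
};

/* Worker: recompute the checksum and compare it with the stored one. */
static void *verify_worker(void *arg)
{
	struct verify_job *job = arg;
	uint32_t stored;

	memcpy(&stored, job->buf + job->payload_len, sizeof(stored));
	job->ok = (toy_checksum(job->buf, job->payload_len) == stored);
	return NULL;
}

/* Stand-in for parsing cache entries: count newline-terminated records. */
static size_t toy_parse_entries(const unsigned char *buf, size_t len)
{
	size_t n = 0;
	for (size_t i = 0; i < len; i++)
		if (buf[i] == '\n')
			n++;
	return n;
}

/*
 * Parse on the calling thread while the worker verifies the checksum.
 * Returns the entry count, or -1 if the buffer is too short, the thread
 * cannot be started, or the checksum does not match.
 */
long toy_read_index(const unsigned char *buf, size_t total_len)
{
	struct verify_job job;
	pthread_t tid;
	size_t n;

	if (total_len < sizeof(uint32_t))
		return -1;
	job.buf = buf;
	job.payload_len = total_len - sizeof(uint32_t);
	job.ok = 0;

	if (pthread_create(&tid, NULL, verify_worker, &job))
		return -1;
	n = toy_parse_entries(buf, job.payload_len);  /* overlaps with verify */
	pthread_join(tid, NULL);                      /* job.ok is valid now */

	return job.ok ? (long)n : -1;
}
```

Both threads only read the buffer, and the result flag is read only after
pthread_join(), so the sketch needs no locking. A real caller would also
have to throw away whatever it parsed when the checksum turns out to be
bad.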