On Wed, Mar 27, 2019 at 09:06:20PM -0400, Farhan Khan wrote:

> I am trying to figure out how to calculate the SHA value of a pack file when you
> run `git index-pack file.pack`. I am close, but having a bit of trouble at the
> end. Here's my understanding so far.

It's all but the last 20 bytes. You should be able to reproduce it with:

  # the computed sha1
  size=$(stat --format=%s $pack)
  head -c $((size-20)) $pack | sha1sum

  # the sha1 stored in the file, which should match
  tail -c 20 $pack | xxd

> Git buffers data to be processed and when its exhausted, updates the SHA
> checksum with the previously read data. This is from builtin/index-pack.c,
> specifically fill() which calls flush() to update the SHA value. My question is,
> how does git determine how many bytes at a time to process?
>
> The size of the buffer is the file-scope variable input_len. This size seems to
> be 4096 several times until the very end where it reduces to less-than 4096
> (obviously this depends on the pack file, but in my case its 1074 bytes).
> Ordinarily I would think its a result of the read() call not receiving the full
> 4096 bytes, but there still are left over bytes in the file but my manual
> verification shows there are still remaining bytes in the file which are not run
> through the SHA checksum.

On the fill() side, we may over-read bytes into our buffer. But it's on
the use() side that we actually decide bytes have been used. Note that
it increments input_offset, and then flush() only hashes bytes up to
that offset.

So index-pack is not just blindly hashing N-20 bytes. It's actually
parsing the packfile as it goes, and putting any data it has parsed
correctly into the hash. At the end, we _should_ be left with exactly 20
bytes, and they should match exactly the hash we've computed up to that
point. And in --verify mode (and maybe even other modes) it should be
confirming that.

> How does git calculate a pack file's SHA verification? How does it know what
> size (number of bytes) to read when running flush() to update the buffer?
> (typically 4096). How does it know when in the file to stop updating the SHA1
> value?

The key observation is that flush() isn't actually reading into the
buffer. It's throwing away bytes that have already been marked as used
by use() and shifting the rest to the front of the buffer. And then
fill() is free to read more data into the rest of it.

-Peff
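
For anyone who wants to poke at the fill()/use()/flush() interplay
outside of git, here is a toy model in Python. It is only a sketch under
stated assumptions, not the code from builtin/index-pack.c: the class
HashingReader, the helpers make_toy_stream() and verify(), and the
length-prefixed "toy" record format are all made up for illustration
(the real pack format is more involved), and BUF_SIZE of 4096 is just
the figure mentioned above. What it tries to show is that only bytes
the parser has marked as used get hashed, so the hash naturally stops
right before the 20-byte trailer.

```python
import hashlib
import io
import struct

BUF_SIZE = 4096  # assumed buffer size for this sketch


class HashingReader:
    """Toy model of a fill()/use()/flush() trio; not git's actual code."""

    def __init__(self, stream):
        self.stream = stream
        self.buf = bytearray()  # used-but-unhashed bytes, then unused bytes
        self.offset = 0         # how many bytes at the front are marked used
        self.ctx = hashlib.sha1()

    def flush(self):
        # Hash only the bytes marked as used, then drop them so the
        # remaining (unused) bytes sit at the front of the buffer.
        self.ctx.update(self.buf[:self.offset])
        del self.buf[:self.offset]
        self.offset = 0

    def fill(self, need):
        # Make sure `need` unused bytes are buffered. This may over-read
        # past what the caller asked for, which is fine: over-read bytes
        # are not hashed until somebody calls use() on them.
        if len(self.buf) - self.offset < need:
            self.flush()
            while len(self.buf) < need:
                chunk = self.stream.read(BUF_SIZE - len(self.buf))
                if not chunk:
                    raise EOFError("early EOF")
                self.buf += chunk
        return bytes(self.buf[self.offset:self.offset + need])

    def use(self, n):
        # Mark n bytes as parsed; they are hashed on the next flush().
        self.offset += n


def make_toy_stream(records):
    # Hypothetical format: 4-byte record count, then for each record a
    # 1-byte length plus data, then the SHA-1 of everything before it.
    body = struct.pack(">I", len(records))
    for rec in records:
        body += bytes([len(rec)]) + rec
    return body + hashlib.sha1(body).digest()


def verify(stream):
    rd = HashingReader(stream)
    (count,) = struct.unpack(">I", rd.fill(4))
    rd.use(4)
    for _ in range(count):
        length = rd.fill(1)[0]
        rd.use(1)
        rd.fill(length)
        rd.use(length)
    rd.flush()               # hash everything that was parsed
    computed = rd.ctx.digest()
    stored = rd.fill(20)     # trailer: read, but never fed to the hash
    return computed == stored


records = [bytes([i]) * (i % 200) for i in range(100)]
data = make_toy_stream(records)
print(verify(io.BytesIO(data)))
```

The body here is a bit over 5000 bytes, so fill() has to refill the
buffer partway through, and the last read comes back short, much like
the 1074-byte tail described above.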