Re: Calculating pack file SHA value

Jeff King <peff@xxxxxxxx> · Wed, 27 Mar 2019 22:02:27 -0400

On Wed, Mar 27, 2019 at 09:06:20PM -0400, Farhan Khan wrote:

> I am trying to figure out how to calculate the SHA value of a pack file when you
> run `git index-pack file.pack`. I am close, but having a bit of trouble at the
> end. Here's my understanding so far.

The checksum covers everything but the last 20 bytes of the file. You
should be able to reproduce it with:

  # the computed sha1 (stat --format is the GNU coreutils spelling)
  size=$(stat --format=%s "$pack")
  head -c $((size-20)) "$pack" | sha1sum

  # the sha1 stored in the file, which should match (plain hex, one line)
  tail -c 20 "$pack" | xxd -p
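
If GNU stat or xxd are not handy, the same whole-file check can be
sketched in Python. This is only an illustration of the "hash all but
the last 20 bytes" rule, assuming a SHA-1 repository (20-byte trailer);
the function names are made up for the example.

```python
import hashlib

TRAILER_LEN = 20  # size of a SHA-1 digest; assumes a SHA-1 (not SHA-256) pack


def pack_checksum_ok(pack_bytes: bytes) -> bool:
    """Return True if the last 20 bytes equal the SHA-1 of everything before them."""
    if len(pack_bytes) < TRAILER_LEN:
        return False
    body = pack_bytes[:-TRAILER_LEN]
    trailer = pack_bytes[-TRAILER_LEN:]
    return hashlib.sha1(body).digest() == trailer


def check_pack_file(path: str) -> bool:
    # Reads the whole file into memory: fine for a sketch, not for huge packs.
    with open(path, "rb") as f:
        return pack_checksum_ok(f.read())
```

Calling check_pack_file() on a healthy pack should return True. Note
that this says nothing about whether the contents parse as a pack.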

> Git buffers data to be processed and when it's exhausted, updates the SHA
> checksum with the previously read data. This is from builtin/index-pack.c,
> specifically fill() which calls flush() to update the SHA value. My question is,
> how does git determine how many bytes at a time to process?
> 
> The size of the buffer is the file-scope variable input_len. This size seems to
> be 4096 several times until the very end, where it drops below 4096
> (obviously this depends on the pack file, but in my case it's 1074 bytes).
> Ordinarily I would think it's a result of the read() call not receiving the full
> 4096 bytes, but my manual verification shows there are still bytes remaining in
> the file which are not run through the SHA checksum.

On the fill() side, we may over-read bytes into our buffer. But it's on
the use() side that we actually decide bytes have been used. Note that
use() increments input_offset, and flush() then hashes only the bytes up
to that offset. Anything fill() read beyond that point stays in the
buffer, unhashed, until a later use() covers it.
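
To make that accounting concrete, here is a toy model in Python. It is a
simplified sketch, not the code from builtin/index-pack.c: the class and
attribute names are invented, the "parser" just consumes 100 bytes at a
time, and the buffer is a growable bytearray rather than a fixed array.
The point is only that the hash sees used bytes, never over-read ones.

```python
import hashlib
import io

BUF_SIZE = 4096  # matches the 4096-byte reads described above


class HashingInput:
    """Toy model of a fill()/use()/flush() input buffer (hypothetical names)."""

    def __init__(self, stream):
        self.stream = stream
        self.buf = bytearray()   # bytes read from the stream but not yet flushed
        self.offset = 0          # how many bytes at the front are marked as used
        self.ctx = hashlib.sha1()

    def avail(self):
        return len(self.buf) - self.offset

    def flush(self):
        # Hash only the used prefix, then drop it; unused bytes stay put.
        self.ctx.update(bytes(self.buf[:self.offset]))
        del self.buf[:self.offset]
        self.offset = 0

    def fill(self, want):
        # Ensure at least `want` unused bytes are buffered; may over-read.
        if want > BUF_SIZE:
            raise ValueError("cannot fill more than the buffer size")
        if self.avail() < want:
            self.flush()
            while len(self.buf) < want:
                chunk = self.stream.read(BUF_SIZE - len(self.buf))
                if not chunk:
                    raise EOFError("early EOF")
                self.buf += chunk
        return bytes(self.buf[self.offset:self.offset + want])

    def use(self, n):
        if n > self.avail():
            raise ValueError("used more bytes than were available")
        self.offset += n


def demo():
    body = bytes(range(256)) * 20            # stand-in for parsed pack data
    trailer = hashlib.sha1(body).digest()    # stand-in for the stored checksum
    inp = HashingInput(io.BytesIO(body + trailer))
    remaining = len(body)
    while remaining:
        n = min(100, remaining)              # pretend the parser eats 100 bytes
        inp.fill(n)
        inp.use(n)
        remaining -= n
    inp.flush()                              # hash the last used bytes
    stored = inp.fill(20)                    # what is left should be the trailer
    return inp.ctx.digest() == stored
```

In this model the reads are 4096 bytes until the stream runs short, yet
the digest only ever covers what use() has accounted for, so demo()
returns True even though the trailer sat in the buffer alongside data.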

So index-pack is not just blindly hashing N-20 bytes. It's actually
parsing the packfile as it goes, and putting any data it has parsed
correctly into the hash. At the end, we _should_ be left with exactly 20
bytes, and they should match exactly the hash we've computed up to that
point. And in --verify mode (and maybe even other modes) it should be
confirming that.

> How does git calculate a pack file's SHA verification? How does it know what
> size (number of bytes) to read when running flush() to update the buffer?
> (typically 4096). How does it know when in the file to stop updating the SHA1
> value?

The key observation is that flush() isn't actually reading into the
buffer. It's throwing away bytes that have already been marked as used
by use() and shifting the rest to the front of the buffer. And then
fill() is free to read more data into the rest of it.
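
A worked example of just that flush step, as a hedged Python sketch over
a fixed-size buffer. The function name and the (offset, length) pair are
illustrative; I am assuming here that "length" counts the bytes that
have been read but not yet used. Note that no read happens anywhere.

```python
import hashlib


def flush_used(buf: bytearray, offset: int, length: int, ctx):
    """Hash buf[:offset], then slide the `length` unused bytes to the front.

    Returns the new (offset, length). The buffer keeps its size and no I/O
    is performed; the space after `length` is what a later fill() may use.
    """
    if offset:
        ctx.update(bytes(buf[:offset]))
        # Equivalent of a memmove(): the right-hand slice is copied first.
        buf[:length] = buf[offset:offset + length]
    return 0, length


# 16-byte buffer: 4 used bytes ("AAAA"), 6 unused ("BBBBBB"), 6 free.
buf = bytearray(b"AAAABBBBBB" + b"\x00" * 6)
ctx = hashlib.sha1()
offset, length = flush_used(buf, 4, 6, ctx)
# Now offset == 0, length == 6, buf starts with b"BBBBBB", and only
# b"AAAA" has gone into the hash; 10 bytes are free for the next read.
```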

-Peff