Re: Confused over packfile and index design

Jeff King <peff@xxxxxxxx> · Fri, 8 Apr 2011 20:20:48 -0400

On Fri, Apr 08, 2011 at 07:58:41PM -0400, Steven E. Harris wrote:

> ,----
> | Importantly, packfile indexes are /not/ neccesary to extract objects
> | from a packfile, they are simply used to quickly retrieve individual
> | objects from a pack. The packfile format is used in upload-pack and
> | receieve-pack programs (push and fetch protocols) to transfer objects
> | and there is no index used then - it can be built after the fact by
> | scanning the packfile.
> `----
> 
> That suggests that it's possible to read the packfile linearly and
> deduce where the various objects start and end, without the index
> available.

Yes. For example, when we do a "git fetch", we get _just_ the packfile
and create our own local index.

> Later, in the section on the packfile format, we find this:
> 
> ,----
> | It is important to note that the size specified in the header data is
> | not the size of the data that actually follows, but the size of that
> | data /when expanded/. This is why the offsets in the packfile index are
> | so useful, otherwise you have to expand every object just to tell when
> | the next header starts.
> `----
> 
> Now that makes it sound like without the index, even if one knows where
> a packed object starts, reading its header tells its /inflated/ size,
> /not/ the number of remaining payload bytes representing the object. If
> that's true, then how does one figure out where one object ends and the
> next one begins /without the index/?

The actual object data (whether it is the object itself or a delta) is
all zlib-encoded, so it has its own size header and checksum there, I
believe. The pack-format documentation is a bit vague, but a quick read
of unpack_raw_entry and unpack_entry_data in builtin/index-pack.c seems
to confirm that this is how it works.

Take that response with a grain of salt, though. That is just from my
quick read of the code, so I could be wrong.

> Recall that the first paragraph quoted above says that the index can be
> built from the packfile, as opposed to it being essential to reading the
> packfile. Is one of these paragraphs incorrect?

No, if I'm correct, it is just that there is an extra header that
neither mentions. :)

> The Git documentation on the pack formatÂ mentions that the packed
> object headers represent the lengths as variable-sized integers
> 
> ,----
> | n-byte type and length (3-bit type, (n-1)*7+4-bit length)
> `----
> 
> but it doesn't say whether that's the number of (deflated) payload bytes
> or the inflated object size, as the Git Book asserts.

That should be the inflated object size.

> I imagine that if the format is meant to record the size of the deflated
> payload, then it would be challenging to compress the data straight into
> the packfile, because one wouldn't know the final size until it was
> written, which means that one wouldn't know how many bytes will be
> necessary to write its length in the header, which means one wouldn't
> know where to start writing the deflated payload.

I believe zlib handles streaming it out for us. I'm not too familiar
with zlib's format, but I assume it outputs in chunks with occasional
headers. So finding the end of stream means while reading through the
whole stream and skipping past each chunk.

> Are there any other clarifying documents you can recommend to understand
> the design?

Not that I know of; what's in docs/technical is generally authoritative,
except for reading the code.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html