On Fri, Apr 08, 2011 at 07:58:41PM -0400, Steven E. Harris wrote:

> ,----
> | Importantly, packfile indexes are /not/ necessary to extract objects
> | from a packfile, they are simply used to quickly retrieve individual
> | objects from a pack. The packfile format is used in upload-pack and
> | receive-pack programs (push and fetch protocols) to transfer objects
> | and there is no index used then - it can be built after the fact by
> | scanning the packfile.
> `----
>
> That suggests that it's possible to read the packfile linearly and
> deduce where the various objects start and end, without the index
> available.

Yes. For example, when we do a "git fetch", we get _just_ the packfile
and create our own local index.

> Later, in the section on the packfile format, we find this:
>
> ,----
> | It is important to note that the size specified in the header data is
> | not the size of the data that actually follows, but the size of that
> | data /when expanded/. This is why the offsets in the packfile index are
> | so useful, otherwise you have to expand every object just to tell when
> | the next header starts.
> `----
>
> Now that makes it sound like without the index, even if one knows where
> a packed object starts, reading its header tells its /inflated/ size,
> /not/ the number of remaining payload bytes representing the object. If
> that's true, then how does one figure out where one object ends and the
> next one begins /without the index/?

The actual object data (whether it is the object itself or a delta) is
all zlib-encoded, so the stream is self-delimiting, I believe: zlib adds
its own header and trailing checksum, and the inflater can tell when it
has reached the end of the stream. The pack-format documentation is a
bit vague, but a quick read of unpack_raw_entry and unpack_entry_data in
builtin/index-pack.c seems to confirm that this is how it works.

Take that response with a grain of salt, though. That is just from my
quick read of the code, so I could be wrong.
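
To illustrate the idea (a minimal sketch in Python, not what index-pack
actually does): zlib can tell you where one deflated stream stops even
when no compressed length is recorded anywhere. The function name
split_zlib_streams is made up for this example, and a real pack has a
type/size header in front of each stream, which this sketch leaves out.

```python
import zlib


def split_zlib_streams(buf):
    """Inflate back-to-back zlib streams found in buf.

    Returns a list of (inflated_payload, deflated_length) pairs.
    Hypothetical helper for illustration; real pack entries also carry
    a per-entry header before each stream.
    """
    results = []
    pos = 0
    while pos < len(buf):
        inflater = zlib.decompressobj()
        # Hand over everything that is left; zlib stops at the end of
        # the first stream and leaves the rest in unused_data.
        payload = inflater.decompress(buf[pos:])
        if not inflater.eof:
            raise ValueError("truncated zlib stream")
        consumed = len(buf) - pos - len(inflater.unused_data)
        results.append((payload, consumed))
        pos += consumed
    return results


# Two "objects" stored back to back, with no lengths recorded.
a = zlib.compress(b"blob one\n" * 50)
b = zlib.compress(b"second blob")
streams = split_zlib_streams(a + b)
```

Each pair gives the inflated data plus the number of deflated bytes it
occupied, which is exactly the boundary information an index would
otherwise supply.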

> Recall that the first paragraph quoted above says that the index can be
> built from the packfile, as opposed to it being essential to reading the
> packfile. Is one of these paragraphs incorrect?

No, if I'm correct, it is just that there is an extra layer (the zlib
stream's own framing) that neither mentions. :)

> The Git documentation on the pack format mentions that the packed
> object headers represent the lengths as variable-sized integers
>
> ,----
> | n-byte type and length (3-bit type, (n-1)*7+4-bit length)
> `----
>
> but it doesn't say whether that's the number of (deflated) payload bytes
> or the inflated object size, as the Git Book asserts.

That should be the inflated object size.

> I imagine that if the format is meant to record the size of the deflated
> payload, then it would be challenging to compress the data straight into
> the packfile, because one wouldn't know the final size until it was
> written, which means that one wouldn't know how many bytes will be
> necessary to write its length in the header, which means one wouldn't
> know where to start writing the deflated payload.

Since the header records the inflated size, which is known before any
compression happens, that problem should not come up. I believe zlib
handles streaming the data out for us. I'm not too familiar with zlib's
format, but I assume it outputs in chunks with occasional headers. So
finding the end of a stream means reading through the whole stream and
skipping past each chunk.

> Are there any other clarifying documents you can recommend to understand
> the design?

Not that I know of; what's in Documentation/technical is generally the
authoritative source, short of reading the code.

-Peff