Linus Torvalds <torvalds@xxxxxxxx> writes:

> On Mon, 3 Apr 2006, Linus Torvalds wrote:
>>
>> That said, I think git _does_ have problems with large pack-files. We have
>> some 32-bit issues etc
>
> I should clarify that. git _itself_ shouldn't have any 32-bit issues, but
> the packfile data structure does. The index has 32-bit offsets into
> individual pack-files.
>
> That's not hugely fundamental,...

Linus _does_ understand what he means, but let me clarify and
outline a possible future direction.

 * pack-*.pack file has the following format:

   - The header appears at the beginning and consists of the
     following:

     4-byte signature
     4-byte version number (network byte order)
     4-byte number of objects contained in the pack (network byte order)

     Observation: we cannot have more than 4G versions ;-) and
     more than 4G objects in a pack.

   - The header is followed by a number of object entries, each
     of which looks like this:

     (undeltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     compressed data

     (deltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     20-byte base object name
     compressed delta data

     Observation: length of each object is encoded in a variable
     length format and is not constrained to 32-bit or anything.

   - The trailer records a 20-byte SHA1 checksum of all of the
     above.

 * pack-*.idx file has the following format:

   - The header consists of 256 4-byte network byte order
     integers.  N-th entry of this table records the number of
     objects in the corresponding pack, the first byte of whose
     object name is less than or equal to N (so the last entry
     is the total number of objects).

     Observation: we would need to extend this to an array of
     8-byte integers to go beyond 4G objects per pack, but it is
     not strictly necessary.

   - The header is followed by sorted 24-byte entries, one entry
     per object in the pack.  Each entry is:

     4-byte network byte order integer, recording where the
     object is stored in the packfile as the offset from the
     beginning.

     20-byte object name.

     Observation: we would definitely need to extend this to
     an 8-byte integer plus 20-byte object name to handle a
     packfile that is larger than 4GB.

   - The file is concluded with a trailer:

     A copy of the 20-byte SHA1 checksum at the end of the
     corresponding packfile.

     20-byte SHA1 checksum of all of the above.

This is not fundamental, in that the pack idx file is something
we can regenerate from a packfile.  The push/fetch transfer over
git native protocols does not even transfer the pack idx file;
instead, the recipient uses git-index-pack to generate the pack
idx.

git-index-pack would need to be updated to widen the necessary
fields to 8-byte integers, without breaking existing packfiles.
The code to read the idx file currently has a sanity check to
make sure that the size of the idx file is consistent with
24-byte entries (the last entry in the header matches the number
of objects recorded in the pack).  So we could reliably tell
the current 24-byte version from the 28-byte "beyond 4GB"
version, and support both formats at the same time.

Even after we start supporting the 28-byte "beyond 4GB" format,
we can and we should continue writing the current 24-byte
version of the pack idx file when the packfile offset can be
expressed in 32 bits.

Having said that, I have to warn that this is not for the weak
of heart.  The necessary changes would be somewhat involved.

----------------------------------------------------------------
Pack idx file

        idx
            +--------------------------------+
            | fanout[0] = 2                  |-.
            +--------------------------------+ |
            | fanout[1]                      | |
            +--------------------------------+ |
            | fanout[2]                      | |
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
            | fanout[255]                    | |
            +--------------------------------+ |
main        | offset                         | |
index       | object name 00XXXXXXXXXXXXXXXX | |
table       +--------------------------------+ |
            | offset                         | |
            | object name 00XXXXXXXXXXXXXXXX | |
            +--------------------------------+ |
          .-| offset                         |<+
          | | object name 01XXXXXXXXXXXXXXXX |
          | +--------------------------------+
          | | offset                         |
          | | object name 01XXXXXXXXXXXXXXXX |
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          | | offset                         |
          | | object name FFXXXXXXXXXXXXXXXX |
          | +--------------------------------+
trailer   | | packfile checksum              |
          | +--------------------------------+
          | | idxfile checksum               |
          | +--------------------------------+
          .-------.
                  |
Pack file entry: <+

     packed object header:
        1-byte type (bit 4-6)
               size0 (bit 0-3)
               end-of-length (bit 7; clear on the last size byte)
        n-byte sizeN (as long as MSB is set, each 7-bit)
                size0..sizeN form 4+7+7+..+7 bit integer, size0
                is the least significant part.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
                is the size before compression).
        If it is DELTA, then
          20-byte base object name SHA1 (the size above is the
                size of the delta data that follows).
          delta data, deflated.
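
To make two of the points above concrete, here is a rough sketch in
Python (not git's actual code).  decode_object_header() reads the
variable-length type/size header of a pack entry; it treats size0 as
the low four bits and shifts each following 7-bit group further up,
which is how git's decoder accumulates the size.  idx_entry_width()
mimics the idx size sanity check to tell 24-byte entries from the
proposed 28-byte ones.  Both function names and the constants are made
up for illustration, and the 28-byte layout is only the hypothetical
"beyond 4GB" format discussed in this message.

```python
IDX_HEADER_SIZE = 256 * 4    # fan-out table: 256 4-byte integers
IDX_TRAILER_SIZE = 20 + 20   # packfile checksum + idx file checksum


def decode_object_header(buf, pos=0):
    """Return (type, size, next_pos) for the entry header at buf[pos:]."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07    # bits 4-6
    size = byte & 0x0F               # size0: bits 0-3
    shift = 4
    while byte & 0x80:               # bit 7 set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


def idx_entry_width(idx_size, fanout_last):
    """Guess the idx entry width from the file size.

    fanout_last is fanout[255], i.e. the number of objects.
    24 is the current format; 28 is the hypothetical wide format.
    """
    payload = idx_size - IDX_HEADER_SIZE - IDX_TRAILER_SIZE
    for width in (24, 28):
        if payload == fanout_last * width:
            return width
    raise ValueError("idx size matches neither 24- nor 28-byte entries")


if __name__ == "__main__":
    # Two-byte header: type 1, size 5 + (10 << 4) = 165.
    print(decode_object_header(bytes([0x95, 0x0A])))
    # An idx with 3 objects in the current format.
    print(idx_entry_width(1024 + 3 * 24 + 40, 3))
```

Because the size is accumulated 7 bits at a time, nothing in the pack
entry header itself limits an object to 32 bits; only the idx offset
field does.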