Re: Cygwin can't handle huge packfiles?

Linus Torvalds <torvalds@xxxxxxxx> writes:

> On Mon, 3 Apr 2006, Linus Torvalds wrote:
>> 
>> That said, I think git _does_ have problems with large pack-files. We have 
>> some 32-bit issues etc
>
> I should clarify that. git _itself_ shouldn't have any 32-bit issues, but 
> the packfile data structure does. The index has 32-bit offsets into 
> individual pack-files. 
>
> That's not hugely fundamental,...

Linus _does_ understand what he means, but let me clarify and
outline a possible future direction.

 * pack-*.pack file has the following format:

   - The header appears at the beginning and consists of the following:

     4-byte signature
     4-byte version number (network byte order)
     4-byte number of objects contained in the pack (network byte order)

     Observation: we cannot have more than 4G versions ;-) or
     more than 4G objects in a pack.

   - The header is followed by that number of object entries,
     each of which looks like this:

     (undeltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     compressed data

     (deltified representation)
     n-byte type and length (3-bit type, (n-1)*7+4-bit length)
     20-byte base object name
     compressed delta data

     Observation: length of each object is encoded in a variable
     length format and is not constrained to 32-bit or anything.

  - The trailer records 20-byte SHA1 checksum of all of the above.
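
To make the fixed-size parts of the pack layout concrete, here is
a minimal sketch in Python (not git's own C code) that reads the
12-byte header and checks the trailer.  The function names are
made up for illustration, and the literal signature bytes "PACK"
are an assumption here, since the description above only says
"4-byte signature".

```python
import hashlib
import struct

PACK_SIGNATURE = b"PACK"  # assumption: the text only says "4-byte signature"


def parse_pack_header(data: bytes):
    """Return (version, object_count) read from the 12-byte pack header."""
    if len(data) < 12:
        raise ValueError("a pack header is 12 bytes long")
    # ">4sII": network byte order, 4 raw bytes, then two 32-bit unsigned ints.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != PACK_SIGNATURE:
        raise ValueError("not a packfile: bad signature")
    return version, count


def trailer_matches(data: bytes) -> bool:
    """Check that the last 20 bytes are the SHA1 of everything before them."""
    return hashlib.sha1(data[:-20]).digest() == data[-20:]
```

Both header fields are unsigned 32-bit integers, which is where
the "4G versions" and "4G objects" limits come from.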

 * pack-*.idx file has the following format:

  - The header consists of 256 4-byte network byte order
    integers.  The N-th entry of this table records the number of
    objects in the corresponding pack whose object name has a
    first byte less than or equal to N.

    Observation: we would need to extend this to an array of
    8-byte integers to go beyond 4G objects per pack, but it is
    not strictly necessary.

  - The header is followed by sorted 24-byte entries, one entry
    per object in the pack.  Each entry is:

    4-byte network byte order integer, recording where the
    object is stored in the packfile as the offset from the
    beginning.

    20-byte object name.

    Observation: we would definitely need to extend this to
    8-byte integer plus 20-byte object name to handle a packfile
    that is larger than 4GB.

  - The file is concluded with a trailer:

    A copy of the 20-byte SHA1 checksum at the end of
    corresponding packfile.

    20-byte SHA1-checksum of all of the above.
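
As a rough illustration of how this layout supports lookups, the
following Python sketch narrows the search with the fan-out table
and then binary-searches the sorted entries.  It assumes the
current layout of a 4-byte offset plus a 20-byte name per entry,
and that fanout[N] counts the objects whose first name byte is at
most N (so fanout[255] is the total).  find_offset() and
build_idx() are hypothetical helpers, not part of git;
build_idx() exists only so the example can be run.

```python
import hashlib
import struct

FANOUT_SIZE = 256 * 4   # 256 network-order 32-bit counters
ENTRY_SIZE = 4 + 20     # 4-byte pack offset + 20-byte object name


def find_offset(idx: bytes, name: bytes):
    """Return the pack offset recorded for a 20-byte name, or None."""
    fanout = struct.unpack(">256I", idx[:FANOUT_SIZE])
    first = name[0]
    # Entries whose name starts with `first` occupy [lo, hi) in the table.
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    while lo < hi:
        mid = (lo + hi) // 2
        start = FANOUT_SIZE + mid * ENTRY_SIZE
        (offset,) = struct.unpack(">I", idx[start:start + 4])
        entry_name = idx[start + 4:start + ENTRY_SIZE]
        if entry_name == name:
            return offset
        if entry_name < name:
            lo = mid + 1
        else:
            hi = mid
    return None


def build_idx(entries, pack_checksum=b"\0" * 20):
    """Build an idx image from (name, offset) pairs, for illustration only."""
    entries = sorted(entries)
    fanout = [0] * 256
    for name, _ in entries:
        fanout[name[0]] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]      # make the counters cumulative
    body = struct.pack(">256I", *fanout)
    for name, offset in entries:
        body += struct.pack(">I", offset) + name
    body += pack_checksum               # copy of the packfile checksum
    return body + hashlib.sha1(body).digest()
```

The ">I" format for the offset is exactly the field that caps a
packfile at 4GB; a wider layout would need a different ENTRY_SIZE
and an 8-byte unpack in its place.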

This is not fundamental, in that the pack idx file is something we
can regenerate from a packfile.  The push/fetch transfer over
the git native protocols does not even transfer the pack idx file;
instead, the recipient uses git-index-pack to generate it.
git-index-pack would need to be updated to write the offset
fields as 8-byte integers where necessary, without breaking
existing packfiles.

The code that reads an idx file currently has a sanity check to
make sure that the size of the idx file is consistent with
24-byte entries (the last entry in the header gives the number
of objects recorded in the pack).  So we could reliably tell
the current 24-byte version from the 28-byte "beyond 4GB"
version, and support both formats at the same time.
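
One way such a size-based check could look, sketched in Python
under the assumptions above (a 1024-byte fan-out table, a 40-byte
trailer, and the last fan-out entry holding the object count).
idx_entry_size() is a hypothetical name, and the 28-byte layout is
only the proposal from this message, not an existing format.

```python
import struct

FANOUT_SIZE = 256 * 4
TRAILER_SIZE = 20 + 20   # packfile checksum + idx file checksum


def idx_entry_size(idx: bytes) -> int:
    """Guess the per-object entry width (24 or 28) from the file size."""
    if len(idx) < FANOUT_SIZE + TRAILER_SIZE:
        raise ValueError("idx file is too short")
    # The last fan-out entry is the total number of objects.
    (count,) = struct.unpack(">I", idx[FANOUT_SIZE - 4:FANOUT_SIZE])
    table_bytes = len(idx) - FANOUT_SIZE - TRAILER_SIZE
    for size in (24, 28):
        if table_bytes == count * size:
            return size
    raise ValueError("idx size matches neither 24- nor 28-byte entries")
```

An empty pack (zero objects) matches both widths and is reported
as 24 here, which is harmless because there are no entries to
misread.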

Even after we start supporting the 28-byte "beyond 4GB" format,
we can and should continue writing the current 24-byte version
of the pack idx file whenever every packfile offset fits in
32 bits.

Having said that, I have to warn that this is not for the faint
of heart.  The necessary changes would be somewhat involved.


----------------------------------------------------------------

Pack idx file

	idx
	    +--------------------------------+
	    | fanout[0] = 2                  |-.
	    +--------------------------------+ |
	    | fanout[1]                      | |
	    +--------------------------------+ |
	    | fanout[2]                      | |
	    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
	    | fanout[255]                    | |
	    +--------------------------------+ |
main	    | offset                         | |
index	    | object name 00XXXXXXXXXXXXXXXX | |
table	    +--------------------------------+ | 
	    | offset                         | |
	    | object name 00XXXXXXXXXXXXXXXX | |
	    +--------------------------------+ |
	  .-| offset                         |<+
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | +--------------------------------+
	  | | offset                         |
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	  | | offset                         |
	  | | object name FFXXXXXXXXXXXXXXXX |
	  | +--------------------------------+
trailer	  | | packfile checksum              |
	  | +--------------------------------+
	  | | idxfile checksum               |
	  | +--------------------------------+
          .-------.      
                  |
Pack file entry: <+

     packed object header:
	1-byte type (bit 4-6)
	       size0 (bit 0-3)
               end-of-length (bit 7)
        n-byte sizeN (as long as MSB is set, each 7-bit)
		size0..sizeN form 4+7+7+..+7 bit integer, size0
		is the least significant part, and sizeN is the
		most significant part.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
		is the size before compression).
	If it is DELTA, then
	  20-byte base object name SHA1 (the size above is the
	  	size of the delta data that follows).
          delta data, deflated.
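
The packed object header can be decoded with a short loop.  The
Python sketch below is an illustration, not git's code: it takes
the type from bits 4-6 of the first byte, treats size0 as the
least significant four bits of the size, and shifts each following
7-bit group further left, which is how git's decoder accumulates
the size.  Both function names are hypothetical; the encoder is
included only to allow a round-trip check.

```python
def decode_object_header(buf: bytes, pos: int = 0):
    """Return (type, size, next_pos) for the header starting at buf[pos]."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # bits 4-6
    size = byte & 0x0F              # bits 0-3: size0
    shift = 4
    while byte & 0x80:              # bit 7 set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


def encode_object_header(obj_type: int, size: int) -> bytes:
    """Inverse of decode_object_header(), for illustration only."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)     # mark that more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)
```

Because Python integers are unbounded, the same loop handles sizes
above 4G, which matches the observation that the per-object length
is not constrained to 32 bits.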

