Re: Fwd: Git and Large Binaries: A Proposed Solution

Alexander Miseler <alexander@xxxxxxxxxx> · Sun, 13 Mar 2011 20:33:18 +0100

My thoughts on big file storage:

We want to store them as flat as possible. Ideally, if we have a temp file with the content (e.g. the output of some filter), it should be possible to store it by simply doing a move/rename and updating some metadata external to the actual file.
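
To make the "move/rename plus external metadata" idea concrete, here is a minimal sketch in Python. It is not how git stores anything today: the function name store_flat, the flat directory layout, and the in-memory index dict are all hypothetical stand-ins for whatever metadata store would actually be used.

```python
import os

def store_flat(tmp_path, objdir, name, index):
    """Store a finished temp file as an object without rewriting its bytes.

    `index` is a hypothetical metadata store (a plain dict here); in a real
    design this could be an idx file or similar.
    """
    size = os.path.getsize(tmp_path)
    os.makedirs(objdir, exist_ok=True)
    dest = os.path.join(objdir, name)
    # A rename is cheap only when source and destination are on the same
    # filesystem, so the temp file should be created next to the object store.
    os.replace(tmp_path, dest)
    # Everything git needs to know about the object lives outside the file.
    index[name] = {"type": "blob", "size": size, "flat": True}
    return dest
```

The point of the sketch is only that the content is never read or copied; all the work is one rename and one metadata update.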

Options:

1.) The loose file format is inherently unsuited for this. It has a header before the actual content, and the whole file (header + content) is always compressed. Even if one changed this to compress/decompress header and content independently, it would still be unsuited because a) the header lives in the same file and b) the header has no flags or other means to indicate a different behavior (e.g. no compression) for the content. We could extend the header format or introduce a new object type (e.g. flatblob), but both would probably cause more trouble than other solutions.
Another idea would be to keep the metadata in an external file (e.g. 84d7.header for the object 84d7). This would probably perform badly though, since every object lookup would first need to check for the existence of a header file. A smarter variant would be to optionally keep the metadata directly in the filename (e.g. saving the object as 84d7.object_type.size.flag instead of just 84d7). This would only require special handling for cases where the normal lookup for 84d7 fails.
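
A rough sketch of that filename variant, in Python for brevity. The names find_loose and LooseEntry are made up for illustration, and the object names are shortened as in the example above (real loose objects live under a two-character fan-out directory with the full sha1).

```python
import os
from typing import NamedTuple, Optional

class LooseEntry(NamedTuple):
    path: str
    obj_type: Optional[str]  # None: metadata is in the in-file header as usual
    size: Optional[int]
    flag: Optional[str]

def find_loose(objdir: str, name: str) -> Optional[LooseEntry]:
    # Fast path: the ordinary loose object. No extra cost for normal objects.
    plain = os.path.join(objdir, name)
    if os.path.exists(plain):
        return LooseEntry(plain, None, None, None)
    # Slow path, reached only when the normal lookup fails:
    # look for "<name>.<object_type>.<size>.<flag>".
    try:
        entries = sorted(os.listdir(objdir))
    except FileNotFoundError:
        return None
    prefix = name + "."
    for entry in entries:
        if entry.startswith(prefix):
            parts = entry[len(prefix):].split(".")
            if len(parts) == 3 and parts[1].isdigit():
                return LooseEntry(os.path.join(objdir, entry),
                                  parts[0], int(parts[1]), parts[2])
    return None
```

Unlike the 84d7.header variant, nothing extra is checked as long as the plain file exists; the directory scan is the price paid only for flat objects.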

2.) The pack format fares a lot better. Content and metadata are already separated, with the metadata describing how the content is stored. We would need a flag to mark the content as flat and that would pretty much be it. We would still need to include a virtual header when calculating the sha1, so that the same content is guaranteed to always get the same id.
Thus I think we should forgo the loose object phase when storing big files and simply drop each big file flat as an individual pack file, with the idx file describing it as a pack file with one entry which is stored flat.
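
For the "virtual header" part, a small Python sketch of what I mean. Git hashes a blob as sha1 over "blob <size>\0" followed by the content, so the header can be fed to the hash without ever being written into the stored file. The function name blob_id and the chunk size are my own choices here; reading in chunks keeps memory use flat for big files.

```python
import hashlib
import os

def blob_id(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the git blob id of a file that is stored without any header."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    # The header exists only inside the hash computation, never on disk.
    h.update(b"blob %d\0" % size)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```

The result should match what "git hash-object <file>" prints for the same content (with no filters or conversions applied), which is exactly the property that keeps ids stable regardless of how the content is stored.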

3.) Do some completely different handling for big files, as suggested by Eric:
>>   1.1 Perhaps a "binaries" directory, or structure of directories, within .git
> 
> I'd rather not do something so drastic.
My main issue with this approach (apart from the 'drastic' ^_^) is that the definition of a big file may change at any time, e.g. by changing a config value like core.bigFileThreshold. What has been stored as a big file may suddenly be considered a normal blob and vice versa. Thus any storage variant that isn't well integrated into the normal object storage will probably be troublesome.
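
A toy illustration of the problem, in Python. The file names and sizes are invented, and the exact comparison (strictly greater than the threshold) is an assumption on my part:

```python
MIB = 1024 * 1024

def is_big(size: int, threshold: int) -> bool:
    # Assumption: "big" means strictly larger than the configured threshold.
    return size > threshold

def classify(sizes: dict, threshold: int) -> dict:
    return {name: ("big" if is_big(size, threshold) else "normal")
            for name, size in sizes.items()}

# Hypothetical objects already sitting in a repository.
sizes = {"image.iso": 700 * MIB, "design.psd": 100 * MIB}

before = classify(sizes, 512 * MIB)  # threshold at store time
after = classify(sizes, 64 * MIB)    # threshold after a config change
moved = sorted(name for name in sizes if before[name] != after[name])
```

Here design.psd ends up in "moved": it was written as a normal blob but would now be looked for as a big file. A lookup that picks the storage location from the current classification would miss it, so the lookup must not depend on the threshold at all.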

> There may also be code-paths for binary files where
> we accidentally load them (I just fixed one last week where we
> unnecessarily loaded them in the diffstat code path). Somebody will need
> to do some experimenting to shake out those code paths.

This is my main focus for now. Those code paths are easy to detect when your memory is small enough :D
