On 7 Sep 2006 09:34:56 -0400, linux@xxxxxxxxxxx <linux@xxxxxxxxxxx> wrote:
>> An alternative would be to create a small "placeholder" object that
>> just gives an ID, then refer to it by offset.
>>
>> That would avoid the need for an id/offset bit with every offset,
>> and possibly save more space if the same object was referenced
>> multiple times.
>>
>> And it just seems simpler.

> There are 2 million objects in the Mozilla pack. This table would take:
> 2M * (20b (sha) + 10b (object index/overhead)) = 60MB
> This 60MB is pretty much incompressible and increases download time.
>
> Much better if storage of the sha1s can be totally eliminated and
> replaced by something smaller. Alternatively this map could be
> stripped for transmission and rebuilt locally.

Um, I think I wasn't clear. Objects in a "thin" pack (for network
updating of a different pack) that are referred to but not included
would have stand-ins containing just the object ID. Objects that *are*
present would simply be present and referred to by offset as usual.
Yes, I missed the thin pack context. Has anyone tried building a thin
pack? Having thin packs would address major concerns with the download
time of the initial Mozilla checkout. Right now you have to download
450MB of data that 99% of people are never going to look at.

Thin packs would also set an 'archival' bit as discussed earlier. The
'archival' bit tells the local tools to leave these packs alone. You
don't want to accidentally trigger a 450MB download just because you did
a git-repack. Of course git would need to develop error messages
indicating that you asked for something in the archive that isn't
present.

I'm not sure I would mix these external references in with other
objects. Instead I would make a pair of packs: one contains only
external reference stubs and is tiny; the second is the full version of
the same data. That makes getting the old data easy: the full pack just
replaces the stub pack. The stub pack doesn't need to contain stubs for
all the objects, only the ones referenced upstream.

I haven't tried doing this. Does the current git code support having
multiple pack files covering different pieces of the project? Can we
have an archive pack and a local current pack that get merged together?
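To make the stub-pack idea above concrete, here is a minimal Python
sketch. It is not git code: the names `Stub`, `ArchivedObjectMissing`
and `lookup` are hypothetical, and a pack is modelled as a plain dict
from object ID to contents. It only illustrates the behaviour proposed
above: a lookup that lands on a stub reports that the object lives in
the archive, and swapping the stub pack for the full pack makes the
same lookup succeed.

```python
from typing import Dict, List, Union


class ArchivedObjectMissing(LookupError):
    """Hypothetical error: the object is only stubbed out locally."""


class Stub:
    """Hypothetical marker for an object that lives in the archive pack."""


# A pack is modelled as {object_id: contents-or-stub}; this is an
# assumption for illustration, not git's on-disk layout.
Pack = Dict[str, Union[bytes, Stub]]


def lookup(object_id: str, packs: List[Pack]) -> bytes:
    """Search the packs in order and return the object's contents."""
    for pack in packs:
        entry = pack.get(object_id)
        if entry is None:
            continue  # not in this pack, try the next one
        if isinstance(entry, Stub):
            raise ArchivedObjectMissing(
                f"object {object_id} is in the archive pack, "
                "which is not present locally"
            )
        return entry
    raise KeyError(object_id)


current_pack: Pack = {"c3": b"new commit"}
stub_pack: Pack = {"a1": Stub()}
full_archive_pack: Pack = {"a1": b"old blob", "a0": b"older blob"}
```

With `[current_pack, stub_pack]`, looking up "a1" raises
`ArchivedObjectMissing`; with `[current_pack, full_archive_pack]` the
same call returns the old blob, which is the "it just replaces the stub
pack" step.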
Imagine you have a "thin" pack containing a delta to an object that the
recipient has, so it isn't in the pack. The delta has to specify the
base object somehow. If the base object is in the pack, you can specify
it by offset. If it's not, you can either:
- Generalize the base object pointer to allow an object ID option, or
- Provide a pointer to a magic kind of "external reference" pointer
  object.

I was proposing the latter. For regular packs, such objects wouldn't
even be present, because all base objects are in the pack itself. And,
of course, you'd only create such objects if you needed to, if there
was at least one pointer to them.

Compared to putting the object ID directly in the pointer, it has

Cost:
- An extra offset pointer and object header.
- Extra time following the indirection when resolving the pointer.

Benefit:
- Non-indirect object pointers are a bit smaller.
- The code is simpler.
- Second and later references to the same external object are another
  offset, not another 20 bytes.
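The second option and its trade-off can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not the pack
format: the class names, the dict-of-offsets pack model, and the sizes
`OFFSET_BYTES = 4` and `HEADER_BYTES = 1` are made up for the example.
Only the 20-byte object ID comes from the discussion above.

```python
from dataclasses import dataclass
from typing import Dict, Union

ID_BYTES = 20      # SHA-1 object ID, as in the discussion
OFFSET_BYTES = 4   # assumed size of an in-pack offset pointer
HEADER_BYTES = 1   # assumed size of a stub's object header


@dataclass
class FullObject:
    data: bytes


@dataclass
class ExternalRef:
    object_id: bytes  # stand-in holding only the ID of an absent object


@dataclass
class Delta:
    base_offset: int  # always an offset, never an inline object ID
    patch: bytes      # opaque here; applying deltas is out of scope


Entry = Union[FullObject, ExternalRef, Delta]


def resolve_base(pack: Dict[int, Entry], delta: Delta,
                 local_store: Dict[bytes, bytes]) -> bytes:
    """Return the base contents, following a stub to the recipient's store."""
    base = pack[delta.base_offset]
    if isinstance(base, ExternalRef):
        return local_store[base.object_id]  # the extra indirection
    if isinstance(base, FullObject):
        return base.data
    raise NotImplementedError("delta chains are omitted in this sketch")


def inline_id_cost(refs: int) -> int:
    """Bytes spent if every reference carries the 20-byte ID itself."""
    return refs * ID_BYTES


def stub_cost(refs: int) -> int:
    """Bytes spent with one stub object plus one offset per reference."""
    return HEADER_BYTES + ID_BYTES + refs * OFFSET_BYTES
```

Under these assumed sizes a single reference costs 25 bytes via a stub
versus 20 bytes inline, while three references to the same external
object cost 33 bytes versus 60, which is the cost/benefit split listed
above.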
-- 
Jon Smirl
jonsmirl@xxxxxxxxx