> Yes I missed the thin pack context. Has anyone tried building a thin
> pack? Having thin packs would address major concerns with the download
> time of the initial Mozilla checkout. Right now you have to download
> 450MB of data that 99% of the people are never going to look at.

Unfortunately, no, it wouldn't. There's a second misunderstanding here:
thin packs are used by the git on-the-wire protocol to send a delta
relative to an object *that you know the recipient already has*. This
can greatly speed up incremental updates. It does NOT have any effect
on the initial checkout, because you know the recipient has nothing, so
you have to send everything.

What you seem to be talking about is what people call a "shallow
clone". Unfortunately, making a shallow clone is difficult for reasons
that are deeply embedded in git's incremental update protocol.

When you fetch from a remote repository, it needs to figure out what
you need and what you already have. The fundamental assumption that the
protocol uses to do this efficiently is that if you have a commit
object, you have all of its ancestors, and their trees and blobs. So
you can say "I have 60d4684068ff1eec78f55b5888d0bd2d4cca1520" (which is
the commit behind tags/v2.6.18-rc5 from the Linux kernel) and the
server you're pulling from knows a lot about what you have.

The first hard part about a shallow repository is describing
*efficiently* to git-upload-pack what you don't have. The second is
figuring out what you should get. Suppose that you clone from a
repository that has a linear chain of development, and you cut off
everything before version X. Simple enough so far. Then you try to pull
from someone whose work branched off at version X-1. Do you want to
fetch everything back to the common ancestor? If you want to cut off
their branch, where do you cut it off? A simple timestamp limit? What
if you get a bogus timestamp on a commit?

This has been considered by git developers quite carefully and is NOT
an easy problem. A solution would be welcome, but if you come up with
an easy one, please spend a few minutes considering the possibility
that the reason it's so easy is that your solution is flawed.

There are also problems with giving sensible error messages if you try
to refer to a missing object, and with making your clone deeper if
desired, but those are secondary to the big problems.

> I'm not sure I would mix these external references in with other
> objects. Instead I would make a pair of packs, one only contains
> external reference stubs and is tiny. The second is the full version
> of the same data. That makes getting the old data easy, it just
> replaces the stub pack. The stub pack doesn't need to contain stubs
> for all the objects, only the ones referenced upstream.

Huh?? This is flatly impossible, and completely defeats the purpose.
This is one of those "Pray, Mr. Babbage, if you put into the machine
wrong figures, will the right answers come out?" questions. I think
there is a *serious* miscommunication here.

I have a delta-compressed object. It is stored in the pack as a series
of changes to some base object. To be of any use, it needs to identify
that base object. If the base object identification takes the form of
an intra-pack offset (the new proposal) as opposed to a global 20-byte
object ID, then the base object needs to be stored *intra-pack*. Within
the same pack. The whole idea is "hey, we don't allow delta-ing
relative to objects not in the pack anyway, so why waste space storing
such large pointers?"
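To put a number on that trade-off, here is a rough sketch -- NOT git's
actual code -- contrasting the two ways a deltified entry could name
its base: the existing global 20-byte object ID versus the proposed
back-offset into the same pack. The variable-length offset encoding is
just one plausible scheme (7 bits per byte, high bit meaning "more
bytes follow"); the proposal's exact on-disk encoding may differ.

/*
 * Rough sketch, not git's actual code.  It only illustrates the size
 * and locality trade-off between the two ways a deltified entry could
 * name its base object.
 */
#include <stdio.h>
#include <string.h>

/* Existing scheme: name the base by its full 20-byte object ID.  The
 * base can live anywhere, even outside this pack (a thin pack). */
static size_t emit_base_by_id(unsigned char *out, const unsigned char id[20])
{
	memcpy(out, id, 20);
	return 20;			/* always 20 bytes of overhead */
}

/* Proposed scheme: name the base by how far back it sits in the SAME
 * pack, as a base-128 varint (high bit set = more bytes follow). */
static size_t emit_base_by_offset(unsigned char *out, unsigned long back)
{
	unsigned char buf[16];
	size_t pos = sizeof(buf);

	buf[--pos] = back & 0x7f;
	while ((back >>= 7) != 0)
		buf[--pos] = 0x80 | (back & 0x7f);

	memcpy(out, buf + pos, sizeof(buf) - pos);
	return sizeof(buf) - pos;	/* typically 1 to 4 bytes */
}

int main(void)
{
	unsigned char out[32];
	unsigned char id[20] = { 0x60, 0xd4, 0x68, 0x40 };	/* rest zero */
	unsigned long offsets[] = { 90, 4000, 12000000 };
	size_t i;

	printf("base by 20-byte ID       : %2zu bytes\n",
	       emit_base_by_id(out, id));
	for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++)
		printf("base by offset %-8lu  : %2zu bytes\n",
		       offsets[i], emit_base_by_offset(out, offsets[i]));
	return 0;
}

The offset is small precisely because it can only point backwards
within the same pack; the 20-byte ID costs more, but it is the only
form that can name a base object the pack does not contain, which is
what a thin pack needs.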
This would make the currently forbidden case of a delta relative to an
object not in the same pack completely impossible. It's like asking for
the square root of 2 in roman numerals.

> I haven't tried doing this, does the current git code support having
> multiple pack files covering different pieces of the project? Can we
> have an archive pack and a local current pack that get merged
> together?

Yes, it supports it perfectly. A pack is just a bag of objects. There
is zero required higher-level structure. In particular, you can have a
"100% blobs" pack with no tree or commit objects at all. Git uses the
object relationships in its compression heuristics, but they have no
effect on correctness.

The only thing is that the original pack format was a *self-contained*
bag of objects: every object in the pack could be reconstructed given
only the pack itself. In particular, cross-pack deltas were forbidden.
However, then someone realized that making an exception for git
protocol updates would be easy and a Really Good Thing. A pack that is
missing one or more base objects is called a "thin pack". So the rule I
mentioned above has already been broken. Supporting that is what I'm
talking about.
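If it helps to see that distinction spelled out, here is a toy model,
with made-up structures that have nothing to do with git's real ones:
a pack is just a bag of objects, and it is "thin" exactly when some
delta in it names a base object that is not in the same bag.

/*
 * Toy model only -- these structures are invented for illustration
 * and are not git's.
 */
#include <stdio.h>
#include <string.h>

struct entry {
	const char *id;		/* object name (stand-in for a SHA-1) */
	const char *base_id;	/* NULL for a full object, else the delta base */
};

struct pack {
	const char *name;
	const struct entry *entries;
	int count;
};

static int pack_has(const struct pack *p, const char *id)
{
	int i;
	for (i = 0; i < p->count; i++)
		if (!strcmp(p->entries[i].id, id))
			return 1;
	return 0;
}

/* Thin = at least one delta whose base lives outside this pack. */
static int pack_is_thin(const struct pack *p)
{
	int i;
	for (i = 0; i < p->count; i++)
		if (p->entries[i].base_id && !pack_has(p, p->entries[i].base_id))
			return 1;
	return 0;
}

int main(void)
{
	/* An archive pack: full object A, plus B stored as a delta against A. */
	static const struct entry archive_entries[] = {
		{ "A", NULL },
		{ "B", "A"  },
	};
	/* An incremental pack: C stored as a delta against A, which is NOT
	 * here -- only legal because the sender knows the receiver has A. */
	static const struct entry update_entries[] = {
		{ "C", "A" },
	};
	struct pack archive = { "archive", archive_entries, 2 };
	struct pack update  = { "update",  update_entries,  1 };

	printf("%s pack: %s\n", archive.name,
	       pack_is_thin(&archive) ? "thin" : "self-contained");
	printf("%s pack: %s\n", update.name,
	       pack_is_thin(&update) ? "thin" : "self-contained");
	return 0;
}

The receiving end of a thin pack has to find those out-of-pack bases in
its own object store before it can reconstruct anything, which is
exactly why a thin pack only makes sense when the sender knows what the
receiver already has.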