Richard Curnow <rc@xxxxxxxxxx> writes: > <linux <at> horizon.com> writes: > >> For regular packs, such objects wouldn't even be present, because >> all base objects are in the pack itself. > > It would actually be useful if this restriction were lifted. The primary reason why we require the base object to be present in the same pack is because our assumption has always been we mmap a pack as a whole, and we assumed that we can make sure that the address space of a process to be enough to mmap at least one pack (otherwise the user will be given a way to keep size of each pack down to a size that is a mmapable piece) but not necessarily more than one. In other words, while unpacking one object that is deltified, we wanted to make sure everything that is necessary is availble in memory. Otherwise, when multiple packs are needed to reconstitute a single object because the object is a delta in one pack and its base is in the other pack, we might not be able to mmap both of them into the address space at the same time. The "mmap only parts of a huge packfile into map windows" update Shawn Pearce is working on would lift that worry, so it would be less of a reason. Having said that, people must realize that keeping thin packs on disk, indexed, is a major undertaking. It can be made to work, but it will be painful. People can grab a thin pack via rsync or http without realizing it is thin, try to use it and then later (possibly much later, when a delta that is against a missing object happens to be needed) discover some of the pack data is unusable. To avoid this, the dumb transports need to walk the ancestry chains of commits and their associated objects while fetching as they do right now, but also need to walk the delta chains to make sure their base objects are available. I like the proposal by Linus to have stub objects that records what the missing base object is in the thin pack stream. It simplifies things for the offset representation of bases. But if we were to do so, I suspect indexing such a thing would need more surgery to the .idx format. The question is if we should record the stub object in the .idx file. Right now we record the beginning of data for each object (either base or delta) without storing anything to say about where it ends -- the pack is assumed to be dense, so the point where next object begins should be where the current one ends. This means that we need to build a reverse index to list the objects in the pack in offset order, when we want to find the end of each object efficiently. We work this around when using a pack by not doing the full verification. We start decoding the header from the beginning and in either base or delta the deflated stream sits at the end. We inflate that deflated stream until zlib says it got enough, and we do not check if that "enough" point stepped into the location that is supposed to be inside the "next" object (if it did so then it is an error in the packfile we missed to detect). If things inflate well and delta applies cleanly, we are happy. But pack-objects, when reusing the data from an existing pack, wants to be careful, so it _does_ build a reverse index itself and makes sure that inflate() ends at the location we expect it to. For that process to remain the correct way to determine where each object ends, the .idx file must record the presense of stub objects. However, if we write the names of stub objects in the .idx file, we have another problem. Currently, the existence of an object name in the .idx means that such an object exists in the corresponding .pack; this will not hold true anymore. So if we were to do this, we need to give one extra bit to each of the .idx entries to mark which ones are stub and which ones are not. This would allow the reverse index to properly notice where each object ends, and allow the dumb transports to walk the .idx for missing base objects efficiently. The code that looks at .idx at runtime to determine which pack has needed object needs to notice this bit. Again, all of this can be made to work, but it will be painful. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html