On Fri, 28 Jan 2011, Shawn Pearce wrote:

> This started because I was looking for a way to speed up clones coming
> from a JGit server.  Cloning the linux-2.6 repository is painful, it
> takes a long time to enumerate the 1.8 million objects.  So I tried
> adding a cached list of objects reachable from a given commit, which
> speeds up the enumeration phase, but JGit still needs to allocate all
> of the working set to track those objects, then go find them in packs
> and slice out each compressed form and reformat the headers on the
> wire.  It's a lot of redundant work when your kernel repository has
> 360MB of data that you know a client needs if they have asked for your
> master branch with no "have" set.
>
> Later I realized, we can get rid of that cached list of objects and
> just use the pack itself.  It's far cleaner, as there is no redundant
> cache.  But either way (object list or pack) it's a bit of a challenge
> to automatically identify the right starting points to use.  Linus
> Torvalds' linux-2.6 repository is the perfect case for the RFC I
> posted, it's one branch with all of the history, and it never rewinds.
> But maybe Linus is just very unique in this world.  :-)

Playing my old record again... I know.  But pack v4 should solve a big
part of this enumeration cost.

I've changed the format slightly again in my WIP branch.  The idea is to:

1) have a non-compressed yet still really dense representation for tree
   objects;

2) do the same thing for the first part of commit objects, and only
   deflate the free-form text part.

There is nothing new here.  However, it should be possible to:

3) replace all SHA1 references by an offset into the pack file directly,
   just like we do for OFS_DELTA objects.  If the SHA1 is actually
   needed then we can obtain it with a reverse lookup of the object's
   offset in the pack index file, but in practice that is not required
   that often.
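
As a rough illustration of point 3 (a toy sketch in Python, not the
actual pack v4 encoding; the PACK layout, the fake SHA1s and the
function names are all made up for the example), enumeration becomes a
plain walk over offsets, and the SHA1 is only fetched through a reverse
lookup when something actually asks for it:

```python
import bisect
import hashlib

# Toy in-memory "pack": offset -> (type, [offsets of referenced objects]).
# Hypothetical layout for illustration only; not the real pack v4 format.
PACK = {
    12: ("blob", []),
    40: ("blob", []),
    75: ("tree", [12, 40]),
    120: ("commit", [75]),        # root commit: references its tree only
    160: ("tree", [12]),
    200: ("commit", [160, 120]),  # references its tree, then its parent
}


def fake_sha1(offset):
    """Stand-in object name; a real pack index stores the real SHA1."""
    return hashlib.sha1(str(offset).encode()).hexdigest()


# Toy reverse index: (offset, sha1) pairs sorted by offset.
REVERSE_INDEX = sorted((off, fake_sha1(off)) for off in PACK)
OFFSETS = [off for off, _ in REVERSE_INDEX]


def enumerate_objects(start):
    """Return every offset reachable from 'start' by following offsets only.

    No SHA1 lookup and no decompression happens during the walk.
    """
    seen = set()
    order = []
    stack = [start]
    while stack:
        off = stack.pop()
        if off in seen:
            continue
        seen.add(off)
        order.append(off)
        stack.extend(PACK[off][1])
    return order


def sha1_at(offset):
    """Reverse lookup: binary-search the offset-sorted index for the SHA1."""
    i = bisect.bisect_left(OFFSETS, offset)
    if i == len(OFFSETS) or OFFSETS[i] != offset:
        raise KeyError(offset)
    return REVERSE_INDEX[i][1]
```

Here enumerate_objects(200) visits all six toy objects without ever
touching a SHA1, and sha1_at() is the only place that consults the
index.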

So walking the history graph and enumerating objects would require
nothing more than following straight pointers in the pack data in 99%
of the cases.  No object decompression, no memory buffer
allocation/deallocation to perform that decompression, no string
parsing in the tree object case, etc.  Only cross-pack references would
require a full SHA1-based lookup like we do now.

I still have to sit down and figure out the implications of this,
especially with forward references.  Those mean that the reference
might have to be an object index rather than a byte offset, so as to
allow for variable-length encoding, and I also need to make sure
index-pack can still reconstruct the pack index.  But that would only
add an indirect lookup, which shouldn't be significantly costly.

So that's the idea.  Keep the exact same functionality as we have now,
without any need for cache management, but put the data structure in a
form that should improve object enumeration by a significant margin.


Nicolas