On Sun, 7 Jun 2009, Sam Vilain wrote:

> On Fri, 2009-06-05 at 15:28 -0400, Nicolas Pitre wrote:
> > Please don't mess up the layers of abstractions.  Creating a list
> > of objects is currently the biggest bottleneck for big clones.  It
> > is not the object compression (delta) phase, it is not the pack
> > creation either.
> [...]
> > In other words, you should really concentrate on making 'git
> > rev-list --objects' and 'git rev-list --objects-edge' instantaneous
> > without having to parse any commit nor tree objects (at least for
> > the part already covered by the cache).  And as new objects are
> > added through 'git add' and 'git commit' or 'git fetch', the
> > traditional object parsing would have to take place only up to the
> > point where the cache could take over, and update the cache for the
> > non-covered objects while at it.  I think this is quite a difficult
> > problem already.
>
> It's a good thing, then, that this is exactly what the first
> milestones in the project are covering.
>
> What I don't understand though is that during initial clone the
> object compression phase does take a lot of time, and that there
> have been reports that large initial clones use a lot of VM for the
> packfile to be sent.

Did you reproduce those results yourself?  Did you also try out the
simple bench test I suggested while responding to Jakub?

For example, let's take the Linux kernel repository as I have it handy
here.  For the various phases I get:

  Counting objects    = 25 seconds
  Compressing objects =  2 seconds
  Writing objects     =  3 seconds

A long compression phase is a sign of a really badly packed repository.
The solution to that is simply to repack.  After that, subsequent
clones as well as further repacks will have a much shorter compression
phase, as the work performed in the first repack will be directly
reusable.
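If you want to see that breakdown on your own repository, something
along these lines should show it (a rough sketch: --all here merely
stands in for whatever set of refs a real clone would actually ask
for):

  # object enumeration only -- this is the "counting" phase
  $ time git rev-list --objects --all > /dev/null

  # full pack generation, thrown away; the compression and writing
  # phases are reported as progress on stderr
  $ git rev-list --objects --all | git pack-objects --stdout > /dev/null

  # if the compression phase is long, repack once so the delta work
  # becomes reusable (add -f --window=... --depth=... for a more
  # aggressive delta search)
  $ git repack -a -d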
> So what I'm doing is trying to answer the question about why we
> can't just start streaming the pack as soon as we know how many
> objects will be in it.  All those other stages can happen in
> parallel - for instance I would wager that it's far more efficient
> to detect loops as we're streaming, as the packfile is accessed,
> rather than having to seek and read each object header in the
> packfile up front.

My stance on this is that it is fundamentally impossible to do
efficiently, if at all.

What we currently do is not loop detection but rather loop avoidance.
And because your pack set is not a static thing (new packs are created
or fetched, and sometimes they're repacked into a single pack for a
while), you simply can't have a stable cache of object topology based
on how objects are stored in packs.  This is simply not immutable
data.  Only object names and contents are immutable, nothing more.

What I'm telling you is: in order to know how many objects the pack
which is about to be created will contain, you need to count them.
This is what currently takes 25 seconds in the above example.  Then
comes a 2-second compression phase, which could be even shorter if
your repository were better packed than mine currently is.  At that
point the writing phase starts and the pack header is streamed.  So
trying to stream the pack as soon as you know how many objects it
contains will save you a big whoopee 2 seconds.  This is nothing to go
crazy about, really.

Yet, to be able to stream a pack right away, you would still need to
know _where_ to pick the pack data from.  And in most cases this is
not from a nice single pack but rather from multiple ones.  And some
of those packs have objects which are duplicated in other packs.  And
yet, by mixing pack data from multiple source packs, you gain new
opportunities for delta compression, because you're now putting
together objects which were previously kept apart and therefore
couldn't have deltas created between them (that's the 2-second phase
above).

You still must be careful during that phase not to create delta loops,
and to do that you must take into account all the other deltas from
existing source packs that you are not going to recompute but just
copy straight into the new pack.  And of course there is the delta
depth limit to enforce.  This means that some rearrangements of deltas
force some objects which were previously deltas to become undeltified,
or to become new deltas against a shallower base object.  And the only
way to know what form each object will take in this context is, again,
by going through the full compression phase, which works on a list of
objects whose ordering is totally different from the one used to
actually store objects in the pack.  In other words, you cannot work
out delta issues as you stream the pack, because you don't want to
write objects into the pack in the same order used for delta
processing.

Why a different object ordering for storing in the pack?  Because we
want to create the best data layout possible in the new pack so that
future runtime access to that pack will be optimal.  This often means
that objects are picked from existing packs not in a linear fashion
but rather in a seemingly random way.  This is because we want the
most recent commits to get the best IO patterns in a pack.  However,
when the repository is updated through fetches or pushes, the newly
obtained packs usually carry even more recent commits for which the
bulk of referenced objects are still to be found in older packs (this
is why a fetch is so quick).  So the more fetches or pushes are
performed, the less efficient your repository becomes with regard to
the most recent commits.  This is why a subsequent repack will
reshuffle all objects so that the most recent commits get a linear IO
pattern again, by picking objects from the most recent packs as well
as from older packs and putting them all contiguously in the new pack.
But again, you cannot know whether those objects will be in delta form
or not, or against which base object, before the compression phase is
over.
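If you want to see what that layout looks like in an existing pack,
'git verify-pack -v' gives a decent view (the pack name below is just
a placeholder for whatever sits in your .git/objects/pack/ directory):

  $ git verify-pack -v .git/objects/pack/pack-<sha1>.idx | less

Each line shows an object's name, type, size, size in the pack and
offset, plus, for deltified objects, the delta depth and the base
object; the summary at the end tells you how many objects sit at each
delta chain length.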
I hope this gives you a better idea of the answer to the question
about why we can't just start streaming the pack as soon as we know
how many objects will be in it, and why all those other stages cannot
simply happen in parallel.

> In summary, I'm trying to make sure that the revision cache contains
> whatever information might help with making it start as soon as
> possible, where putting the information in is trivial.
>
> If pack-objects integration is as small a win as you say - and I
> have no hard facts to dispute this - then when we come to that part
> and investigate I'm sure with the facts in hand we will agree on
> what approach to take.

Good, I'm glad you see things that way.  Because, to me, even just the
notion of a good revision cache is not that simple.

> > What pack-objects would greatly benefit from is a facility that
> > could provide a list of objects with their SHA1, type, size
>
> Which size? compressed, uncompressed?  What does that win us?  Again
> I'm trying to support all information with a use case.

Uncompressed.  That information is used by the delta heuristics, and
the code currently does its best to avoid seeking all over through
delta chains to fetch that tiny bit of information whenever that can
be avoided (have a look at check_object() in builtin-pack-objects.c).

> > and the 32-bit object path hash
>
> I'm not sure what this means, can you refer me to some docs or
> relevant source?

If you don't know what that is, then I'm afraid you might be lacking
some background on the packing and delta strategy.  I'd suggest you
read Documentation/technical/pack-heuristics.txt first, and then find
out from the commit logs for builtin-pack-objects.c (or even within
the file itself) what has changed since then.


Nicolas