On 7 Sep 2006 21:01:12 -0400, linux@xxxxxxxxxxx <linux@xxxxxxxxxxx> wrote:
> > When the client wants a shallow clone it starts by telling the server
> > all of the HEADs and how many change sets down each of those HEADs it
> > has locally. That's a small amount of data to transmit and it can be
> > easily tracked. Let's ignore merged branches for the moment.
>
> When you say "change set", I'm going to assume you mean "commit object".
> Okay. Now, the server hasn't heard of one or more of those commit
> objects, because they're local changes. What then?
Toss them; if they don't exist on the server, the server isn't going to
be able to send any objects for them anyway.
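Roughly, the idea is something like this (a toy Python sketch, not the
real git protocol; the message shape and all the names are invented):

    # Illustrative sketch only; not the real git wire protocol.
    # The client reports each head it has plus how many commits deep its
    # local copy of that head is; the server simply drops commits it has
    # never seen, since it cannot produce objects for them.

    def client_request(local_heads):
        """local_heads: dict mapping ref name -> (commit_sha, local_depth)."""
        return [(ref, sha, depth) for ref, (sha, depth) in local_heads.items()]

    def server_filter(request, known_commits):
        """Keep only the commits the server actually has; toss the rest."""
        usable = []
        for ref, sha, depth in request:
            if sha in known_commits:      # server knows this commit
                usable.append((ref, sha, depth))
            # else: a purely local commit; nothing the server can send for it
        return usable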
> Another issue is that a client with a nearly-full copy has to do a full
> walk of its history to determine the depth count that it has. That can
> be more than 2500 commits down in the git repository, and worse in the
> mozilla one. It's actually pretty quick (git-show-branch --more=99999
> will do it), but git normally tries to avoid full traversals like the
> plague.
The client would track this incrementally and not recompute it from
scratch each time.
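Something along these lines, just as a sketch of the bookkeeping (the
cache file and helper names are invented; this is not anything git does
today):

    # Sketch of incremental depth bookkeeping on the client side.
    # Instead of walking the whole history to count commits below a head,
    # keep a small cache of {ref: depth} and bump it whenever a commit or
    # fetch adds commits on top of that ref.

    import json, os

    DEPTH_CACHE = ".git/shallow-depths"   # hypothetical cache file

    def load_depths():
        if os.path.exists(DEPTH_CACHE):
            with open(DEPTH_CACHE) as f:
                return json.load(f)
        return {}        # no cache yet; would need one full walk to seed it

    def record_new_commits(ref, count):
        """Called after a commit/fetch adds `count` commits on top of `ref`."""
        depths = load_depths()
        depths[ref] = depths.get(ref, 0) + count
        with open(DEPTH_CACHE, "w") as f:
            json.dump(depths, f)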
> Oh, and was "for the moment" supposed to last past the end of your
> e-mail? I don't see what to do if there's a merge in the history and
> the depth on different sides is not equal. E.g. the history looks like:
>
>    ...a---b---c---d---e---f
>                             \
>    ...w---x---y--------------HEAD
>                             /
>    ...p---q---r---s---------
>
> Where "..." means that there are ancestors, but they're missing.
>
> > If you haven't updated for six months when the server walks backwards
> > for 10 change sets it's not going to find anything you have locally.
> > When this situation is encountered the server needs to generate a
> > delta just for you between one of the change sets it knows you have
> > and one of the 10 change sets you want. By generating this one-off
> > delta it lets you avoid the need to fetch all of the objects back to a
> > common branch ancestor. The delta functions as a jump over the
> > intervening space.
>
> Your choice of words keeps giving me the impression that you believe
> that a "change set" is a monolithic object that includes all the
> changes made to all the files. It's neither monolithic nor composed of
> changes. A commit object consists solely of metadata, and contains a
> pointer to a tree object, which points recursively to the entire
> project state at the time of the commit.
I was using "change set" to refer to a snapshot.
> There is massive sharing of component objects between successive
> commits, but they are NOT stored as deltas relative to one another.
Yes, most of the sharing occurs via the tree structures.
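A quick way to see that sharing (a throwaway Python sketch on top of the
stock git-ls-tree plumbing; nothing official, just for illustration):

    # Count how many blob objects two adjacent commits have in common.
    import subprocess

    def blob_ids(commit):
        out = subprocess.run(["git", "ls-tree", "-r", commit],
                             capture_output=True, text=True, check=True).stdout
        # each line looks like: "<mode> blob <sha>\t<path>"
        return {line.split()[2] for line in out.splitlines()}

    a, b = blob_ids("HEAD~1"), blob_ids("HEAD")
    print("blobs shared between the two commits:", len(a & b), "of", len(b))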
> The pack-forming heuristics tend to achieve that effect, but it is not
> guaranteed or required by design. Please understand that, deep in your
> bones: git is based on snapshots, not deltas.
>
> But okay, so we've sent the client the latest 10 commits, with a
> dangling tail at the bottom. (The files may have been sent as deltas
> against the old state, or just fresh compressed copies, but that
> doesn't matter.) Then the heads like "origin" have been advanced.
>
> So the old commit history is now unreferenced garbage; nothing points
> to it, and it will be deleted next time git-prune is run. Is that the
> intended behavior? Or should updates to an existing clone always
> complete the connections?
If you follow the links in what looks to be a dangling object, sooner or
later you will run into the root object or a 'not present' object. If
you hit one of those, the objects are not dangling and should be
preserved.

Here is another way to look at the shallow clone problem. The only
public ids in a git tree are the head and tag pointers. Send these to
the client. Now let's modify the git tools to fault the full objects in
one by one from the server whenever a git operation needs the object.
Dangling references would point to 'not-present' objects.

For a typical user using a model like this, how much of the Mozilla
repository would end up being faulted into their machine? Mozilla has
2M objects and 250K commits in a 450MB pack. My estimate is that a
typical user is going to touch less than 200K of the objects, and maybe
less than 100K.

Of course always faulting in full objects is wasteful. A smart scheme
would be to try and anticipate with some read-ahead and figure out ways
to send deltas. Tools like gitk would need to touch only the objects
needed to draw the screen and not run the full commit chain at startup.

This experiment can be done fairly easily. Put all of the kernel source
into a single pack file. Modify the git tools to set a bit in the index
file if an object is accessed. Use the pack for a few days and then
dump out the results.

--
Jon Smirl
jonsmirl@xxxxxxxxx
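For what it's worth, a minimal sketch of the fault-in model described
above, assuming a made-up fetch_object_from_server() call; none of this
is existing git machinery, it only illustrates the on-demand fetch plus
the "access bit" the experiment would record:

    # Sketch only: a lazily populated object store that faults in missing
    # objects from a server on first access, and remembers which objects
    # were ever touched.

    class FaultingObjectStore:
        def __init__(self, fetch_object_from_server):
            self.fetch = fetch_object_from_server  # hypothetical RPC
            self.local = {}                        # sha -> raw object data
            self.accessed = set()                  # the "access bit" record

        def get(self, sha):
            self.accessed.add(sha)
            if sha not in self.local:              # not present locally yet
                self.local[sha] = self.fetch(sha)  # fault it in on demand
            return self.local[sha]

        def dump_access_stats(self, total_objects):
            touched = len(self.accessed)
            print(f"touched {touched} of {total_objects} objects "
                  f"({100.0 * touched / total_objects:.1f}%)")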