Linus Torvalds wrote:
On Fri, 9 Jun 2006, Carl Worth wrote:
On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote:
Could you clone the repo and delete changesets earlier than 2004? Then
I would clone the small repo and work with it. Later I decide I want
full history, can I pull from a full repository at that point and get
updated? That would need a flag to trigger it since I don't want full
history to come over if I am just getting updates from someone else's
tree that has a full history.
This is clearly a desirable feature, and has been requested by several
people (including myself) looking to switch some large-ish histories
from an existing system to git.
The thing is, to some degree it's really fundamentally hard.
It's easy for a linear history. What you do for a linear history is to
just get the top commit, and the tree associated with it, and then you
cauterize the parent by just grafting it to go away. Boom. You're done.
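
As a rough illustration of the linear case, here is a toy Python model (not git's actual implementation; the `commits` table, `parents`, and `history` are made up for this sketch) in which a graft simply overrides a commit's recorded parents:

```python
# Toy model: each commit maps to its list of parents (linear history here).
commits = {
    "c3": ["c2"],
    "c2": ["c1"],
    "c1": [],
}

def parents(commit, grafts):
    """Return a commit's parents, letting a graft entry override the real ones."""
    return grafts.get(commit, commits[commit])

def history(tip, grafts):
    """Walk from tip through parents and return the commits reached, in order."""
    seen, stack, order = set(), [tip], []
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        order.append(c)
        stack.extend(parents(c, grafts))
    return order

full = history("c3", {})               # no grafts: the whole history
shallow = history("c3", {"c3": []})    # graft: pretend c3 has no parents
```

With the graft in place the walk stops at the top commit, which is all "cauterizing" means in this model.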
The problem is that if the preceding history _wasn't_ linear (or if
_subsequent_ development refers to it by having branched off at an
earlier point) and you try to pull your updates, the other end (which
knows about all the history) will assume you have history that you
don't actually have, and will send you a pack built on that assumption.
Which won't even necessarily have all the tree/blob objects (it assumed
you already had them), but more annoyingly, the history won't be
cauterized, and you'll have dangling commits. Which you can cauterize by
hand, of course, but you literally _will_ have to get the objects and
cauterize the thing by hand.
You're right that it's not "fundamentally impossible" to do: the git
format certainly _allows_ it. But the git protocol handshake really does
end up optimizing away all the unnecessary work by knowing that the other
side will have all the shared history, so lacking the shared history will
mean that you're a bit screwed.
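
The failure mode can be sketched with a toy model. This is a simplification under stated assumptions, not the real protocol: `server`, `pack_for`, and the commit names are hypothetical, and only commits (no trees or blobs) are modelled. The sender ships everything reachable from what is wanted, minus everything reachable from what the receiver claims to have:

```python
# Sender's view of history: commit -> parents.
# "side" branched off "old", which the shallow receiver never fetched.
server = {
    "old": [],
    "mid": ["old"],
    "tip": ["mid"],
    "side": ["old"],
    "merge": ["tip", "side"],
}

def reachable(graph, starts):
    """All commits reachable from the given starting commits."""
    seen, stack = set(), list(starts)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(graph[c])
    return seen

def pack_for(want, have):
    """Commits the sender ships, assuming the receiver holds all of have's history."""
    return reachable(server, [want]) - reachable(server, [have])

# The receiver holds only "tip", with its parent cauterized away.
receiver = {"tip": []}
pack = pack_for("merge", "tip")
receiver.update({c: server[c] for c in pack})

# Parents that are referenced but absent on the receiving side.
missing = {p for ps in receiver.values() for p in ps if p not in receiver}
```

Here the pack contains only "merge" and "side", and "side" points at "old", which the receiver does not have and was never sent.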
Here's an idea. How about separating trees and commits from the actual
blobs (e.g. in separate packs)? My reasoning is that the commits and
trees should only be a small portion of the overall repository size, and
should not be that expensive to transfer. (Of course, this is only a
guess, and needs some numbers to back it up.)
So, a shallow clone would receive all of the tree objects, and all of
the commit objects, and could then request a pack containing the blobs
represented by the current HEAD.
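
A minimal sketch of that split, assuming a flattened tree model (real git trees are nested, and the object names, `head_blobs`, and `changed_paths` are invented for illustration): the client holds every commit and tree, fetches only the blobs named by HEAD's tree, and can still tell which paths changed by comparing blob ids.

```python
# Hypothetical object store: each commit names a tree,
# and each tree maps a path to a blob id.
commits = [  # oldest first; the last entry plays the role of HEAD
    ("c1", "t1"),
    ("c2", "t2"),
    ("c3", "t3"),
]
trees = {
    "t1": {"Makefile": "b1", "main.c": "b2"},
    "t2": {"Makefile": "b1", "main.c": "b3"},
    "t3": {"Makefile": "b4", "main.c": "b3", "doc/README": "b5"},
}

def head_blobs():
    """Blob ids a shallow client would request: only those in HEAD's tree."""
    head_tree = trees[commits[-1][1]]
    return set(head_tree.values())

def changed_paths(old_tree, new_tree):
    """Paths whose blob id differs between two trees; no blob content needed."""
    old, new = trees[old_tree], trees[new_tree]
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))

needed = head_blobs()
step1 = changed_paths("t1", "t2")
step2 = changed_paths("t2", "t3")
```

In this model b1 and b2 are never transferred, yet the per-commit list of changed files is still available from the trees alone.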
In this way, the user has a history that shows all of the commit
messages, and can see _which_ files have changed over time. For example,
gitk and "git log" should still work, except for the actual file-level
diffs.
This would also enable other optimisations.
For example, documentation people would only need to get the objects
under the doc/ tree, and would not need to actually check out the
source. Git could detect any actual changes by checking whether it has
the previous blob in its local repository, and whether the file exists
locally. Creating a patch would obviously require that the person checks
out the previous version, but one could theoretically commit a new blob
to a repo without having the previous one (not saying that this would be
a good idea, of course).
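
To make the doc/ case concrete, here is a small sketch. The tree contents and the helpers `blobs_under` and `modified` are hypothetical; the `blob_id` function uses the way git names a blob, the SHA-1 of a "blob <size>\0" header followed by the content. A local file counts as changed when its id no longer matches the one recorded in the tree, which needs no copy of the previous blob.

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Git-style blob id: SHA-1 over a 'blob <size>\\0' header plus the content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Hypothetical flattened HEAD tree: path -> blob id.
head_tree = {
    "doc/intro.txt": blob_id(b"hello\n"),
    "doc/usage.txt": blob_id(b"usage\n"),
    "src/main.c": blob_id(b"int main(void) { return 0; }\n"),
}

def blobs_under(tree, prefix):
    """Entries a documentation-only client would fetch."""
    return {path: oid for path, oid in tree.items() if path.startswith(prefix)}

def modified(tree, worktree):
    """Local paths whose content hashes to a different id than the tree records."""
    return sorted(p for p, data in worktree.items() if tree.get(p) != blob_id(data))

doc_blobs = blobs_under(head_tree, "doc/")
worktree = {"doc/intro.txt": b"hello\n", "doc/usage.txt": b"new usage\n"}
changed = modified(head_tree, worktree)
```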
This would probably require Eric Biederman's "direct access to blob"
patches, I guess, in order to be feasible.
Regards,
Rogan