Re: Change set based shallow clone

Junio C Hamano <junkio@xxxxxxx> · Thu, 07 Sep 2006 22:30:51 -0700

"Martin Langhoff" <martin.langhoff@xxxxxxxxx> writes:

> On 9/8/06, Junio C Hamano <junkio@xxxxxxx> wrote:
> ...
>> Anyway, it is a useful feature, an important operation, and
>> it involves a lot of thinking (and hard work, obviously).
>>
>> I will not think about this anymore, until I have absolutely
>> nothing else to do for a solid few days, but you probably
>> haven't read this far ;-).
>
> For when you come back, then. I agree this is rather complicated, so
> much so that I suspect that we may be barking up the wrong tree.
>
> People who want shallow clones are actually asking for a "light"
> clone, in terms of "how much do I need to download". If everyone has
> the whole commit chain (but may be missing olden blobs, and even
> trees), the problem becomes a lot easier.

No, I do not think so.  You are just pushing the same problem to
another layer.

Reachability through commit ancestry chain is no more special
than reachability through commit-tree or tree-containment
relationships.  The grafts mechanism happen to treat commit
ancestry more specially than others, but that is not inherent
property of git [*1*].  It is only an implementation detail (and
limitation -- grafts cannot express anything but commit
ancestry) of the current system.

But let's touch a slightly different but related topic first.
People do not ask for shallow clones.  They just want faster
transfer of "everything they care about".  Shallow and lazy
clones are implementation techniques to achieve the goal of
making everything they care about appear available, typically
taking advantage of the fact that people care about recent past
a lot more than they do about ancient history.

Shallow clones do so by explicitly saying "sorry you told me
earlier you do not care about things older than X so if you now
care about things older than X please do such and such to deepen
your history".  What you are saying is a variant of shallow that
says "NAK; commits I have but trees and blobs associated with
such an old commit you have to ask me to retrieve from
elsewhere".  Lazy clones do so by "faulting missing objects on
demand" [*2*].  That is essentially automating the "please do
such and such" part.  So they are all not that different.

Now, first and foremost, while I would love to have a system
that gracefully operates with a "sparse" repository that lacks
objects that should exist from tag/commit/tree/blob reachability
point of view, it is an absolute requirement that I can tell why
objects are missing from a repository when I find some are
missing by running fsck-objects [*3*].  If repository is a
shallow clone, not having some object may be expected, but I
want to be able to tell repository corruption locally even in
that case, so "assume lack of object is perhaps it was not
downloaded by shallow cloning" is an unacceptable attitude, and
"when you find an object missing from your repository, you can
ask the server [*4*], and if the server does not have it then
you know your repository is corrupt otherwise it is Ok" is a
wrong answer.

I talked about the need of upload-pack protocol extension to
exchange grafts information between uploader and downloader to
come up with the resulting graft and update the grafts in
downloaded repository after objects tranfer is done.  It is
needed because by having such cauterizing graft entries I can be
sure that the repository is not corrupt when fsck-objects that
does look at grafts says "nothing is missing".  Jon talked about
"fault-in on demand and leave stub objects until downloading the
real thing"; those stub objects are natural extension of grafts
facility but extends to trees and blobs.  Either of these would
allow me to validate the sparse repository locally.

Now I'll really shut up ;-).

[*1*] For example, rev-list --object knows that it needs to show
the tree-containment when given only a tree object without any commit.

[*2*] You probably could write a FUSE module that mount on
your .git/objects, respond to accesses to .git/objects/??
directory and files with 38-byte hexadecimal names under them by
talking to other repositories over the wire.  No, I am not going
to do it, and I will probably not touch it even with ten-foot
pole for reasons I state elsewhere in this message.

[*3*] The real fsck-objects always looks at grafts.  The
fictitious version of fsck-objects I am talking about here is
the one that uses parents recorded on the "parent " line of
commit objects, that is, the true object level reachability.

[*4*] In git, there is no inherent server vs client or upstream
vs downstream relationship between repositories.  You may be
even fetching from many people and do not have a set upstream at
all.  Or you are _the_ upstream, and your notebook has the
latest devevelopment history, and after pushing that latest
history to your mothership repository, you may decide you do not
want ancient development history on a puny notebook, and locally
cauterize the history on your notebook repository and prune
ancient stuff.  The objects missing from the notebook repository
are retrievable from your mothership repository again if/when
needed and you as the user would know that (and if you are lucky
you may even remember that a few months later), but git doesn't,
and there is no reason for git to want to know it.  If we want
to do "fault-in on demand", we need to add a way to say "it is
Ok for this repository to be incomplete -- objects that are
missing must be completed from that repository (or those
repositories)".  But that's quite a special case of having a
fixed upstream.

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html