Re: With big repos and slower connections, git clone can be hard to work with

ellie <el@xxxxxxxxxxx> · Sat, 8 Jun 2024 13:05:47 +0200

I see! Unfortunate, but I'm thankful for your detailed explanation.

The "shallow-cloning and deepening is [...] expensive for the server"
part makes me sadder about the current situation. I don't like that I
need to make the server's life hard just because my connection is
shaky... :-|

> It's possible the client could do some analysis to see if it has
> complete segments of history. In practice it won't, because of the way
> we order packfiles (it's split by type, and then roughly
> reverse-chronological through history). If the server re-ordered its
> response to fill history from the bottom up, it would be possible.

I wonder if that would be the most feasible idea, if any at all...?

My main take-away is that I don't know enough to suggest a good way out, 
and that git is even more impressive and complex tech than I thought. 
Thanks so much for the detailed responses, and I hope at least some of 
my uninformed rambling was of any use.

Regards,

Ellie

On 6/8/24 12:35 PM, Jeff King wrote:
> On Sat, Jun 08, 2024 at 11:40:47AM +0200, ellie wrote:
>
> > Sorry if I'm misunderstanding, and I assume this is a naive suggestion that
> > may not work in some way: but couldn't git somehow keep a cache of all the
> > objects it has already fully downloaded? And then otherwise start over
> > cleanly (and automatically), but just get the objects it already has from
> > the local cache? In practice, that might already be enough to get through
> > a longer clone despite occasional hiccups.

> The problem is that the client/server communication does not share an
> explicit list of objects. Instead, the client tells the server some
> points in the object graph that it wants (i.e., the tips of some
> branches that it wants to fetch) and that it already has (existing
> branches, or nothing in the case of a clone), and then the server can do
> its own graph traversal to figure out what needs to be sent.
>
> When you've got a partially completed clone, the client can figure out
> which objects it received. But it can't tell the server "hey, I have
> commit XYZ, don't send that". Because the server would assume that
> having XYZ means that it has all of the objects reachable from there
> (parent commits, their trees and blobs, and so on). And the pack does
> not come in that order.
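
The want/have exchange described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the object names, the `GRAPH` dictionary and the `reachable` / `objects_to_send` helpers are all hypothetical, and real git negotiation is considerably more involved. It shows why advertising a commit that was only partially received would leave holes in the repository.

```python
# Toy model of want/have negotiation. All names here are hypothetical;
# this is not git's actual implementation.

def reachable(graph, tips):
    """Return every object reachable from the given tips."""
    seen = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(graph.get(obj, ()))
    return seen


def objects_to_send(graph, wants, haves):
    """Server side: send what the wants reach, minus what the haves reach."""
    return reachable(graph, wants) - reachable(graph, haves)


# Each object points at the objects it references.
GRAPH = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
}

# A fresh clone has no haves, so everything is sent.
full_clone = objects_to_send(GRAPH, wants={"commit2"}, haves=set())

# An interrupted clone that happened to receive these objects only.
received = {"commit2", "commit1", "tree2"}

# If the client claimed "have commit1", the server would assume tree1 and
# blob_a are present too, and would never send them.
resumed = objects_to_send(GRAPH, wants={"commit2"}, haves={"commit1"})
missing = set(GRAPH) - (received | resumed)
print(sorted(missing))
```

In this toy graph the resumed fetch leaves `tree1` and `blob_a` missing, which is the failure mode the message describes.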

> And even if there was a way to disable reachability analysis, and send a
> "raw" set of objects that we already have, it would be prohibitively
> large. The full set of sha1 hashes for linux.git is over 200MB. So
> naively saying "don't send object X, I have it" would approach that
> size.
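
For a rough sense of scale, the arithmetic can be sketched as below. The assumptions are loud ones: hashes are counted as raw 20-byte SHA-1 values (the message does not say whether raw or hex was meant), and the object count of 10 million is an illustrative round number chosen to land on 200MB, not a measured figure for linux.git. The helper name `raw_hash_list_bytes` is hypothetical.

```python
SHA1_RAW_BYTES = 20  # a SHA-1 digest is 160 bits


def raw_hash_list_bytes(num_objects, hash_bytes=SHA1_RAW_BYTES):
    """Size of a bare list of object hashes, with no framing overhead."""
    return num_objects * hash_bytes


# Illustrative object count only (an assumption, not a measurement).
size = raw_hash_list_bytes(10_000_000)
print(size / 1_000_000, "MB")
```

Sending the hashes in hex rather than raw form would double the figure, since each byte becomes two characters.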

> It's possible the client could do some analysis to see if it has
> complete segments of history. In practice it won't, because of the way
> we order packfiles (it's split by type, and then roughly
> reverse-chronological through history). If the server re-ordered its
> response to fill history from the bottom up, it would be possible. We
> don't do that now because it's not really the optimal order for
> accessing objects in day-to-day use, and the packfile the server sends
> is stored directly on disk by the client.
>
> -Peff
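
The effect of pack ordering on a truncated transfer can be sketched with a toy example. Everything here is a simplifying assumption: three commits, each with exactly one tree and one blob of its own, and the hypothetical helpers `by_type_newest_first`, `bottom_up` and `complete_commits`. Real packfile ordering is only "roughly" like this, as the message says.

```python
TYPE_ORDER = {"commit": 0, "tree": 1, "blob": 2}

# (name, type, age rank of the commit that introduced it; 1 is oldest)
OBJECTS = [
    ("c1", "commit", 1), ("t1", "tree", 1), ("b1", "blob", 1),
    ("c2", "commit", 2), ("t2", "tree", 2), ("b2", "blob", 2),
    ("c3", "commit", 3), ("t3", "tree", 3), ("b3", "blob", 3),
]

# Objects each commit introduces (simplified: its own tree and blob).
NEEDS = {
    "c1": {"c1", "t1", "b1"},
    "c2": {"c2", "t2", "b2"},
    "c3": {"c3", "t3", "b3"},
}


def by_type_newest_first(objs):
    """Split by type, then newest first (like the order described)."""
    return sorted(objs, key=lambda o: (TYPE_ORDER[o[1]], -o[2]))


def bottom_up(objs):
    """Hypothetical alternative: fill history from the oldest commit up."""
    return sorted(objs, key=lambda o: (o[2], TYPE_ORDER[o[1]]))


def complete_commits(received):
    """Commits whose own objects and all ancestors' objects arrived."""
    names = {o[0] for o in received}
    done = []
    for commit in ["c1", "c2", "c3"]:  # oldest first
        if not NEEDS[commit] <= names:
            break
        done.append(commit)
    return done


# Pretend the connection dropped after five objects.
print(complete_commits(by_type_newest_first(OBJECTS)[:5]))
print(complete_commits(bottom_up(OBJECTS)[:5]))
```

With the type-split order, the five received objects are all commits and trees, so no commit is complete. With the bottom-up order, the oldest commit is complete and could serve as a "have" on retry.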