Re: With big repos and slower connections, git clone can be hard to work with

On 7/8/24 2:30 PM, rsbecker@xxxxxxxxxxxxx wrote:
On Sunday, July 7, 2024 10:28 PM, ellie wrote:
I was intending to suggest that, depending on the largest object in the repository,
resume may remain a concern for lower-end users. My apologies for being unclear.

As for my concrete problem, I can only guess what's happening; maybe GitHub's
HTTPS proxy is too eagerly discarding slow connections:

$ git clone https://github.com/maliit/keyboard maliit-keyboard
Cloning into 'maliit-keyboard'...
remote: Enumerating objects: 23243, done.
remote: Counting objects: 100% (464/464), done.
remote: Compressing objects: 100% (207/207), done.
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 2507 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

Deepening seems to fail for this repo, since even a single deepen step already gets
killed off. Git HTTPS clones from any other hoster I tried, including gitlab.com,
work fine, as do git SSH clones from github.com.

Sorry for the long tangent. Basically, my point was just that resume still seems like a
good idea even with deepen existing.

Regards,

Ellie

On 7/8/24 3:27 AM, rsbecker@xxxxxxxxxxxxx wrote:
On Sunday, July 7, 2024 7:42 PM, ellie wrote:
I have now encountered a repository where even --deepen=1 keeps failing,
because it pulls in something fairly large that takes a few minutes.
(Possibly the server proxy has a faulty timeout setting that punishes
slow connections, but for connections that are unreliable on the client
side the problem would be the same.)

So this workaround sadly doesn't seem to cover all the cases where a resume would help.

Regards,

Ellie

On 6/8/24 2:46 AM, ellie wrote:
The deepening worked perfectly, thank you so much! I hope a resume
will still be considered, however, even if just to help out newcomers.

Regards,

Ellie

On 6/8/24 2:35 AM, rsbecker@xxxxxxxxxxxxx wrote:
On Friday, June 7, 2024 8:03 PM, ellie wrote:
Subject: Re: With big repos and slower connections, git clone can
be hard to work with

Thanks, this is very helpful as an emergency workaround!

Nevertheless, I usually want the entire history, especially since
I wouldn't mind waiting half an hour. But without resume, I've
regularly found that it just won't complete even if I give it the
time, while much longer downloads in the browser would.
The key problem here seems to be the lack of any resume.

I hope this helps to understand why I made the suggestion.

Regards,

Ellie

On 6/8/24 1:33 AM, rsbecker@xxxxxxxxxxxxx wrote:
On Friday, June 7, 2024 7:28 PM, ellie wrote:
I'm terribly sorry if this is the wrong place, but I'd like to
raise a potential issue with "git clone".

The problem is that any sort of interruption or connection
issue, no matter how brief, causes the clone to stop and leave nothing
behind:

$ git clone https://github.com/Nheko-Reborn/nheko
Cloning into 'nheko'...
remote: Enumerating objects: 43991, done.
remote: Counting objects: 100% (6535/6535), done.
remote: Compressing objects: 100% (1449/1449), done.
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 2771 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
$ cd nheko
bash: cd: nheko: No such file or directory

In my experience, this can be really impactful with 1. big repositories
and 2. unreliable internet - which I would argue isn't unheard of!
E.g. a developer may work over a mobile connection on a business trip.
The result can even be that a repository is uncloneable for some users!

This has left me in the absurd situation where I was able to
download a tarball via HTTPS from the git hoster just fine, even
way larger binary release items, thanks to the browser's HTTPS
resume. And yet a simple git clone of the same project failed repeatedly.

My deepest apologies if I missed an option to fix or address this.
But to sum up, please consider making git clone recover from hiccups.

Regards,

Ellie

PS: I've seen git hosters have apparent proxy bugs, like timing
out slower git clone connections from the server side even if
the transfer is ongoing. A git auto-resume would reduce the
impact of that, too.

I suggest that you look into two git topics: --depth, which controls
how much history is obtained in a clone, and sparse-checkout, which
describes the part of the repository you will retrieve. You can
prune the contents of the repository so that the clone is faster, if
you do not need all of the history or all of the files. This is
typically done in complex large repositories, particularly those
used for production support as release repositories.
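
For illustration, a rough sketch of that approach (the URL and the paths
to keep are just placeholders):

$ git clone --depth=1 --filter=blob:none --sparse https://example.com/some/repo.git
$ cd repo
$ git sparse-checkout set src docs

The --filter=blob:none part additionally defers downloading file contents
until the paths are actually checked out.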

Consider doing the clone with --depth=1 then using git fetch
--depth=n as the resume. There are other options that effectively
give you a resume, including --deepen=n.
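
For example, one possible sequence, using the repository from the example
above (the step size is arbitrary):

$ git clone --depth=1 https://github.com/Nheko-Reborn/nheko
$ cd nheko
$ git fetch --deepen=500    # repeat as needed; each step only transfers the extra history
$ git fetch --unshallow     # once the connection holds, fetch the remaining full history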

Build automation, like Jenkins, uses this to speed up the clone/checkout.

Can you please provide more details on this? It is difficult to understand your issue
without knowing what situation is failing. What size file? Is this a large single pack
file? Can you reproduce this with a script we can try?


First, for this mailing list, please put your replies at the bottom.

Second, the full clone takes under 5 seconds on my system and does not hit any of the errors you are seeing. I suggest that your ISP may be throttling your account. I have seen this happen on some ISPs under SSH, but rarely under HTTPS. It is likely a firewall or, as you said, a proxy setting. GitHub has no proxy.

My suggestion is that this is more of a communication issue than a large-repo issue. 133 MB is a relatively small repository and clones quickly. This might be something to take up on the GitHub support forums rather than with git, since it seems like something in the path outside of git is not working correctly. None of the files in this repository, including the pack files, is larger than 100 blocks, so there is not much point in a mid-pack restart.
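
(For reference, one way to check the largest objects in a clone of the
repository is something like:

$ git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sort -k3 -n -r | head

which lists the biggest objects by size.)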


I apologize for not placing the responses where expected.

It seems extremely unlikely to me that this could be an ISP issue, for the reasons I already listed. An additional one is that HTTPS downloads from GitHub outside of git, e.g. of zip archives of way larger files, work fine as well.

Nevertheless, this is irrelevant to my initial request, since even if it's not caused by a GitHub server-side issue, a resume would still help.

Regards,

Ellie



