Re: [WIP 0/7] CDN offloading of fetch response

Hi,

On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
>
> Christian Couder wrote:
> > On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
>
> > Especially I'd like to know what the client should do if they find out
> > that, for example, a repo that contains a lot of large files is
> > configured so that the large files should be fetched from a CDN that
> > the client cannot use. Is the client forced to find or set up another
> > repo configured differently if the client still wants to use CDN
> > offloading?
>
> The example from that message:
>
>   For example I think the Great Firewall of China lets people in China
>   use GitHub.com but not Google.com. So if people start configuring
>   their repos on GitHub so that they send packs that contain Google.com
>   CDN URLs (or actually anything that the Firewall blocks), it might
>   create many problems for users in China if they don't have a way to
>   opt out of receiving packs with those kind of URLs.
>
> But the same thing can happen with redirects, with embedded assets in
> web pages, and so on.

I don't think it's the same situation, because CDN offloading is
likely to be used for large objects that some hosting sites like
GitHub, GitLab and Bitbucket might not want to store for free on
their machines. (I think the current limits are around 10 GB or
20 GB, everything included, per user.)

So it's likely that users will want a way to host incomplete repos
on such sites, using CDN offloading to a CDN on another site. And
then, if that CDN is not accessible for some reason, things will
completely break when users clone.

You could say that it's similar to a video not being available on a
web page, but a web browser can still render the rest of the page
when a video is unavailable, while a clone would just fail. So I
don't think it's the same kind of issue.

And by the way, that's a reason why I think it's important to think
about this in relation to promisor/partial clone remotes, because
with them it's less of a big deal if the CDN is unavailable,
temporarily or not, for some reason.
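
For illustration, here is roughly the kind of setup I have in mind
(the URL is made up): the client does a partial clone where large
blobs are left out of the initial pack, and the remote that serves
them lazily is just a promisor remote:

    # large blobs are omitted from the initial pack and fetched
    # on demand from the promisor remote (here the origin)
    git clone --filter=blob:limit=1m https://git.example.org/repo.git

If the promisor remote is unreachable, the clone itself still works;
only the lazy fetches of the missing blobs fail, and the user can
reconfigure where those lazy fetches go.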

> I think in this situation the user would likely
> (and rightly) blame the host (github.com) for requiring access to a
> separate inaccessible site, and the problem could be resolved with
> them.

The host will say that it's the repo admins' responsibility to use a
CDN that works for the repo users (or to pay for more space on the
host). Then repo admins will say that they use this CDN because it's
simpler for them, or the only thing they can afford or deal with.
(For example I don't think it would be easy for westerners to use a
Chinese CDN.) Then users will likely blame Git for not supporting a
way to use a different CDN than the one configured in each repo.

> The beauty of this is that it's transparent to the client: the fact
> that packfile transfer was offloaded to a CDN is an implementation
> detail, and the server takes full responsibility for it.

Who is "the server" in real life? Are you sure they would be ok with
taking full responsibility?

And yes, I agree that transparency for the client is nice. And if it's
really nice, then why not have it for promisor/partial clone remotes
too? But then do we really need duplicated functionality between
promisor remotes and CDN offloading?

And also I just think that in real life there needs to be an easy way
to override this transparency, and we already have that with promisor
remotes.
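
For example, a user behind a firewall that blocks the configured CDN
could add another promisor remote pointing to a mirror they can
actually reach (the remote name and URL are made up, and this assumes
support for more than one promisor remote):

    # lazy fetches of missing objects can then also go to this mirror
    git -C repo remote add mymirror https://mirror.example.net/repo.git
    git -C repo config remote.mymirror.promisor true

and the missing objects would then be fetched from that mirror
instead of from the blocked CDN.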

> This doesn't stop a hosting provider from using e.g. server options to
> allow the client more control over how their response is served, just
> like can be done for other features of how the transfer works (how
> often to send progress updates, whether to prioritize latency or
> throughput, etc).

Could you give a more concrete example of what could be done?
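
I guess you mean something like the client passing a protocol v2
server option that the server is free to interpret, for example (the
option name below is made up, and a server would have to be taught
what to do with it):

    git -c protocol.version=2 fetch --server-option=no-cdn origin

but then the available options and their meaning would be specific to
each server, wouldn't they?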

> What the client *can* do is turn off support for packfile URLs in a
> request completely.  This is required for backward compatibility and
> allows working around a host that has configured the feature
> incorrectly.

If the full content of a repo is really large, the full packfile
sent by an initial clone could be really big, and many client
machines might not have enough memory to deal with it. And this
assumes that repo hosting providers would be ok with hosting very
large repos in the first place.
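
With partial clone, on the other hand, the initial pack can be kept
small by filtering out the large objects, for example with something
like:

    # only commits and trees in the initial pack;
    # blobs are fetched on demand later
    git clone --filter=blob:none https://git.example.org/big-repo.git

so the client never has to receive and index one huge packfile up
front.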

With promisor remotes, it's less of a problem if for example:

- a repo hosting provider is not ok with very large repos,
- a CDN is unavailable,
- a repo admin has not configured some repos very well.

Thanks for your answer,
Christian.


