Re: [WIP 0/7] CDN offloading of fetch response

Christian Couder <christian.couder@xxxxxxxxx> · Mon, 4 Mar 2019 09:54:07 +0100

Hi,

On Fri, Mar 1, 2019 at 12:21 AM Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
>
> Sorry for the slow followup.  Thanks for probing into the design ---
> this should be useful for getting the docs to be clear.
>
> Christian Couder wrote:
>
> > So it's likely that users will want a way to host on such sites
> > incomplete repos using CDN offloading to a CDN on another site. And
> > then if the CDN is not accessible for some reason, things will
> > completely break when users will clone.
>
> I think this would be a broken setup --- we can make it clear in the
> protocol and server docs that you should only point to a CDN for which
> you control the contents, to avoid breaking clients.

We can say whatever in the docs, but in real life if it's
simpler/cheaper for repo admins to use a CDN for example on Google and
a repo on GitHub, they are likely to do it anyway.

> That doesn't prevent adding additional features in the future e.g. for
> "server suggested alternates" --- it's just that I consider that a
> separate feature.
>
> Using CDN offloading requires cooperation of the hosting provider.
> It's a way to optimize how fetches work, not a way to have a partial
> repository on the server side.

We can say whatever we want about what it is for. Users are likely to
use it anyway in the way they think it will benefit them the most.

> > On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
>
> >> This doesn't stop a hosting provider from using e.g. server options to
> >> allow the client more control over how their response is served, just
> >> like can be done for other features of how the transfer works (how
> >> often to send progress updates, whether to prioritize latency or
> >> throughput, etc).
> >
> > Could you give a more concrete example of what could be done?
>
> What I mean is passing server options using "git fetch --server-option".
> For example:
>
>         git fetch -o priority=BATCH origin master
>
> or
>
>         git fetch -o avoid-cdn=badcdn.example.com origin master
>
> The interpretation of server options is up to the server.

If you often have to tell things like "-o
avoid-cdn=badcdn.example.com", then how is it better than just
specifying "-o usecdn=goodcdn.example.com" or even better using the
remote mechanism to configure a remote for goodcdn.example.com and
then configuring this remote to be used along the origin remote (which
is what many promisor remotes is about)?

> >> What the client *can* do is turn off support for packfile URLs in a
> >> request completely.  This is required for backward compatibility and
> >> allows working around a host that has configured the feature
> >> incorrectly.
> >
> > If the full content of a repo is really large, the size of a full pack
> > file sent by an initial clone could be really big and many client
> > machines could not have enough memory to deal with that. And this
> > suppose that repo hosting providers would be ok to host very large
> > repos in the first place.
>
> Do we require the packfile to fit in memory?  If so, we should fix
> that (to use streaming instead).

Even if we stream the packfile to write it, at one point we have to use it.

And I could be wrong but I think that mmap doesn't work on Windows, so
I think we will just try to read the whole thing into memory. Even on
Linux I don't think it's a good idea to mmap a very large file and
then use some big parts of it which I think we will have to do when
checking out the large files from inside the packfile.

Yeah, we can improve that part of Git too. I think though that it
means yet another thing (and not an easy one) that needs to be
improved before CDN offloading can work well in the real world.

I think that the Git "development philosophy" since the beginning has
been more about adding things that work well in the real world first
even if they are small and a bit manual, and then improving on top of
those early things, rather than adding a big thing that doesn't quite
work well in the real world but is automated and then improving on
that.