Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 05 Dec 2018 14:02:06 +0900

Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU usage
> +(for example, by serving some data through a CDN), and (in the future) provides
> +some measure of resumability to clients.

Without reading the remainder, this makes readers anticipate a few
good things ;-)

 - "part of", so pre-generated constant material can be given from
   CDN and then followed-up by "filling the gaps" small packfile,
   perhaps?

 - The "part of" transmission may not bring the repository up to
   date wrt to the "want" objects; would this feature involve "you
   asked history up to these commits, but with this pack-uri, you'll
   be getting history up to these (somewhat stale) commits"?

Anyway, let's read on.

> +This feature is available only in protocol version 2.
> +
> +Protocol
> +--------
> +
> +The server advertises `packfile-uris`.
> +
> +If the client replies with the following arguments:
> +
> + * packfile-uris
> + * thin-pack
> + * ofs-delta

"with the following" meaning "with all of the following", or "with
any of the following"?  Is there a reason why the server side must
require that the client understands and is willing to accept a
thin-pack when wanting to use packfile-uris?  The same question for
the ofs-delta.

When the pregenerated constant material the server plans to hand out
the uris for was prepared by using ofs-delta encoding, the server
cannot give the uri to it when the client does not want ofs-delta
encoded packfile, but it feels somewhat strange that we require the
most capable client at the protocol level.  After all, the server
side could prepare one with ofs-delta and another without ofs-delta
and depending on what the client is capable of, hand out different
URIs, if it wanted to.

The reason why I care is because thin and ofs will *NOT* stay
forever be the only optional features of the pack format.  We may
invent yet another such optional 'frotz' feature, which may greatly
help the efficiency of the packfile encoding, hence it may be
preferrable to always generate a CDN packfile with that feature, in
addition to thin and ofs.  Would we add 'frotz' to the above list in
the documentation, then?  What would happen to existing servers and
clients written before that time then?

My recommendation is to drop the mention of "thin" and "ofs" from
the above list, and also from the following paragraph.  The "it MAY
send" will serve as a practical escape clause to allow a server/CDN
implementation that *ALWAYS* prepares pregenerated material that can
only be digested by clients that supports thin and ofs.  Such a server
can send packfile-URIs only when all of the three are given by the
client and be compliant.  And such an update to the proposed document
would allow a more diskful server to prepare both thin and non-thin
pregenerated packs and choose which one to give to the client depending
on the capability.

> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.

So, this is OK, but

> +Clients then should understand that the returned packfile could be incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack and
> +which may contain offset deltas).

weaken or remove the (parenthetical comment) in the last sentence,
and replace the beginning of the section with something like

	If the client replies with 'packfile-uris', when the server
	sends the packfile, it MAY send a `packfile-uris` section...

You may steal what I wrote in the above response to help the
server-side folks to decide how to actually implement the "it MAY
send a packfile-uris" part in the document.

> +Server design
> +-------------
> +
> +The server can be trivially made compatible with the proposed protocol by
> +having it advertise `packfile-uris`, tolerating the client sending
> +`packfile-uris`, and never sending any `packfile-uris` section. But we should
> +include some sort of non-trivial implementation in the Minimum Viable Product,
> +at least so that we can test the client.
> +
> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> +with the given sha1 can be replaced by the given URI. This allows, for example,
> +servers to delegate serving of large blobs to CDNs.

;-)

> +Client design
> +-------------
> +
> +While fetching, the client needs to remember the list of URIs and cannot
> +declare that the fetch is complete until all URIs have been downloaded as
> +packfiles.
> +
> +The division of work (initial fetch + additional URIs) introduces convenient
> +points for resumption of an interrupted clone - such resumption can be done
> +after the Minimum Viable Product (see "Future work").
> +
> +The client can inhibit this feature (i.e. refrain from sending the
> +`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.

OK, this comes back to what I alluded to at the beginning.  We could
respond to a full-clone request by feeding a series of packfile-uris
and some ref information, perhaps like this:

	* Grab this packfile and update your remote-tracking refs
          and tags to these values; you'd be as if you cloned the
          project when it was at v1.0.

	* When you are done with the above, grab this packfile and
          update your remote-tracking refs and tags to these values;
          you'd be as if you cloned the project when it was at v2.0.

	* When you are done with the above, grab this packfile and
          update your remote-tracking refs and tags to these values;
          you'd be as if you cloned the project when it was at v3.0.

	...

	* When you are done with the above, here is the remaining
          packdata to bring you fully up to date with your original
          "want"s.

and before fully reading the proposal, I anticipated that it was
what you were going to describe.  The major difference is "up to the
packdata given to you so far, you'd be as if you fetched these" ref
information, which would allow you to be interrupted and then simply
resume, without having to remember the set of packfile-uris yet to
be processed across a fetch/clone failure.  If you sucessfully fetch
packfile for ..v1.0, you can update the remote-tracking refs to
match as if you fetched back when that was the most recent state of
the project, and then if you failed while transferring packfile for
v1.0..v2.0, the resuming would just reissue "git fetch" internally.

I think what you proposed, i.e. without the "with the data up to
this packfile, you have history to these objects", would also work,
even though it requires us to remember more of what we learned
during the initial attempt throughout retrying failed transfers.

> +Future work
> +-----------
> +
> +The protocol design allows some evolution of the server and client without any
> +need for protocol changes, so only a small-scoped design is included here to
> +form the MVP. For example, the following can be done:
> +
> + * On the server, a long-running process that takes in entire requests and
> +   outputs a list of URIs and the corresponding inclusion and exclusion sets of
> +   objects. This allows, e.g., signed URIs to be used and packfiles for common
> +   requests to be cached.
> + * On the client, resumption of clone. If a clone is interrupted, information
> +   could be recorded in the repository's config and a "clone-resume" command
> +   can resume the clone in progress. (Resumption of subsequent fetches is more
> +   difficult because that must deal with the user wanting to use the repository
> +   even after the fetch was interrupted.)
> +
> +There are some possible features that will require a change in protocol:
> +
> + * Additional HTTP headers (e.g. authentication)
> + * Byte range support
> + * Different file formats referenced by URIs (e.g. raw object)
> +
> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> index 345c00e08c..2cb1c41742 100644
> --- a/Documentation/technical/protocol-v2.txt
> +++ b/Documentation/technical/protocol-v2.txt
> @@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is sent.
>  
>      output = acknowledgements flush-pkt |
>  	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
> -	     [wanted-refs delim-pkt] packfile flush-pkt
> +	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
> +	     packfile flush-pkt
>  
>      acknowledgments = PKT-LINE("acknowledgments" LF)
>  		      (nak | *ack)
> @@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is sent.
>  		  *PKT-LINE(wanted-ref LF)
>      wanted-ref = obj-id SP refname
>  
> +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> +    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
> +
>      packfile = PKT-LINE("packfile" LF)
>  	       *PKT-LINE(%x01-03 *%x00-ff)