As described in "Background" below, at least two patch sets have been
posted to support "partial clones" and on-demand blob fetches; in them,
the server part that supports on-demand blob fetches was treated at
least in outline. This proposal treats that server part in detail.
== Background
The desire for Git to support (i) missing blobs and (ii) fetching them
as needed from a remote repository has surfaced on the mailing list a
few times, most recently in the form of RFC patch sets [1] [2].
A local repository that supports (i) will be created by a "partial
clone", that is, a clone with some special parameters (exact parameters
are still being discussed) that does not download all blobs normally
downloaded. Such a repository should support (ii), which is what this
proposal describes.
== Design
A new endpoint "server" is created. The client will send a message in
the following format:
----
fbp-request = PKT-LINE("fetch-blob-pack")
              1*want
              flush-pkt
want        = PKT-LINE("want" SP obj-id)
----
The client sends one or more "want" lines, each naming the SHA-1 of a
blob it wants, then a flush-pkt.
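To make the request grammar concrete, here is a minimal client-side sketch in Python that frames such a request. The function names are hypothetical, and the trailing LF on each line is an assumption borrowed from how existing pkt-line text lines are usually sent; the grammar above does not specify it.

```python
import re

FLUSH_PKT = b"0000"   # a flush-pkt is the special length value 0000
MAX_PKT_LEN = 65520   # largest pkt-line allowed, length prefix included
_SHA1_RE = re.compile(r"[0-9a-f]{40}")


def pkt_line(payload: str) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    data = payload.encode("ascii")
    total = len(data) + 4  # the 4-byte length prefix counts itself
    if total > MAX_PKT_LEN:
        raise ValueError("payload too long for a single pkt-line")
    return f"{total:04x}".encode("ascii") + data


def build_fetch_blob_pack_request(blob_ids) -> bytes:
    """Build an fbp-request: command line, one or more wants, then a flush-pkt."""
    ids = list(blob_ids)
    if not ids:
        raise ValueError("the grammar requires at least one want (1*want)")
    for oid in ids:
        if not _SHA1_RE.fullmatch(oid):
            raise ValueError(f"not a 40-hex-digit SHA-1: {oid!r}")
    # Assumption: each line ends with LF, as upload-pack "want" lines do.
    parts = [pkt_line("fetch-blob-pack\n")]
    parts.extend(pkt_line(f"want {oid}\n") for oid in ids)
    parts.append(FLUSH_PKT)
    return b"".join(parts)
```

With the LF assumption, a single-blob request is 20 + 50 + 4 bytes: the command line, one "want" line, and the flush-pkt.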
The server will then reply:
----
server-reply = flush-pkt | PKT-LINE("ERR" SP message)
----
If there was no error, the server will then send the requested blobs in
a packfile, formatted as described in "Packfile Data" in
pack-protocol.txt with "side-band-64k" enabled.
Any server that supports "partial clone" will also support this, and the
client will automatically assume this. (How a client discovers "partial
clone" is not covered by this proposal.)
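As an illustration of the reply format described above (a flush-pkt or an "ERR" line, followed by a side-band-64k packfile), a client could read the response roughly as follows. This is a sketch, not part of the proposal: the function names are hypothetical, and it assumes the usual side-band conventions (band 1 carries pack data, band 2 progress, band 3 a fatal error) and that the multiplexed stream ends with a flush-pkt.

```python
def read_pkt_line(stream):
    """Read one pkt-line from a binary stream; return None for a flush-pkt."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended inside a pkt-line header")
    length = int(header.decode("ascii"), 16)
    if length == 0:
        return None  # flush-pkt
    if length < 4:
        raise ValueError(f"invalid pkt-line length {length}")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("stream ended inside a pkt-line payload")
    return payload


def read_blob_pack_response(stream):
    """Return (pack_bytes, progress_messages), or raise on a server error."""
    first = read_pkt_line(stream)
    if first is not None:
        # server-reply is either a flush-pkt or PKT-LINE("ERR" SP message).
        if first.startswith(b"ERR "):
            raise RuntimeError(first[4:].decode("utf-8", "replace").rstrip("\n"))
        raise ValueError("unexpected first line in server reply")

    pack = bytearray()
    progress = []
    while True:
        pkt = read_pkt_line(stream)
        if pkt is None:  # assumed end of the side-band stream
            break
        if not pkt:
            raise ValueError("side-band packet without a band byte")
        band, data = pkt[0], pkt[1:]
        if band == 1:
            pack += data           # packfile data
        elif band == 2:
            progress.append(data)  # progress text, normally shown on stderr
        elif band == 3:
            raise RuntimeError(data.decode("utf-8", "replace").rstrip("\n"))
        else:
            raise ValueError(f"unknown side-band channel {band}")
    return bytes(pack), progress
```

The returned pack bytes would then be handed to something like "git index-pack"; that step is outside this sketch.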
The server will perform reachability checks on requested blobs through
the equivalent of "git rev-list --use-bitmap-index" (like "git
upload-pack" when the uploadpack.allowReachableSHA1InWant option is
set), unless configured to suppress reachability checks through a config
option. Server administrators are strongly encouraged to regenerate the
bitmap regularly (or to suppress reachability checks).
=== Endpoint support for forward compatibility
This "server" endpoint requires that the first line be understood, but
will ignore any other lines starting with words that it does not
understand. This allows new "commands" to be added (distinguished by
their first lines) and existing commands to be "upgraded" with backwards
compatibility.
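The forward-compatibility rule can be sketched as a small dispatcher: the first line must name a known command, and later lines whose leading word is unknown are skipped rather than rejected. The names below, including the "hint-commit" line in the usage note, are hypothetical and serve only to illustrate the rule.

```python
# Commands this (hypothetical) server understands, each with the line
# keywords it knows how to handle.
KNOWN_COMMANDS = {
    "fetch-blob-pack": {"want"},
}


def parse_request(lines):
    """Parse decoded pkt-line payloads (flush-pkt excluded) into (command, args).

    The first line must be understood; any later line that starts with an
    unknown word is ignored so that newer clients can send extra lines.
    """
    if not lines:
        raise ValueError("empty request")
    command = lines[0].rstrip("\n")
    if command not in KNOWN_COMMANDS:
        raise ValueError(f"unknown command: {command!r}")

    known_words = KNOWN_COMMANDS[command]
    args = []
    for line in lines[1:]:
        word, _, rest = line.rstrip("\n").partition(" ")
        if word not in known_words:
            continue  # not understood by this server version: skip it
        args.append((word, rest))
    return command, args
```

For example, a request carrying a "want" line plus a made-up "hint-commit" line would yield only the "want" entry on a server that predates the hint, which is what lets existing commands be upgraded later.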
=== Related improvements possible with new endpoint
Previous protocol upgrade suggestions have had to face the difficulty of
allowing updated clients to discover the server support while not
slowing down (for example, through extra network round-trips) any
client, whether non-updated or updated. The introduction of "partial
clone" allows clients to rely on the guarantee that any server that
supports "partial clone" supports "fetch-blob-pack", and we can extend
the guarantee to other protocol upgrades that such repos would want.
One such upgrade is "ref-in-want" [3]. The full details can be obtained
from that email thread, but to summarize, the patch set eliminates the
need for the initial ref advertisement and allows communication in ref
name globs, making it much easier for multiple load-balanced servers to
serve large repos to clients - this is something that would greatly
benefit the Android project, for example, and possibly many others.
Bundling support for "ref-in-want" with "fetch-blob-pack" simplifies
matters for the client in that a client needs to handle only one
"version" of server (a server that supports both). If "ref-in-want" were
added later, instead of now, clients would need to be able to handle two
"versions" (one with only "fetch-blob-pack" and one with both
"fetch-blob-pack" and "ref-in-want").
As for its implementation, that email thread already contains a patch
set that makes it work with the existing "upload-pack" endpoint; I can
update that patch set to use the proposed "server" endpoint (with a
"fetch-commit-pack" message) if need be.
== Client behavior
This proposal is concerned with server behavior only, but it is useful
to envision how the client would use this to ensure that the server
behavior is useful.
=== Indication to use the proposed endpoint
The client will probably already record that at least one of its remotes
(the one that it successfully performed a "partial clone" from) supports
this new endpoint (if not, it can’t determine whether a missing blob was
caused by repo corruption or by the "partial clone"). This record tells
the client that the server supports both "fetch-blob-pack" and
"fetch-commit-pack" (for the latter, the client can fall back to
"fetch-pack"/"upload-pack" when fetching from other servers).
=== Multiple remotes
Fetches of missing blobs should (at least by default?) go to the remote
that sent the tree that points to them. This means that if there are
multiple remotes, the client needs to remember which remote it learned
about a given missing blob from.
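One way to picture this bookkeeping is a map from each missing blob to the remote whose tree first referenced it, consulted when batching fetches. The sketch below is purely illustrative: the class and method names are hypothetical, and a real client would persist this state rather than keep it in memory.

```python
from collections import defaultdict


class MissingBlobOrigins:
    """Remember which remote each missing blob was learned about from."""

    def __init__(self):
        self._remote_for_blob = {}

    def record(self, blob_id, remote):
        # Keep the first remote that sent a tree pointing at this blob.
        self._remote_for_blob.setdefault(blob_id, remote)

    def plan_fetches(self, blob_ids):
        """Group wanted blobs by remote, one batch per remote."""
        plan = defaultdict(list)
        for blob_id in blob_ids:
            remote = self._remote_for_blob.get(blob_id)
            if remote is None:
                # Not a known-missing blob: possibly repo corruption.
                raise KeyError(f"no recorded remote for blob {blob_id}")
            plan[remote].append(blob_id)
        return dict(plan)
```

Each batch in the resulting plan would then become one "fetch-blob-pack" request to the corresponding remote.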
== Alternatives considered
The "fetch-blob-pack" and "fetch-commit-pack" messages could be split
into their own endpoints. It seemed more reasonable to combine them
since they serve similar use cases (large repos), and doing so (for
example) reduces the number of binaries in PATH, but I do not feel
strongly about this.
The client could supply commit information about the blobs it wants (or
other information that could help the reachability analysis). However,
these lines wouldn’t be used by the proposed server design. And if we do
discover that these lines are useful, the protocol could be extended
with new lines that contain this information (since old servers will
ignore all lines that they do not understand).
We could extend "upload-pack" to allow blobs in "want" lines instead of
having a new endpoint. Due to a quirk in the Git implementation (but
possibly not other implementations like JGit), this is already supported
[4]. However, each invocation would require the server to generate an
unnecessary ref list, and would cost both the server and the client
more network traffic.
Also, the new "server" endpoint could be made discoverable through
another mechanism (for example, a capability advertisement on another
endpoint). It is probably simpler to tie it to the "partial clone"
feature, though, since they are so likely to be used together.
[1] <20170304191901.9622-1-markbt@xxxxxxxxxx>
[2] <1488999039-37631-1-git-send-email-git@xxxxxxxxxxxxxxxxx>
[3] <cover.1485381677.git.jonathantanmy@xxxxxxxxxx>
[4] <20170309003547.6930-1-jonathantanmy@xxxxxxxxxx>