[RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Per recent discussion[1] this is my not-quite-feature-complete version
of the bundle-uri capability.  This was sent to the list in some form
beforfe in [2] and [3].

Recently Derrick Stolee has sent an alternate implementation of some
of the same ideas in [4]. Per [1] we're planning to work together on
getting a version of this into git that makes everyone happy, sending
what I've got here is the first step in that.

A high-level summary of the important differences in my approach &
Derrick's (which I hope I'm summarizing fairly here) is that his
approach optionally adds a bundle TOC format, that format allows you
to define topology relationships between bundles to guide a
(returning) client in what it needs to fetch.

Whereas the idea in this series is to lean entirely on the client
downloading bundles & inferring what needs to be done via the
tip/prereqs listed in the header format of the (existing, not changed
here) bundle format.

Both have pros & cons, I started trying to summarize those, but let's
leave that for later.

There's also some high-level "journey" differences in the
two. E.g. Derrick implemented the ability to have "git bundle" itself
fetch bundles remotely, I don't have that and instead lean entirely on
the protocol and "fetch". Those differences really aren't important,
and we can have our cake & eat it too on that front. I.e. end up with
some sensible intersection (or union) of the tooling.

I ran out of time to finish up some of what I had on this topic this
week, but figured (especially since I'd promised to get it done this
week) to send what I have now for discussion.

Things missing & reader's notes:

 * I had some amendmends to the protocol I meant to distill further
   into the protocol docs at [5]. Basically omitting the ability to
   transmit key-values and to have it just be a list of URIs with an
   optional <header> for each one, which is purely a server-to-client
   aid (i.e. those headers will be what you'll find in the pointed-to
   bundles).

* This series goes up to "clone", but I also have incremental fetch
  patches. I ran into an (easily solvable bug) in that & thought it
  was best to omit it for now. It'll be here soon.

  Basically for incremental fetch we'll in parallel get the headers
  for remote bundles, and then do an early abort of those downloads
  that we deem that we don't need.

  Clever (but all standard & boring) use of HTTP caching headers
  between client & servers then allows the client to not request the
  same thing again and again. I.e. want less server load on your CDN?
  Just have the bundles be unique URLs and set appropriate caching
  headers.

* 13/13 demonstrates some of those ideas at a high level.

* A problem with this implementation (and Derrick's, I believe) is
  that it keeps a server waiting while we twiddle our thumbs and wget
  (or equivalent) the bundle(s) locally. If you e.g. clone
  "chromium.git" the server will get tired of waiting, drop the
  connection, and unless the bundle is 100% up-to-date the "clone"
  will fail.

  The solution to this is to get the bundle headers in parallel, and
  as soon as we've got them present the OIDs in the headers as "HAVE"
  to the server, which'll then send us an incremental PACK (and even
  possibly a packfile-uri) for the difference between those bundle(s)
  and what its tips are.

  We can then simply disconnect, download/index-pack everything, and
  do a connectivity checkat the end.

  This requires some careful error handling, e.g. if the resulting
  repo isn't valid we'll need to retry, and the current negotiation
  API isn't very happy to negotiate on the basis of OIDs we don't have
  in the object store (but the protocol handles it just fine).

  So while I have local patches for (some of) that I opted to leave it
  out for now. Once we have incremental fetch it's *mostly* an
  optimization anyway (except in cases of e.g. chromium.git), so for
  the initial protocol etc. discussion it's probably best to leave it
  out.

 * This currently fails CI on Windows & 32 bit Linux for the most
   trivial of reasons, need to adjust printf formats to use PRIuMAX
   etc., but I ran out of time, sorry.

1. https://lore.kernel.org/git/ddebc223-1e13-e758-f9b1-d3f23961e459@xxxxxxxxxx/
2. https://lore.kernel.org/git/patch-3.3-64224ec2cba-20211025T211159Z-avarab@xxxxxxxxx/
3. https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@xxxxxxxxx/
4. https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@xxxxxxxxx/
5. https://lore.kernel.org/git/211027.86ilxixoxz.gmgdl@xxxxxxxxxxxxxxxxxxx/

Ævar Arnfjörð Bjarmason (13):
  protocol v2: add server-side "bundle-uri" skeleton
  bundle-uri docs: add design notes
  bundle-uri client: add "bundle-uri" parsing + tests
  connect.c: refactor sending of agent & object-format
  bundle-uri client: add minimal NOOP client
  bundle-uri client: add "git ls-remote-bundle-uri"
  bundle-uri client: add transfer.injectBundleURI support
  bundle-uri client: add boolean transfer.bundleURI setting
  fetch-pack: add a deref_without_lazy_fetch_extended()
  fetch-pack: move --keep=* option filling to a function
  bundle.h: make "fd" version of read_bundle_header() public
  bundle-uri client: support for bundle-uri with "clone"
  bundle-uri: make the download program configurable

 Documentation/config/transfer.txt          |  33 ++
 Documentation/git-ls-remote-bundle-uri.txt |  62 +++
 Documentation/git-ls-remote.txt            |   1 +
 Documentation/technical/bundle-uri.txt     | 119 ++++++
 Documentation/technical/protocol-v2.txt    | 214 +++++++++++
 Makefile                                   |   3 +
 builtin.h                                  |   1 +
 builtin/clone.c                            |   7 +
 builtin/ls-remote-bundle-uri.c             |  90 +++++
 bundle-uri.c                               | 183 +++++++++
 bundle-uri.h                               |  29 ++
 bundle.c                                   |   8 +-
 bundle.h                                   |   2 +
 command-list.txt                           |   1 +
 connect.c                                  |  80 +++-
 fetch-pack.c                               | 306 ++++++++++++++-
 fetch-pack.h                               |   6 +
 git.c                                      |   1 +
 remote.h                                   |   4 +
 serve.c                                    |   6 +
 t/helper/test-bundle-uri.c                 |  83 ++++
 t/helper/test-tool.c                       |   1 +
 t/helper/test-tool.h                       |   1 +
 t/lib-t5730-protocol-v2-bundle-uri.sh      | 424 +++++++++++++++++++++
 t/t5701-git-serve.sh                       | 124 +++++-
 t/t5730-protocol-v2-bundle-uri-file.sh     |  36 ++
 t/t5731-protocol-v2-bundle-uri-git.sh      |  17 +
 t/t5732-protocol-v2-bundle-uri-http.sh     |  17 +
 t/t5750-bundle-uri-parse.sh                | 153 ++++++++
 transport-helper.c                         |  13 +
 transport-internal.h                       |   7 +
 transport.c                                | 120 ++++++
 transport.h                                |  22 ++
 33 files changed, 2141 insertions(+), 33 deletions(-)
 create mode 100644 Documentation/git-ls-remote-bundle-uri.txt
 create mode 100644 Documentation/technical/bundle-uri.txt
 create mode 100644 builtin/ls-remote-bundle-uri.c
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100644 t/lib-t5730-protocol-v2-bundle-uri.sh
 create mode 100755 t/t5730-protocol-v2-bundle-uri-file.sh
 create mode 100755 t/t5731-protocol-v2-bundle-uri-git.sh
 create mode 100755 t/t5732-protocol-v2-bundle-uri-http.sh
 create mode 100755 t/t5750-bundle-uri-parse.sh

Range-diff against v1:
 2:  3ac0539c053 !  1:  2fc87ce092b protocol v2: specify static seeding of clone/fetch via "bundle-uri"
    @@ Metadata
     Author: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
     
      ## Commit message ##
    -    protocol v2: specify static seeding of clone/fetch via "bundle-uri"
    +    protocol v2: add server-side "bundle-uri" skeleton
     
    -    Add a server-side implementation of a new "bundle-uri" command to
    -    protocol v2. As discussed in the updated "protocol-v2.txt" this will
    -    allow conforming clients to optionally seed their initial clones or
    -    incremental fetches from URLs containing "*.bundle" files created with
    -    "git bundle create".
    +    Add a skeleton server-side implementation of a new "bundle-uri"
    +    command to protocol v2. This will allow conforming clients to
    +    optionally seed their initial clones or incremental fetches from URLs
    +    containing "*.bundle" files created with "git bundle create".
     
         The use-cases are similar to those of the existing "Packfile URIs",
         and the two feature can be combined within a single request, but
         "bundle-uri" has a few advantages over packfile-uris in some some
         common scenarios, discussed below.
     
    -    This change does not give us a working "bundle-uri" client. I have
    -    those patches as a follow-up, but let's first establish what the
    -    protocol for this should be like first. The client implementation will
    -    then implement this specification.
    +    This change does not give us a working "bundle-uri" client, subsequent
    +    commits will do that. Let's first establish what the protocol for this
    +    should be like first. The client implementation will then implement
    +    this specification.
     
         With this change when the uploadpack.bundleURI config is set to a
         URI (or URIs, if set >1 times), advertise a "bundle-uri" command. Then
    @@ Commit message
         .gitmodules check in that context. See [6] for the "ls-refs unborn"
         feature which modified code in similar areas of the request flow.
     
    +    Finally, there's currently a concurrent (submitted after the v1 of
    +    this commit, but before the subsequent client parts of this
    +    implementation) RFC of a somewhat similar "bundle-uri" facility at
    +    [7].
    +
         1. https://lore.kernel.org/git/20111110074330.GA27925@xxxxxxxxxxxxxxxxxxxxx/
         2. https://lore.kernel.org/git/20190514092900.GA11679@xxxxxxxxxxxxxxxxxxxxx/
         3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@xxxxxxxxxxxxxxxxxxxxxxx/
    @@ Commit message
            Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs',
            2021-03-01)
         6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17)
    +    7. https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@xxxxxxxxx/
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
     
    @@ bundle-uri.h (new)
     @@
     +#ifndef BUNDLE_URI_H
     +#define BUNDLE_URI_H
    -+
    -+struct repository;
    -+struct packet_reader;
    -+struct packet_writer;
    ++#include "repository.h"
    ++#include "pkt-line.h"
    ++#include "strbuf.h"
     +
     +/**
     + * API used by serve.[ch].
 -:  ----------- >  2:  84c4036a510 bundle-uri docs: add design notes
 3:  64224ec2cba !  3:  3abfb2290fd bundle-uri client: add "bundle-uri" parsing + tests
    @@ bundle-uri.c: int bundle_uri_command(struct repository *r,
     
      ## bundle-uri.h ##
     @@
    - struct repository;
    - struct packet_reader;
    - struct packet_writer;
    -+struct string_list;
    + #include "repository.h"
    + #include "pkt-line.h"
    + #include "strbuf.h"
    ++#include "string-list.h"
      
      /**
       * API used by serve.[ch].
    -@@ bundle-uri.h: struct packet_writer;
    +@@
      int bundle_uri_advertise(struct repository *r, struct strbuf *value);
      int bundle_uri_command(struct repository *r, struct packet_reader *request);
      
 1:  7639b9bbac5 !  4:  f64aefa9ece leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak
    @@ Metadata
     Author: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
     
      ## Commit message ##
    -    leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak
    +    connect.c: refactor sending of agent & object-format
     
    -    The "t5701-git-serve.sh" test passes when run under a git compiled
    -    with SANITIZE=leak, let's mark it as such to add it to the
    -    "linux-leaks" CI job.
    +    Refactor the sending of the "agent" and "object-format" capabilities
    +    into a function.
    +
    +    This was added in its current form in ab67235bc4 (connect: parse v2
    +    refs with correct hash algorithm, 2020-05-25). When we connect to a v2
    +    server we need to know about its object-format, and it needs to know
    +    about ours. Since most things in connect.c and transport.c piggy-back
    +    on the eager getting of remote refs via the handshake() those commands
    +    can make use of the just-sent-over object-format by ls-refs.
    +
    +    But I'm about to add a command that may come after ls-refs, and may
    +    not, but we need the server to know about our user-agent and
    +    object-format. So let's split this into a function.
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
     
    - ## t/t5701-git-serve.sh ##
    -@@ t/t5701-git-serve.sh: test_description='test protocol v2 server commands'
    - GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
    - export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
    + ## connect.c ##
    +@@ connect.c: void check_stateless_delimiter(int stateless_rpc,
    + 		die("%s", error);
    + }
    + 
    ++static void send_capabilities(int fd_out, struct packet_reader *reader)
    ++{
    ++	const char *hash_name;
    ++
    ++	if (server_supports_v2("agent", 0))
    ++		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
    ++
    ++	if (server_feature_v2("object-format", &hash_name)) {
    ++		int hash_algo = hash_algo_by_name(hash_name);
    ++		if (hash_algo == GIT_HASH_UNKNOWN)
    ++			die(_("unknown object format '%s' specified by server"), hash_name);
    ++		reader->hash_algo = &hash_algos[hash_algo];
    ++		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
    ++	} else {
    ++		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
    ++	}
    ++}
    ++
    + struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
    + 			     struct ref **list, int for_push,
    + 			     struct transport_ls_refs_options *transport_options,
    +@@ connect.c: struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
    + 			     int stateless_rpc)
    + {
    + 	int i;
    +-	const char *hash_name;
    + 	struct strvec *ref_prefixes = transport_options ?
    + 		&transport_options->ref_prefixes : NULL;
    + 	const char **unborn_head_target = transport_options ?
    +@@ connect.c: struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
    + 	if (server_supports_v2("ls-refs", 1))
    + 		packet_write_fmt(fd_out, "command=ls-refs\n");
      
    -+TEST_PASSES_SANITIZE_LEAK=true
    - . ./test-lib.sh
    +-	if (server_supports_v2("agent", 0))
    +-		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
    +-
    +-	if (server_feature_v2("object-format", &hash_name)) {
    +-		int hash_algo = hash_algo_by_name(hash_name);
    +-		if (hash_algo == GIT_HASH_UNKNOWN)
    +-			die(_("unknown object format '%s' specified by server"), hash_name);
    +-		reader->hash_algo = &hash_algos[hash_algo];
    +-		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
    +-	} else {
    +-		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
    +-	}
    ++	/* Send capabilities */
    ++	send_capabilities(fd_out, reader);
      
    - test_expect_success 'test capability advertisement' '
    + 	if (server_options && server_options->nr &&
    + 	    server_supports_v2("server-option", 1))
 -:  ----------- >  5:  105ced66409 bundle-uri client: add minimal NOOP client
 -:  ----------- >  6:  617a6b16df8 bundle-uri client: add "git ls-remote-bundle-uri"
 -:  ----------- >  7:  b0ce379528e bundle-uri client: add transfer.injectBundleURI support
 -:  ----------- >  8:  44d96a0f5f8 bundle-uri client: add boolean transfer.bundleURI setting
 -:  ----------- >  9:  d9f5b486511 fetch-pack: add a deref_without_lazy_fetch_extended()
 -:  ----------- > 10:  31a22eb3bd4 fetch-pack: move --keep=* option filling to a function
 -:  ----------- > 11:  5ade9419454 bundle.h: make "fd" version of read_bundle_header() public
 -:  ----------- > 12:  bb0d681f5f0 bundle-uri client: support for bundle-uri with "clone"
 -:  ----------- > 13:  40f37c8b9d5 bundle-uri: make the download program configurable
-- 
2.35.1.1337.g7e32d794afe




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux