This is based on v4 of ds/path-walk-1 [1], and an earlier version was
part of my initial path-walk RFC [2].

[1] https://lore.kernel.org/git/pull.1818.v3.git.1733514358.gitgitgadget@xxxxxxxxx/
[2] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@xxxxxxxxx/

This series adds a new 'git backfill' command that uses the path-walk
API to download missing blobs in a blobless partial clone. Users can
control how the command interacts with the sparse-checkout using
'--[no-]sparse', but the '--sparse' option is implied by the existence
of a sparse-checkout.

The reason to use the path-walk API is to make sure that the missing
objects are grouped by a common path, giving a reasonable process for
batching requests while expecting the server to compress the resulting
packfile well.

I first prototyped this feature in June 2024 as an exploration and
created the path-walk algorithm for this purpose. At the time it was
only my intuition that led me to believe that batching by path would
lead to better packfiles. This has since been proven out as a very
important property by recent investigations into compressing full
repositories by doing a better job of grouping objects by path. See the
--name-hash-version series [3] or the 'git pack-objects --path-walk'
series [4] (currently on hold as it conflicts with the
--name-hash-version series).

[3] https://lore.kernel.org/git/pull.1823.v2.git.1733181682.gitgitgadget@xxxxxxxxx/
[4] https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@xxxxxxxxx/

This idea can be demonstrated further by the evidence from testing this
feature: by downloading objects in small batches, the client can force
the server to repack things more efficiently than a full repack.

The example repository I have used in multiple places is the
microsoft/fluentui repo [5], as it has many CHANGELOG.md files that
cause name hash collisions, making the full repack inefficient.

[5] https://github.com/microsoft/fluentui

If we create a blobless clone of the fluentui repo, then this downloads
105 MB across two packfiles (the commits-and-trees pack, followed by
the blobs needed for an initial checkout). Running 'git backfill
--batch-size=<n>' for different values of <n> leads to some interesting
results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) |          2 |    105 MB |        |
| 5K              |         53 |    348 MB | 2m 26s |
| 10K             |         28 |    365 MB | 2m 22s |
| 15K             |         19 |    407 MB | 2m 21s |
| 20K             |         15 |    393 MB | 2m 28s |
| 25K             |         13 |    417 MB | 2m 06s |
| 50K             |          8 |    509 MB | 1m 34s |
| 100K            |          5 |    535 MB | 1m 56s |
| 250K            |          4 |    698 MB | 1m 33s |
| 500K            |          3 |    696 MB | 1m 42s |

The smaller batches cause the server to realize that the existing
deltas cannot be reused, so it finds better deltas. This takes some
extra time for the small batches, but roughly halves the size of the
repo. Even at the 500K batch size, we get less data than the 738 MB of
a full clone.
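To make the workflow concrete, here is its basic shape as a sketch. The
batch size is one value from the table above (spelled with the v2
option name '--min-batch-size'), and the sparse-checkout directory is
only a hypothetical example:

    # Blobless partial clone: commits and trees are downloaded now,
    # plus only the blobs needed for the initial checkout.
    git clone --filter=blob:none https://github.com/microsoft/fluentui
    cd fluentui

    # Download the remaining reachable blobs, requested in
    # path-grouped batches of at least 50K objects each.
    git backfill --min-batch-size=50000

    # Alternatively, in a sparse-checkout, download only the blobs
    # that match the sparse-checkout patterns. (With patch 5 below,
    # '--sparse' is implied when a sparse-checkout exists.)
    git sparse-checkout set packages/react
    git backfill --sparse

Since 'git backfill' requests only objects that are missing from the
local object store, re-running it after an interruption should skip the
blobs that earlier batches already downloaded; this is the basis for
the incremental/resumable clone use case mentioned in the v2 updates
below.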
Implementing the --sparse feature is best done by augmenting the
path-walk API to be aware of a pattern list. This works for both cone
and non-cone mode sparse-checkouts.

There are future directions we could take this command, especially to
run the command with a user-specified pathspec. The tricky case for
that additional feature is trying to make the path-walk more efficient
by skipping tree paths that would not lead to a match of the pathspec.
It would likely need optimization in a small subset of pathspec
features (such as prefix matches) to work as efficiently as possible.
I did prototype a version that puts the pathspec match in the callback
function within builtin/backfill.c, but I found that uninspiring and
unnecessary for now.


Updates in v2
=============

Thanks for the review on v1.

 * Memory leaks should be plugged now due to the introduction of a
   destructor in v4 of the path-walk API and its extension here to
   clear the pattern list.

 * Documentation is expanded to demonstrate that 'git backfill' can
   also approximate incremental/resumable clones of repositories.

 * Method renames to better match style.

 * --batch-size is renamed to --min-batch-size for clarity and future
   extensibility.

Thanks,
-Stolee

Derrick Stolee (5):
  backfill: add builtin boilerplate
  backfill: basic functionality and tests
  backfill: add --min-batch-size=<n> option
  backfill: add --sparse option
  backfill: assume --sparse when sparse-checkout is enabled

 .gitignore                                |   1 +
 Documentation/git-backfill.txt            |  68 +++++++++
 Documentation/technical/api-path-walk.txt |  11 +-
 Makefile                                  |   1 +
 builtin.h                                 |   1 +
 builtin/backfill.c                        | 151 ++++++++++++++++++
 command-list.txt                          |   1 +
 dir.c                                     |  10 +-
 dir.h                                     |   3 +
 git.c                                     |   1 +
 path-walk.c                               |  28 +++-
 path-walk.h                               |  11 ++
 t/helper/test-path-walk.c                 |  22 ++-
 t/t5620-backfill.sh                       | 178 ++++++++++++++++++++++
 t/t6601-path-walk.sh                      |  32 ++++
 15 files changed, 505 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c
 create mode 100755 t/t5620-backfill.sh


base-commit: ef543429ed9cb92e209900e7c7fc4f8e0698a306
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1820%2Fderrickstolee%2Fbackfill-upstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1820/derrickstolee/backfill-upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1820

Range-diff vs v1:

 1:  0300aa1b8c3 = 1:  ac86510417a backfill: add builtin boilerplate
 2:  5728dd27021 ! 2:  e4e88794cae backfill: basic functionality and tests
    @@ Documentation/git-backfill.txt: SYNOPSIS
     +grouping blobs that appear at the same path, hopefully leading to good
     +delta compression in the packfile sent by the server.
     +
    ++In this way, `git backfill` provides a mechanism to break a large clone
    ++into smaller chunks. Starting with a blobless partial clone with `git
    ++clone --filter=blob:none` and then running `git backfill` in the local
    ++repository provides a way to download all reachable objects in several
    ++smaller network calls than downloading the entire repository at clone
    ++time.
    ++
     +By default, `git backfill` downloads all blobs reachable from the `HEAD`
     +commit. This set can be restricted or expanded using various options.
     +
    @@ builtin/backfill.c
     +	size_t batch_size;
     +};
     +
    -+static void clear_backfill_context(struct backfill_context *ctx)
    ++static void backfill_context_clear(struct backfill_context *ctx)
     +{
     +	oid_array_clear(&ctx->current_batch);
     +}
    @@ builtin/backfill.c
     +	if (!ret)
     +		download_batch(ctx);
     +
    -+	clear_backfill_context(ctx);
    ++	backfill_context_clear(ctx);
    ++	path_walk_info_clear(&info);
    ++	release_revisions(&revs);
     +	return ret;
     +}
     +
 3:  3cfd23073a0 ! 3:  3fa32822dab backfill: add --batch-size=<n> option
    @@ Metadata
     Author: Derrick Stolee <derrickstolee@xxxxxxxxxx>

     ## Commit message ##
    -    backfill: add --batch-size=<n> option
    +    backfill: add --min-batch-size=<n> option

         Users may want to specify a minimum batch size for their needs.
         This is only a minimum: the path-walk API provides a list of OIDs
         that correspond to the

    @@ Documentation/git-backfill.txt: git-backfill - Download missing objects in a par
      DESCRIPTION
      -----------
    -@@ Documentation/git-backfill.txt: delta compression in the packfile sent by the server.
    +@@ Documentation/git-backfill.txt: time.
      By default, `git backfill` downloads all blobs reachable from the `HEAD`
      commit. This set can be restricted or expanded using various options.

     +OPTIONS
     +-------
     +
    -+--batch-size=<n>::
    ++--min-batch-size=<n>::
     +	Specify a minimum size for a batch of missing objects to request
     +	from the server. This size may be exceeded by the last set of
    -+	blobs seen at a given path. Default batch size is 16,000.
    ++	blobs seen at a given path. The default minimum batch size is
    ++	50,000.
     +
      SEE ALSO
      --------

    @@ builtin/backfill.c
      	NULL
      };

    + struct backfill_context {
    + 	struct repository *repo;
    + 	struct oid_array current_batch;
    +-	size_t batch_size;
    ++	size_t min_batch_size;
    + };
    + 
    + static void backfill_context_clear(struct backfill_context *ctx)
    +@@ builtin/backfill.c: static int fill_missing_blobs(const char *path UNUSED,
    + 		oid_array_append(&ctx->current_batch, &list->oid[i]);
    + 	}
    + 
    +-	if (ctx->current_batch.nr >= ctx->batch_size)
    ++	if (ctx->current_batch.nr >= ctx->min_batch_size)
    + 		download_batch(ctx);
    + 
    + 	return 0;

    @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
    - 		.batch_size = 50000,
    + 	struct backfill_context ctx = {
    + 		.repo = repo,
    + 		.current_batch = OID_ARRAY_INIT,
    +-		.batch_size = 50000,
    ++		.min_batch_size = 50000,
      	};
      	struct option options[] = {
    -+	OPT_INTEGER(0, "batch-size", &ctx.batch_size,
    -+		    N_("Minimun number of objects to request at a time")),
    ++	OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
    ++		    N_("Minimum number of objects to request at a time")),
      	OPT_END(),
      	};

    @@ t/t5620-backfill.sh: test_expect_success 'do partial clone 1, backfill gets all
      	test_line_count = 0 revs2
      '

    -+test_expect_success 'do partial clone 2, backfill batch size' '
    ++test_expect_success 'do partial clone 2, backfill min batch size' '
     +	git clone --no-checkout --filter=blob:none \
     +		--single-branch --branch=main \
     +		"file://$(pwd)/srv.bare" backfill2 &&
     +
     +	GIT_TRACE2_EVENT="$(pwd)/batch-trace" git \
    -+		-C backfill2 backfill --batch-size=20 &&
    ++		-C backfill2 backfill --min-batch-size=20 &&
     +
     +	# Batches were used
     +	test_trace2_data promisor fetch_count 20 <batch-trace >matches &&
 4:  19a8efebbad ! 4:  2723143afb3 backfill: add --sparse option
    @@ Commit message
         only does this reduce the cost of clones and fetches, the sparse-checkout
         reduces the number of objects needed to download from a promisor remote.

    -    However, history investigations can be expensie as computing blob diffs will
    -    trigger promisor remote requests for one object at a time. This can be
    +    However, history investigations can be expensive as computing blob diffs
    +    will trigger promisor remote requests for one object at a time. This can be
         avoided by downloading the blobs needed for the given sparse-checkout using
         'git backfill' and its new '--sparse' mode, at a time that the user is
         willing to pay that extra cost.

    @@ Documentation/git-backfill.txt: git-backfill - Download missing objects in a par
      DESCRIPTION
      -----------
    @@ Documentation/git-backfill.txt: OPTIONS
    - 	from the server. This size may be exceeded by the last set of
    - 	blobs seen at a given path. Default batch size is 16,000.
    + 	blobs seen at a given path. The default minimum batch size is
    + 	50,000.
     +--[no-]sparse::
     +	Only download objects if they appear at a path that matches the

    @@ builtin/backfill.c
    @@ builtin/backfill.c: struct backfill_context {
      	struct repository *repo;
      	struct oid_array current_batch;
    - 	size_t batch_size;
    + 	size_t min_batch_size;
     +	int sparse;
      };

    - static void clear_backfill_context(struct backfill_context *ctx)
    + static void backfill_context_clear(struct backfill_context *ctx)
    @@ builtin/backfill.c: static int do_backfill(struct backfill_context *ctx)
      	struct path_walk_info info = PATH_WALK_INFO_INIT;
      	int ret;

     +	if (ctx->sparse) {
     +		CALLOC_ARRAY(info.pl, 1);
    -+		if (get_sparse_checkout_patterns(info.pl))
    ++		if (get_sparse_checkout_patterns(info.pl)) {
    ++			path_walk_info_clear(&info);
     +			return error(_("problem loading sparse-checkout"));
    ++		}
     +	}
     +
      	repo_init_revisions(ctx->repo, &revs, "");
    @@ builtin/backfill.c: static int do_backfill(struct backfill_context *ctx)
    @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
      		.repo = repo,
      		.current_batch = OID_ARRAY_INIT,
    - 		.batch_size = 50000,
    + 		.min_batch_size = 50000,
     +		.sparse = 0,
      	};
      	struct option options[] = {
    - 	OPT_INTEGER(0, "batch-size", &ctx.batch_size,
    - 		    N_("Minimun number of objects to request at a time")),
    + 	OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
    + 		    N_("Minimum number of objects to request at a time")),
     +	OPT_BOOL(0, "sparse", &ctx.sparse,
     +		 N_("Restrict the missing objects to the current sparse-checkout")),
      	OPT_END(),

    @@ path-walk.c
     #include "revision.h"
     #include "string-list.h"
     #include "strmap.h"
    -@@ path-walk.c: static int add_children(struct path_walk_context *ctx,
    +@@ path-walk.c: static int add_tree_entries(struct path_walk_context *ctx,
      	if (type == OBJ_TREE)
      		strbuf_addch(&path, '/');

    @@ path-walk.c: static int add_children(struct path_walk_context *ctx,
      	if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
      		CALLOC_ARRAY(list, 1);
      		list->type = type;
    +@@ path-walk.c: void path_walk_info_init(struct path_walk_info *info)
    + 	memcpy(info, &empty, sizeof(empty));
    + }
    + 
    +-void path_walk_info_clear(struct path_walk_info *info UNUSED)
    ++void path_walk_info_clear(struct path_walk_info *info)
    + {
    +-	/*
    +-	 * This destructor is empty for now, as info->revs
    +-	 * is not owned by 'struct path_walk_info'.
    +-	 */
    ++	if (info->pl) {
    ++		clear_pattern_list(info->pl);
    ++		free(info->pl);
    ++	}
    + }

     ## path-walk.h ##
    @@
    @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
     }

     ## t/t5620-backfill.sh ##
    -@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill batch size' '
    +@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch size' '
      	test_line_count = 0 revs2
      '

 5:  35a7e4ec4d3 ! 5:  1f765409eaf backfill: assume --sparse when sparse-checkout is enabled
    @@ builtin/backfill.c: int cmd_backfill(int argc, const char **argv, const char *pr
      }

     ## t/t5620-backfill.sh ##
    -@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill batch size' '
    +@@ t/t5620-backfill.sh: test_expect_success 'do partial clone 2, backfill min batch size' '
      	test_line_count = 0 revs2
      '

-- 
gitgitgadget