On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index 640144187d3..0e10f066fef 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -14,6 +14,30 @@ SYNOPSIS
>  DESCRIPTION
>  -----------
>  
> +Blobless partial clones are created using `git clone --filter=blob:none`
> +and then configure the local repository such that the Git client avoids
> +downloading blob objects unless they are required for a local operation.
> +This initially means that the clone and later fetches download reachable
> +commits and trees but no blobs. Later operations that change the `HEAD`
> +pointer, such as `git checkout` or `git merge`, may need to download
> +missing blobs in order to complete their operation.

Okay.

> +In the worst cases, commands that compute blob diffs, such as `git blame`,
> +become very slow as they download the missing blobs in single-blob
> +requests to satisfy the missing object as the Git command needs it. This
> +leads to multiple download requests and no ability for the Git server to
> +provide delta compression across those objects.
> +
> +The `git backfill` command provides a way for the user to request that
> +Git downloads the missing blobs (with optional filters) such that the
> +missing blobs representing historical versions of files can be downloaded
> +in batches. The `backfill` command attempts to optimize the request by
> +grouping blobs that appear at the same path, hopefully leading to good
> +delta compression in the packfile sent by the server.

Hm. So we're asking the user to fix a usability issue of git-blame(1),
aren't we? Ideally, git-blame(1) itself should know to transparently
batch the blobs it requires to compute its output, shouldn't it?

That use case alone doesn't yet convince me that git-backfill(1) is a
good idea, as I'd think we should rather fix the underlying issue. So
are there other use cases for git-backfill(1)? I can imagine that it
might be helpful in the context of scripts that know they'll operate
on a bunch of blobs.

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 38e6aaeaa03..e5f2000d5e0 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -1,16 +1,116 @@
>  #include "builtin.h"
> +#include "git-compat-util.h"
>  #include "config.h"
>  #include "parse-options.h"
>  #include "repository.h"
> +#include "commit.h"
> +#include "hex.h"
> +#include "tree.h"
> +#include "tree-walk.h"
>  #include "object.h"
> +#include "object-store-ll.h"
> +#include "oid-array.h"
> +#include "oidset.h"
> +#include "promisor-remote.h"
> +#include "strmap.h"
> +#include "string-list.h"
> +#include "revision.h"
> +#include "trace2.h"
> +#include "progress.h"
> +#include "packfile.h"
> +#include "path-walk.h"
>  
>  static const char * const builtin_backfill_usage[] = {
>  	N_("git backfill [<options>]"),
>  	NULL
>  };
>  
> +struct backfill_context {
> +	struct repository *repo;
> +	struct oid_array current_batch;
> +	size_t batch_size;
> +};
> +
> +static void clear_backfill_context(struct backfill_context *ctx)
> +{
> +	oid_array_clear(&ctx->current_batch);
> +}

Nit: our style guide says that this should rather be
`backfill_context_clear()`.

> +static void download_batch(struct backfill_context *ctx)
> +{
> +	promisor_remote_get_direct(ctx->repo,
> +				   ctx->current_batch.oid,
> +				   ctx->current_batch.nr);
> +	oid_array_clear(&ctx->current_batch);
> +
> +	/*
> +	 * We likely have a new packfile. Add it to the packed list to
> +	 * avoid possible duplicate downloads of the same objects.
> +	 */
> +	reprepare_packed_git(ctx->repo);
> +}
> +
> +static int fill_missing_blobs(const char *path UNUSED,
> +			      struct oid_array *list,
> +			      enum object_type type,
> +			      void *data)
> +{
> +	struct backfill_context *ctx = data;
> +
> +	if (type != OBJ_BLOB)
> +		return 0;
> +
> +	for (size_t i = 0; i < list->nr; i++) {
> +		off_t size = 0;
> +		struct object_info info = OBJECT_INFO_INIT;
> +		info.disk_sizep = &size;
> +		if (oid_object_info_extended(ctx->repo,
> +					     &list->oid[i],
> +					     &info,
> +					     OBJECT_INFO_FOR_PREFETCH) ||
> +		    !size)
> +			oid_array_append(&ctx->current_batch, &list->oid[i]);
> +	}
> +
> +	if (ctx->current_batch.nr >= ctx->batch_size)
> +		download_batch(ctx);

Okay, so the batch size is just "best effort". If we walk a tree that
makes us exceed the batch size, then we wouldn't issue a fetch during
the tree walk. Is there any specific reason for this behaviour? In any
case, as long as this is properly documented, I think this should be
fine in general.

> +	return 0;
> +}
> +
> +static int do_backfill(struct backfill_context *ctx)
> +{
> +	struct rev_info revs;
> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
> +	int ret;
> +
> +	repo_init_revisions(ctx->repo, &revs, "");
> +	handle_revision_arg("HEAD", &revs, 0, 0);
> +
> +	info.blobs = 1;
> +	info.tags = info.commits = info.trees = 0;
> +
> +	info.revs = &revs;
> +	info.path_fn = fill_missing_blobs;
> +	info.path_fn_data = ctx;
> +
> +	ret = walk_objects_by_path(&info);
> +
> +	/* Download the objects that did not fill a batch. */
> +	if (!ret)
> +		download_batch(ctx);
> +
> +	clear_backfill_context(ctx);

Are we leaking `revs` and `info`?
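Something like the following is what I'd have in mind, as a rough
sketch of do_backfill()'s tail. Note that `release_revisions()` exists
today, whereas `path_walk_info_clear()` is a hypothetical name, as I
don't think the new path-walk API exposes a way to release its
resources yet:

	ret = walk_objects_by_path(&info);

	/* Download the objects that did not fill a batch. */
	if (!ret)
		download_batch(ctx);

	clear_backfill_context(ctx);

	/*
	 * Release the revision walk and the path-walk state so that we
	 * don't leak them. `release_revisions()` is existing API, while
	 * `path_walk_info_clear()` would first have to be added to the
	 * path-walk API.
	 */
	release_revisions(&revs);
	path_walk_info_clear(&info);

	return ret;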
> +	return ret;
> +}
> +
>  int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
>  {
> +	struct backfill_context ctx = {
> +		.repo = repo,
> +		.current_batch = OID_ARRAY_INIT,
> +		.batch_size = 50000,
> +	};
>  	struct option options[] = {
>  		OPT_END(),
>  	};
> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  
>  	repo_config(repo, git_default_config, NULL);
>  
> -	die(_("not implemented"));
> -
> -	return 0;
> +	return do_backfill(&ctx);
>  }

The current iteration only backfills blobs, as far as I can see. Do we
maybe want to keep the door open for future changes in git-backfill(1)
by implementing this via a "blob" subcommand?

Patrick
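P.S.: To illustrate the subcommand idea, here is a rough sketch using
our OPT_SUBCOMMAND support. `backfill_blob()` is a hypothetical name,
and the callback signature would need to match whatever
parse_opt_subcommand_fn looks like in your tree:

	static int backfill_blob(int argc, const char **argv,
				 const char *prefix,
				 struct repository *repo)
	{
		struct backfill_context ctx = {
			.repo = repo,
			.current_batch = OID_ARRAY_INIT,
			.batch_size = 50000,
		};
		struct option options[] = {
			OPT_END(),
		};

		argc = parse_options(argc, argv, prefix, options,
				     builtin_backfill_usage, 0);
		repo_config(repo, git_default_config, NULL);

		return do_backfill(&ctx);
	}

	int cmd_backfill(int argc, const char **argv, const char *prefix,
			 struct repository *repo)
	{
		parse_opt_subcommand_fn *fn = NULL;
		struct option options[] = {
			OPT_SUBCOMMAND("blob", &fn, backfill_blob),
			OPT_END(),
		};

		/* Dispatch to the subcommand; parse_options() errors out
		 * when no known subcommand was given. */
		argc = parse_options(argc, argv, prefix, options,
				     builtin_backfill_usage, 0);

		return fn(argc, argv, prefix, repo);
	}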