Re: [PATCH v2 08/18] maintenance: add prefetch task

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2020-07-23 17:56:30+0000, Derrick Stolee via GitGitGadget <gitgitgadget@xxxxxxxxx> wrote:
> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> The task is called "prefetch" because it is work done in advance
> of a foreground fetch to make that 'git fetch' command much faster.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
> 
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:

Does this job interfere with FETCH_HEAD?
>From my quick test (by applying 01-08 on top of rc1, and messing with t7900),
it looks like yes.

I (and some other people, probably) rely on FETCH_HEAD for our scripts.
Hence, it would be nice to not touch FETCH_HEAD with prefetch job.

Thanks,
-Danh

> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>    the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>    refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/prefetch/<remote> refs that no
>    longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paried with --no-show-forced-udpates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> ---
>  Documentation/git-maintenance.txt | 12 ++++++
>  builtin/gc.c                      | 64 ++++++++++++++++++++++++++++++-
>  t/t7900-maintenance.sh            | 24 ++++++++++++
>  3 files changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 9204762e21..0927643247 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +prefetch::
> +	The `fetch` task updates the object directory with the latest objects
> +	from all registered remotes. For each remote, a `git fetch` command
> +	is run. The refmap is custom to avoid updating local or remote
> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
> +	not updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
> +
>  gc::
>  	Cleanup unnecessary files and optimize the local repository. "GC"
>  	stands for "garbage collection," but this task performs many
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 5d99b4b805..969c127877 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -28,6 +28,7 @@
>  #include "blob.h"
>  #include "tree.h"
>  #include "promisor-remote.h"
> +#include "remote.h"
>  
>  #define FAILED_RUN "failed to run %s"
>  
> @@ -700,7 +701,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	return 0;
>  }
>  
> -#define MAX_NUM_TASKS 2
> +#define MAX_NUM_TASKS 3
>  
>  static const char * const builtin_maintenance_usage[] = {
>  	N_("git maintenance run [<options>]"),
> @@ -781,6 +782,63 @@ static int maintenance_task_commit_graph(void)
>  	return 1;
>  }
>  
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	if (opts.quiet)
> +		argv_array_push(&cmd, "--quiet");
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_each_remote(struct remote *remote, void *cbdata)
> +{
> +	struct string_list *remotes = (struct string_list *)cbdata;
> +
> +	string_list_append(remotes, remote->name);
> +	return 0;
> +}
> +
> +static int maintenance_task_prefetch(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(fill_each_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>  
> +	tasks[num_tasks]->name = "prefetch";
> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index c09a9eb90b..8b04a04c79 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -44,4 +44,28 @@ test_expect_success 'run --task duplicate' '
>  	test_i18ngrep "cannot be selected multiple times" err
>  '
>  
> +test_expect_success 'run --task=prefetch with no remotes' '
> +	git maintenance run --task=prefetch 2>err &&
> +	test_must_be_empty err
> +'
> +
> +test_expect_success 'prefetch multiple remotes' '
> +	git clone . clone1 &&
> +	git clone . clone2 &&
> +	git remote add remote1 "file://$(pwd)/clone1" &&
> +	git remote add remote2 "file://$(pwd)/clone2" &&
> +	git -C clone1 switch -c one &&
> +	git -C clone2 switch -c two &&
> +	test_commit -C clone1 one &&
> +	test_commit -C clone2 two &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch &&
> +	grep ",\"fetch\",\"remote1\"" run-prefetch.txt &&
> +	grep ",\"fetch\",\"remote2\"" run-prefetch.txt &&
> +	test_path_is_missing .git/refs/remotes &&
> +	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
> +	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&
> +	git log prefetch/remote1/one &&
> +	git log prefetch/remote2/two
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

-- 
Danh



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux