Re: [PATCH v2 08/18] maintenance: add prefetch task

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 23 Jul 2020 13:53:20 -0700

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 9204762e21..0927643247 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +prefetch::
> +	The `fetch` task updates the object directory with the latest objects

s/fetch/prefetch/ most likely.

> +	from all registered remotes. For each remote, a `git fetch` command
> +	is run. The refmap is custom to avoid updating local or remote

s/remote/remote-tracking/ definitely.  Do not forget the hyphen
between the two words.

I think it made the above unnecessarily confusing that you ended a
sentence after "is run".  It gives a wrong impression that you'd be
doing a "real fetch", which you need to dispel with a follow up
description of the refmap.

	For each remote, a `git fetch` command is run with a refspec
	to fetch their branches (those in their `refs/heads`) into
	our `refs/prefetch/<remote>/` hierarchy and without auto
	following tags (the configured refspec in the repository is
	ignored).

> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
> +	not updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are

s/is notified/are notified/???

> +updated on the remote.

Often, when one needs to say "X.  This means Y.", X is a suboptimal
way to explain what needs to be conveyed to the readers.  But this
is not such a case.  Unlike the "This means" that is often an
attempt to rephrasing a poor explanation given first, this gives an
implication.

But let's not start with a negative impression (i.e. even with
prefetch, I still have to do X?  What's the point???), but let them
feel why it is a good thing.  Perhaps (continuing my previous
rewrite):

	This is done to avoid disrupting the remote-tracking
	branches---the end users expect them to stay unmoved unless
	they initiate a fetch.  With prefetch task, however, the
	objects necessary to complete a later real fetch would
	already be obtained, so the real fetch would go faster.  In
	the ideal case, it will just become an update to bunch of
	remote-tracking branches without any object transfer.

or something like that?  

> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);

The command line looks somewhat fishy, but I think it is correct.
At first glance it looks like a mistake to pass "--refmap=" and the
refspec "+refs/heads/*:refs/prefetch/origin/*" as separate arguments,
but I think that is exactly what you want here, i.e.

 - defeat any refspec in the configuration with "--refmap=<empty>"

 - give explicit refspec "+refs/heads/*:...", with "--no-tags" to
   decline auto-following, to tell what exactly are to be fetched
   and stored where.

The description in the log message about refmap needs to be
clarified, though (I've already done so in the above suggested
rewrite).

> +static int maintenance_task_prefetch(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(fill_each_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */

The loop that runs different tasks abort at the first failure,
though.  Perhaps that loop needs to be rethought as well?

> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>  
> +	tasks[num_tasks]->name = "prefetch";
> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;

Two things.

 - As I said upfront, I do not see the point of preparing the table
   with code.

 - The reason why prefetch is placed in front is probably because
   you do not want to repack before you add more objects to the
   object store.  But doesn't that imply that there is an inherent
   ordering that we, as those who are more expert on Git than the
   end users, prefer?  Is it a wise decision to let the users affect
   the order of the tasks run by giving command line options in
   different order in the previous step?