Re: [PATCH v2 08/18] maintenance: add prefetch task

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 24 Jul 2020 10:25:52 -0400

On 7/23/2020 4:53 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
>> index 9204762e21..0927643247 100644
>> --- a/Documentation/git-maintenance.txt
>> +++ b/Documentation/git-maintenance.txt
>> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>>  `commit-graph-chain` file. They will be deleted by a later run based on
>>  the expiration delay.
>>  
>> +prefetch::
>> +	The `fetch` task updates the object directory with the latest objects
> 
> s/fetch/prefetch/ most likely.
> 
>> +	from all registered remotes. For each remote, a `git fetch` command
>> +	is run. The refmap is custom to avoid updating local or remote
> 
> s/remote/remote-tracking/ definitely.  Do not forget the hyphen
> between the two words.
> 
> I think it made the above unnecessarily confusing that you ended a
> sentence after "is run".  It gives a wrong impression that you'd be
> doing a "real fetch", which you need to dispel with a follow up
> description of the refmap.
> 
> 	For each remote, a `git fetch` command is run with a refspec
> 	to fetch their branches (those in their `refs/heads`) into
> 	our `refs/prefetch/<remote>/` hierarchy and without auto
> 	following tags (the configured refspec in the repository is
> 	ignored).
> 
>> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
>> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
>> +	not updated.
>> ++
>> +This means that foreground fetches are still required to update the
>> +remote refs, but the users is notified when the branches and tags are
> 
> s/is notified/are notified/???
> 
>> +updated on the remote.
> 
> Often, when one needs to say "X.  This means Y.", X is a suboptimal
> way to explain what needs to be conveyed to the readers.  But this
> is not such a case.  Unlike the "This means" that is often an
> attempt to rephrasing a poor explanation given first, this gives an
> implication.
> 
> But let's not start with a negative impression (i.e. even with
> prefetch, I still have to do X?  What's the point???), but let them
> feel why it is a good thing.  Perhaps (continuing my previous
> rewrite):
> 
> 	This is done to avoid disrupting the remote-tracking
> 	branches---the end users expect them to stay unmoved unless
> 	they initiate a fetch.  With prefetch task, however, the
> 	objects necessary to complete a later real fetch would
> 	already be obtained, so the real fetch would go faster.  In
> 	the ideal case, it will just become an update to bunch of
> 	remote-tracking branches without any object transfer.
> 
> or something like that?  

I like this clarification and have adapted it with minimal edits.

>> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
>> +			 "--no-tags", "--refmap=", NULL);
>> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
>> +	argv_array_push(&cmd, refmap.buf);
> 
> The command line looks somewhat fishy, but I think it is correct.
> At first glance it looks like a mistake to pass "--refmap=" and the
> refspec "+refs/heads/*:refs/prefetch/origin/*" as separate arguments,
> but I think that is exactly what you want here, i.e.
> 
>  - defeat any refspec in the configuration with "--refmap=<empty>"
> 
>  - give explicit refspec "+refs/heads/*:...", with "--no-tags" to
>    decline auto-following, to tell what exactly are to be fetched
>    and stored where.
> 
> The description in the log message about refmap needs to be
> clarified, though (I've already done so in the above suggested
> rewrite).

I could have made your life easier by referring to b40a50264ac
(fetch: document and test --refmap="", 2020-01-21) in my commit
message. It includes this sentence in Documentation/fetch-options.txt:

  Providing an empty `<refspec>` to the `--refmap` option causes
  Git to ignore the configured refspecs and rely entirely on the
  refspecs supplied as command-line arguments.

>> +static int maintenance_task_prefetch(void)
>> +{
>> +	int result = 0;
>> +	struct string_list_item *item;
>> +	struct string_list remotes = STRING_LIST_INIT_DUP;
>> +
>> +	if (for_each_remote(fill_each_remote, &remotes)) {
>> +		error(_("failed to fill remotes"));
>> +		result = 1;
>> +		goto cleanup;
>> +	}
>> +
>> +	/*
>> +	 * Do not modify the result based on the success of the 'fetch'
>> +	 * operation, as a loss of network could cause 'fetch' to fail
>> +	 * quickly. We do not want that to stop the rest of our
>> +	 * background operations.
>> +	 */
> 
> The loop that runs different tasks abort at the first failure,
> though.  Perhaps that loop needs to be rethought as well?

You're right. These maintenance tasks are intended to be
independent of each other, so let's try all of them and
report a failure after all have been given an opportunity
to run. That makes this failure behavior unnecessary.

>> +	for (item = remotes.items;
>> +	     item && item < remotes.items + remotes.nr;
>> +	     item++)
>> +		fetch_remote(item->string);
>> +
>> +cleanup:
>> +	string_list_clear(&remotes, 0);
>> +	return result;
>> +}
>> +
>>  static int maintenance_task_gc(void)
>>  {
>>  	int result;
>> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>>  
>> +	tasks[num_tasks]->name = "prefetch";
>> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
>> +	num_tasks++;
>> +
>>  	tasks[num_tasks]->name = "gc";
>>  	tasks[num_tasks]->fn = maintenance_task_gc;
>>  	tasks[num_tasks]->enabled = 1;
> 
> Two things.
> 
>  - As I said upfront, I do not see the point of preparing the table
>    with code.
> 
>  - The reason why prefetch is placed in front is probably because
>    you do not want to repack before you add more objects to the
>    object store.  But doesn't that imply that there is an inherent
>    ordering that we, as those who are more expert on Git than the
>    end users, prefer?  Is it a wise decision to let the users affect
>    the order of the tasks run by giving command line options in
>    different order in the previous step?

I don't anticipate users specifying --task=<task> very often, as
it requires deep knowledge of the tasks. If a user _does_ use the
option, then we should trust their order as they might have a
good reason to choose that order.

Generally, my philosophy is to provide expert users with flexible
choices while creating sensible defaults for non-expert users.

That said, if this sort is more of a problem than the value it
provides, then I can drop the sort and the duplicate --task=<task>
check.

Thanks,
-Stolee