Re: [PATCH v7 1/3] push: add reflog check for "--force-if-includes"

Junio C Hamano <gitster@xxxxxxxxx> · Sat, 26 Sep 2020 16:42:10 -0700

Srinidhi Kaushik <shrinidhi.kaushik@xxxxxxxxx> writes:

> @@ -2252,11 +2263,11 @@ int is_empty_cas(const struct push_cas_option *cas)
>  /*
>   * Look at remote.fetch refspec and see if we have a remote
>   * tracking branch for the refname there.  Fill its current
> - * value in sha1[].
> + * value in sha1[], and as a string.

I think the array being referred to was renamed to oid[] sometime
ago.  "and as a string" makes it sound as if sha1[] gets the value
as 40-hex object name text, but that is not what is being done.

    Fill the name of the remote-tracking branch in *dst_refname,
    and the name of the commit object at tis tip in oid[].

perhaps?

>   * If we cannot do so, return negative to signal an error.
>   */
>  static int remote_tracking(struct remote *remote, const char *refname,
> -			   struct object_id *oid)
> +			   struct object_id *oid, char **dst_refname)
>  {
>  	char *dst;
>  
> @@ -2265,9 +2276,154 @@ static int remote_tracking(struct remote *remote, const char *refname,
>  		return -1; /* no tracking ref for refname at remote */
>  	if (read_ref(dst, oid))
>  		return -1; /* we know what the tracking ref is but we cannot read it */
> +
> +	*dst_refname = dst;
> +	return 0;
> +}
> +
> +/*
> + * The struct "reflog_commit_list" and related helper functions
> + * for list manipulation are used for collecting commits into a
> + * list during reflog traversals in "if_exists_or_grab_until()".

Has the name of that function changed since this comment was
written?

> + */
> +struct reflog_commit_list {
> +	struct commit **items;

Name an array in singular when its primary use is to work on an
element at a time---that will let you say item[4] to call the 4-th
item, instead of items[4] that smells awkward.

An array that is used mostly to pass around a collection as a whole
is easier to think about when given a plural name, though.

> +	size_t nr, alloc;
> +};
> +
> +/* Add a commit to list. */
> +static void add_commit(struct reflog_commit_list *list, struct commit *commit)
> +{
> +	ALLOC_GROW(list->items, list->nr + 1, list->alloc);
> +	list->items[list->nr++] = commit;
> +}
> +
> +/* Free and reset the list. */
> +static void free_reflog_commit_list(struct reflog_commit_list *list)
> +{
> +	FREE_AND_NULL(list->items);
> +	list->nr = list->alloc = 0;
> +}
> +
> +struct check_and_collect_until_cb_data {
> +	struct commit *remote_commit;
> +	struct reflog_commit_list *local_commits;
> +	timestamp_t remote_reflog_timestamp;
> +};
> +
> +/* Get the timestamp of the latest entry. */
> +static int peek_reflog(struct object_id *o_oid, struct object_id *n_oid,
> +		       const char *ident, timestamp_t timestamp,
> +		       int tz, const char *message, void *cb_data)
> +{
> +	timestamp_t *ts = cb_data;
> +	*ts = timestamp;
> +	return 1;
> +}

The idea is to use a callback that immediately says "no more" to
grab the data from the first item in the iteration.  It feels
somewhat awkward but because there is no "give us the Nth entry" API
function, it is the cleanest way we can do this.

> +static int check_and_collect_until(struct object_id *o_oid,
> +				   struct object_id *n_oid,
> +				   const char *ident, timestamp_t timestamp,
> +				   int tz, const char *message, void *cb_data)
> +{
> +	struct commit *commit;
> +	struct check_and_collect_until_cb_data *cb = cb_data;
> +
> +	/*
> +	 * If the reflog entry timestamp is older than the remote ref's
> +	 * latest reflog entry, there is no need to check or collect
> +	 * entries older than this one.
> +	 */
> +	if (timestamp < cb->remote_reflog_timestamp)
> +		return -1;
> +
> +	/* An entry was found. */
> +	if (oideq(n_oid, &cb->remote_commit->object.oid))
> +		return 1;
> +
> +	/* Look-up the commit and append it to the list. */
> +	if ((commit = lookup_commit_reference(the_repository, n_oid)))
> +		add_commit(cb->local_commits, commit);

This is merely a minor naming thing, but if you rename add_commit()
to append_commit(), you probably do not even need the comment before
this statement.

>  	return 0;
>  }
>  
> +#define MERGE_BASES_BATCH_SIZE 8

Hmph.  Do we still need batching?

> +/*
> + * Iterate through the reflog of the local ref to check if there is an entry
> + * for the given remote-tracking ref; runs until the timestamp of an entry is
> + * older than latest timestamp of remote-tracking ref's reflog. Any commits
> + * are that seen along the way are collected into a list to check if the
> + * remote-tracking ref is reachable from any of them.
> + */
> +static int is_reachable_in_reflog(const char *local, const struct ref *remote)
> +{
> +	timestamp_t date;
> +	struct commit *commit;
> +	struct commit **chunk;
> +	struct check_and_collect_until_cb_data cb;
> +	struct reflog_commit_list list = { NULL, 0, 0 };
> +	size_t count = 0, batch_size = 0;
> +	int ret = 0;
> +
> +	commit = lookup_commit_reference(the_repository, &remote->old_oid);
> +	if (!commit)
> +		goto cleanup_return;
> +
> +	/*
> +	 * Get the timestamp from the latest entry
> +	 * of the remote-tracking ref's reflog.
> +	 */
> +	for_each_reflog_ent_reverse(remote->tracking_ref, peek_reflog, &date);
> +
> +	cb.remote_commit = commit;
> +	cb.local_commits = &list;
> +	cb.remote_reflog_timestamp = date;
> +	ret = for_each_reflog_ent_reverse(local, check_and_collect_until, &cb);
> +
> +	/* We found an entry in the reflog. */
> +	if (ret > 0)
> +		goto cleanup_return;

Good.  So '1' from the callback is "we found one, no need to look
further and no need to do merge-base", and '-1' from the callback is
"we looked at all entries that are young enough to matter and we
didn't find exact match".  Makes sense.

> +	/*
> +	 * Check if the remote commit is reachable from any
> +	 * of the commits in the collected list, in batches.
> +	 */

I do not know if batching would help (have you measured it?), but if
we were to batch, it is more common to arrange the loop like this:

	for (chunk = list.items;
             chunk < list.items + list.nr;
	     chunk += size) {
             	size = list.items + list.nr - chunk;
                if (MERGE_BASES_BATCH_SIZE < size)
			size = MERGE_BASES_BATCH_SIZE;
		... use chunk[0..size] ...
		chunk += size;
	}

That is, assume that we can grab everything during this round, and
if that bites off too many, clamp it to the maximum value.  If you
are not comfortable with pointer arithmetic, it is also fine to use
an auxiliary variable 'count', but ...

> +	chunk = list.items;
> +	while (count < list.nr) {
> +		batch_size = MERGE_BASES_BATCH_SIZE;
> +
> +		/* For any leftover entries. */
> +		if ((count + MERGE_BASES_BATCH_SIZE) > list.nr)
> +			batch_size = list.nr - count;
> +
> +		if ((ret = in_merge_bases_many(commit, batch_size, chunk)))
> +			break;
> +
> +		chunk += batch_size;
> +		count += MERGE_BASES_BATCH_SIZE;

... you are risking chunk and count to go out of sync here.

It does not matter within this loop (count will point beyond the end
of list.item[] while chunk will never go past the array), but future
developers can be confused into thinking that they can use chunk and
count interchangeably after this loop exits, and at that point the
discrepancy may start to matter.

But all of the above matters if it is a good idea to batch.  Does it
make a difference?

    ... goes and looks at in_merge_bases_many() ...

Ah, it probably would.  

I thought in_merge_bases_many() would stop early as soon as any of
the traversal from chunk[] reaches commit, but it uses a rather more
generic paint_down_to_common() so extra items in chunk[] that are
topologically older than commit would result in additional traversal
from commit down to them, which would not contribute much to the end
result.  It may be a good #leftovebit idea for future improvement to
teach in_merge_bases_many() to use a custom replacement for
paint_down_to_common() that stops early as soon as we find the
answer is true.

> +	}
> +
> +cleanup_return:
> +	free_reflog_commit_list(&list);
> +	return ret;
> +}
> +

Thanks.