Re: [PATCH 1/3] list-objects: add filter_blob to traverse_commit_list

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Thu, 22 Jun 2017 14:45:26 -0700

On Thu, 22 Jun 2017 20:36:13 +0000
Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:

> From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
> 
> In preparation for partial/sparse clone/fetch where the
> server is allowed to omit large/all blobs from the packfile,
> teach traverse_commit_list() to take a blob filter-proc that
> controls when blobs are shown and marked as SEEN.
> 
> Normally, traverse_commit_list() always marks visited blobs
> as SEEN, so that the show_object() callback will never see
> duplicates.  Since a single blob OID may be referenced by
> multiple pathnames, we may not be able to decide if a blob
> should be excluded until later pathnames have been traversed.
> With the filter-proc, the automatic deduping is disabled.

Comparing this approach (this patch and the next one) against mine [1],
I see that this has the advantage of (in pack-objects) avoiding the
invocation of add_preferred_base_object() on blobs that are filtered
out, avoiding polluting the "to_pack" data structure with information
that we do not need.

But it does add an extra place where blobs are filtered out (besides
add_object_entry()), which makes it harder for the reader to keep track
of what's going on. I took a brief look to see if filtering could be
eliminated from add_object_entry(), but that function is called from
many places, so I couldn't tell.

Anyway, I think both approaches will work (this patch's and mine [1]).

[1] https://public-inbox.org/git/6f7934621717141ce3bb6bc05cf1d59c7900ccc5.1496432147.git.jonathantanmy@xxxxxxxxxx/

[snip]

>  static void process_blob(struct rev_info *revs,
>  			 struct blob *blob,
>  			 show_object_fn show,
> +			 filter_blob_fn filter_blob,
>  			 struct strbuf *path,
>  			 const char *name,
>  			 void *cb_data)
> @@ -24,11 +25,28 @@ static void process_blob(struct rev_info *revs,
>  		die("bad blob object");
>  	if (obj->flags & (UNINTERESTING | SEEN))
>  		return;
> -	obj->flags |= SEEN;
>  
>  	pathlen = path->len;
>  	strbuf_addstr(path, name);
> -	show(obj, path->buf, cb_data);
> +	if (!filter_blob) {
> +		/*
> +		 * Normal processing is to imediately dedup blobs
> +		 * during commit traversal, regardless of how many
> +		 * times it appears in a single or multiple commits,
> +		 * so we always set SEEN.
> +		 */
> +		obj->flags |= SEEN;
> +		show(obj, path->buf, cb_data);
> +	} else {
> +		/*
> +		 * Use the filter-proc to decide whether to show
> +		 * the blob.  We only set SEEN if requested.  For
> +		 * example, this could be used to omit a specific
> +		 * blob until it appears with a ".git*" entryname.
> +		 */
> +		if (filter_blob(obj, path->buf, &path->buf[pathlen], cb_data))
> +			obj->flags |= SEEN;
> +	}
>  	strbuf_setlen(path, pathlen);
>  }

After looking at this code again, I wonder if we should explicitly
document that SEEN will be set on the object before show(), and that
show() is allowed to unset SEEN if it wants traversal to process that
object again. That seems to accomplish what we want to accomplish
without this expansion of the API.

(This does mean that the next patch will need to call strrchr() itself,
since show() does not provide the basename of the file name.)

Having said that, if we keep with the current approach, I think there
should be a show() call after the invocation to filter_blob(). That is
much less surprising to me, and also allows us to remove some
show_object() invocations in the next patch.

> +void traverse_commit_list_filtered(
> +	struct rev_info *revs,
> +	show_commit_fn show_commit,
> +	show_object_fn show_object,
> +	filter_blob_fn filter_blob,
> +	void *data)
> +{
>  	int i;
>  	struct commit *commit;
>  	struct strbuf base;

Git style is to indent to the location after the first "(", I believe.
Likewise in the header file below.

>  typedef void (*show_commit_fn)(struct commit *, void *);
>  typedef void (*show_object_fn)(struct object *, const char *, void *);
> +typedef int  (*filter_blob_fn)(struct object *, const char *, const char *, void *);

Can this have parameter names added (especially both the const char *
ones, otherwise indistinguishable) and, if necessary, documented?