Re: [PATCH 1/1] diffcore: add a filter to find a specific blob

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 24 Nov 2017 16:43:49 +0900

Stefan Beller <sbeller@xxxxxxxxxx> writes:

> Sometimes users are given a hash of an object and they want to
> identify it further (ex.: Use verify-pack to find the largest blobs,
> but what are these? or [1])
>
> One might be tempted to extend git-describe to also work with blobs,
> such that `git describe <blob-id>` gives a description as
> '<commit-ish>:<path>'.  This was implemented at [2]; as seen by the sheer
> number of responses (>110), it turns out this is tricky to get right.
> The hard part to get right is picking the correct 'commit-ish' as that
> could be the commit that (re-)introduced the blob or the commit that
> removed the blob; the blob could exist in different branches.
>
> Junio hinted at a different approach of solving this problem, which this
> patch implements. Teach the diff machinery another flag for restricting
> the information to what is shown. For example:
>
>   $ ./git log --oneline --blobfind=v2.0.0:Makefile
>   b2feb64309 Revert the whole "ask curl-config" topic for now
>   47fbfded53 i18n: only extract comments marked with "TRANSLATORS:"
>
> we observe that the Makefile as shipped with 2.0 was introduced in
> v1.9.2-471-g47fbfded53 and replaced in v2.0.0-rc1-5-gb2feb64309 by
> a different blob.
>
> [1] https://stackoverflow.com/questions/223678/which-commit-has-this-blob
> [2] https://public-inbox.org/git/20171028004419.10139-1-sbeller@xxxxxxxxxx/
>
> Signed-off-by: Stefan Beller <sbeller@xxxxxxxxxx>
> ---
>
> On playing around with this, trying to find more interesting cases, I observed:
>
>     git log --oneline --blobfind=HEAD:COPYING
>     703601d678 Update COPYING with GPLv2 with new FSF address
>     
>     git log --oneline --blobfind=703601d678^:COPYING
>     459b8d22e5 tests: do not borrow from COPYING and README from the real source
>     703601d678 Update COPYING with GPLv2 with new FSF address
>     075b845a85 Add a COPYING notice, making it explicit that the license is GPLv2.
>
>     t/diff-lib/COPYING may need an update of the address of the FSF,
>     # leftoverbits I guess.

I do not think so.  See tz/fsf-address-update topic for details.

Please do not contaminate the list archive with careless mention of 
"hash-mark plus left over bits", as it will make searching the real
good bits harder.  Thanks.

> Another interesting case that I found was
>    git log --oneline --blobfind=v2.14.0:Makefile
>    3921a0b3c3 perf: add test for writing the index
>    36f048c5e4 sha1dc: build git plumbing code more explicitly
>    2118805b92 Makefile: add style build rule
>
> all of which were after v2.14, such that the introduction of that blob doesn't
> show up; I suspect it came in via a merge as unrelated series may have updated
> the Makefile in parallel, though git-log should have told me?

If that is the case, shouldn't we make this new mode imply
--full-history to forbid history simplification?  "git log" is a
tool to find _an_ explanation of the current state, and the usual
history simplification makes tons of sense there, but blobfind is
most likely run in order to find _all_ mentions of the set of blobs
given.

> diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
> index dd0dba5b1d..252a21cc19 100644
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -500,6 +500,10 @@ information.
>  --pickaxe-regex::
>  	Treat the <string> given to `-S` as an extended POSIX regular
>  	expression to match.
> +--blobfind=<blob-id>::
> +	Restrict the output such that one side of the diff
> +	matches the given blob-id.
> +
>  endif::git-format-patch[]

Can we have a blank line between these enumerations to make the
source easier to read?  Thanks.

> diff --git a/diffcore-blobfind.c b/diffcore-blobfind.c
> new file mode 100644
> index 0000000000..5d222fc336
> --- /dev/null
> +++ b/diffcore-blobfind.c
> @@ -0,0 +1,51 @@
> +/*
> + * Copyright (c) 2017 Google Inc.
> + */
> +#include "cache.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +
> +static void diffcore_filter_blobs(struct diff_queue_struct *q,
> +				  struct diff_options *options)
> +{
> +	int i, j = 0, c = q->nr;
> +
> +	if (!options->blobfind)
> +		BUG("blobfind oidset not initialized???");
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +
> +		if (DIFF_PAIR_UNMERGED(p) ||
> +		    (DIFF_FILE_VALID(p->one) &&
> +		     oidset_contains(options->blobfind, &p->one->oid)) ||
> +		    (DIFF_FILE_VALID(p->two) &&
> +		     oidset_contains(options->blobfind, &p->two->oid)))
> +			continue;

So, we keep a pair if it is unmerged, or if it mentions a sought
blob on one side or the other?  I am not sure we want to keep
unmerged pairs for the purpose of this filter.
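
To make the alternative concrete, here is a minimal standalone sketch
of the keep/drop test with the unmerged clause removed.  The types
toy_oid, toy_side and toy_pair and the helper pair_mentions() are
hypothetical stand-ins for illustration, not Git's structures, and a
single sought id takes the place of the oidset.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for an object id; not Git's struct object_id. */
struct toy_oid {
	unsigned char hash[20];
};

/* Hypothetical stand-in for one side of a filepair. */
struct toy_side {
	bool valid;		/* models DIFF_FILE_VALID() */
	struct toy_oid oid;
};

/* Hypothetical stand-in for struct diff_filepair. */
struct toy_pair {
	bool unmerged;		/* models DIFF_PAIR_UNMERGED() */
	struct toy_side one, two;
};

/* A side matches only if it is valid and carries the sought id. */
static bool side_matches(const struct toy_side *s, const struct toy_oid *want)
{
	return s->valid && !memcmp(s->oid.hash, want->hash, sizeof(want->hash));
}

/*
 * Keep a pair only when either side is the sought blob.  The
 * 'unmerged' flag is deliberately not consulted here.
 */
static bool pair_mentions(const struct toy_pair *p, const struct toy_oid *want)
{
	return side_matches(&p->one, want) || side_matches(&p->two, want);
}
```

With this variant an unmerged pair survives only if one of its sides
happens to be the sought blob.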

> +		diff_free_filepair(p);
> +		q->queue[i] = NULL;
> +		c--;

Also, since you have already introduced another counter 'j' that is
initialized to 0, I think it makes more sense to shrink the queue
in-place, in this same loop.  'i' will remain the source-scan index
that runs from 0 through q->nr - 1, while 'j' can be used in this
loop (where you have 'continue') to move each entry that is
determined to survive from q->queue[i] to q->queue[j++].

Then you do not need 'c'; when the loop ends, 'j' would be the
number of surviving entries and q->nr can be adjusted to it.  Unlike
the usual pattern taken by the other diffcore transformations where
a new queue is populated and the old one discarded, this would leave
the q->queue[] over-allocated, but I do not think it is too bad.
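
A minimal standalone sketch of the two-index compaction described
above follows.  The names toy_pair, toy_queue, toy_survives() and
toy_filter() are hypothetical stand-ins, not Git's API; free() takes
the place of diff_free_filepair(), and the 'keep' flag stands in for
the blob-match test.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct diff_filepair. */
struct toy_pair {
	int keep;
};

/* Hypothetical stand-in for struct diff_queue_struct. */
struct toy_queue {
	struct toy_pair **queue;
	int nr;
};

/* Hypothetical predicate standing in for the blob-match test. */
static int toy_survives(const struct toy_pair *p)
{
	return p->keep;
}

/*
 * Compact q in place: 'i' scans every entry, 'j' is the next slot
 * for a survivor.  Dropped entries are freed and q->nr becomes the
 * number of survivors.  The array itself stays over-allocated.
 */
static void toy_filter(struct toy_queue *q)
{
	int i, j = 0;

	for (i = 0; i < q->nr; i++) {
		struct toy_pair *p = q->queue[i];

		if (toy_survives(p)) {
			assert(j <= i);	/* never write ahead of the scan */
			q->queue[j++] = p;
			continue;
		}
		free(p);	/* stands in for diff_free_filepair(p) */
	}
	q->nr = j;
}
```

Because 'j' never runs ahead of 'i', each survivor is moved into a slot
that has already been scanned, so no separate counter or second pass
is needed.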