Stefan Beller <sbeller@xxxxxxxxxx> writes:

> Sometimes users are given a hash of an object and they want to
> identify it further (ex.: Use verify-pack to find the largest blobs,
> but what are these? or [1])
>
> One might be tempted to extend git-describe to also work with blobs,
> such that `git describe <blob-id>` gives a description as
> '<commit-ish>:<path>'.  This was implemented at [2]; as seen by the sheer
> number of responses (>110), it turns out this is tricky to get right.
> The hard part to get right is picking the correct 'commit-ish' as that
> could be the commit that (re-)introduced the blob or the commit that
> removed the blob; the blob could exist in different branches.
>
> Junio hinted at a different approach of solving this problem, which this
> patch implements. Teach the diff machinery another flag for restricting
> the information to what is shown. For example:
>
>   $ ./git log --oneline --blobfind=v2.0.0:Makefile
>   b2feb64309 Revert the whole "ask curl-config" topic for now
>   47fbfded53 i18n: only extract comments marked with "TRANSLATORS:"
>
> we observe that the Makefile as shipped with 2.0 was introduced in
> v1.9.2-471-g47fbfded53 and replaced in v2.0.0-rc1-5-gb2feb64309 by
> a different blob.
>
> [1] https://stackoverflow.com/questions/223678/which-commit-has-this-blob
> [2] https://public-inbox.org/git/20171028004419.10139-1-sbeller@xxxxxxxxxx/
>
> Signed-off-by: Stefan Beller <sbeller@xxxxxxxxxx>
> ---
>
> On playing around with this, trying to find more interesting cases, I observed:
>
>   git log --oneline --blobfind=HEAD:COPYING
>   703601d678 Update COPYING with GPLv2 with new FSF address
>
>   git log --oneline --blobfind=703601d678^:COPYING
>   459b8d22e5 tests: do not borrow from COPYING and README from the real source
>   703601d678 Update COPYING with GPLv2 with new FSF address
>   075b845a85 Add a COPYING notice, making it explicit that the license is GPLv2.
>
> t/diff-lib/COPYING may need an update of the address of the FSF,
> # leftoverbits I guess.

I do not think so.  See tz/fsf-address-update topic for details.

Please do not contaminate the list archive with careless mention of
"hash-mark plus left over bits", as it will make searching the real
good bits harder.

Thanks.

> Another interesting case that I found was
>   git log --oneline --blobfind=v2.14.0:Makefile
>   3921a0b3c3 perf: add test for writing the index
>   36f048c5e4 sha1dc: build git plumbing code more explicitly
>   2118805b92 Makefile: add style build rule
>
> all of which were after v2.14, such that the introduction of that blob doesn't
> show up; I suspect it came in via a merge as unrelated series may have updated
> the Makefile in parallel, though git-log should have told me?

If that is the case, shouldn't we make this new mode imply
--full-history to forbid history simplification?  "git log" is a
tool to find _an_ explanation of the current state, and the usual
history simplification makes tons of sense there, but blobfind is
most likely run in order to find _all_ mentions of the set of blobs
given.

> diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
> index dd0dba5b1d..252a21cc19 100644
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -500,6 +500,10 @@ information.
>  --pickaxe-regex::
>  	Treat the <string> given to `-S` as an extended POSIX regular
>  	expression to match.
> +--blobfind=<blob-id>::
> +	Restrict the output such that one side of the diff
> +	matches the given blob-id.
> +
>  endif::git-format-patch[]

Can we have a blank line between these enumerations to make the
source easier to read?

Thanks.

> diff --git a/diffcore-blobfind.c b/diffcore-blobfind.c
> new file mode 100644
> index 0000000000..5d222fc336
> --- /dev/null
> +++ b/diffcore-blobfind.c
> @@ -0,0 +1,51 @@
> +/*
> + * Copyright (c) 2017 Google Inc.
> + */
> +#include "cache.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +
> +static void diffcore_filter_blobs(struct diff_queue_struct *q,
> +				  struct diff_options *options)
> +{
> +	int i, j = 0, c = q->nr;
> +
> +	if (!options->blobfind)
> +		BUG("blobfind oidset not initialized???");
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +
> +		if (DIFF_PAIR_UNMERGED(p) ||
> +		    (DIFF_FILE_VALID(p->one) &&
> +		     oidset_contains(options->blobfind, &p->one->oid)) ||
> +		    (DIFF_FILE_VALID(p->two) &&
> +		     oidset_contains(options->blobfind, &p->two->oid)))
> +			continue;

So, we keep an unmerged pair, or a pair that mentions a sought-blob
on one side or the other side?  I am not sure if we want to keep the
unmerged pair for the purpose of this one.

> +		diff_free_filepair(p);
> +		q->queue[i] = NULL;
> +		c--;

Also, since you have already introduced another counter 'j' that is
initialized to 0, I think it makes more sense to do the shrinking
in-place.  'i' stays the source-scan index that runs from 0 through
q->nr, while 'j' can be used in this loop (where you have
'continue') to move the current entry, once it is determined to
survive, from q->queue[i] to q->queue[j++].  Then you do not need
'c'; when the loop ends, 'j' is the number of surviving entries and
q->nr can be set to it.

Unlike the usual pattern taken by the other diffcore transformations,
where a new queue is populated and the old one discarded, this would
leave q->queue[] over-allocated, but I do not think it is too bad.
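
The in-place shrinking described above can be sketched as follows.
This is only an illustration under assumed names, not the patch
itself: 'struct pair', 'struct queue', make_pair() and
filter_in_place() are hypothetical stand-ins for the diffcore types
and helpers, and the 'wanted' flag stands in for the "mentions a
sought blob" test.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued filepair. */
struct pair {
	int id;		/* only used to identify entries in this sketch */
	int wanted;	/* non-zero if the pair mentions a sought blob */
};

/* Hypothetical stand-in for the diff queue: an array plus a count. */
struct queue {
	struct pair **queue;
	int nr;
};

/* Allocate one entry; a real caller would handle allocation failure. */
static struct pair *make_pair(int id, int wanted)
{
	struct pair *p = malloc(sizeof(*p));

	assert(p);
	p->id = id;
	p->wanted = wanted;
	return p;
}

/*
 * Keep only the wanted entries, compacting the array in place.
 * 'i' scans every source slot, 'j' is the next slot to fill.
 */
static void filter_in_place(struct queue *q)
{
	int i, j = 0;

	assert(q && q->nr >= 0);
	for (i = 0; i < q->nr; i++) {
		struct pair *p = q->queue[i];

		if (p->wanted) {
			q->queue[j++] = p;	/* survivor moves down */
			continue;
		}
		free(p);			/* entry is dropped */
	}
	/* The array stays over-allocated; only the count shrinks. */
	q->nr = j;
}
```

Because 'j' never runs ahead of 'i', a surviving entry is never
written over a slot that has not been scanned yet, so no second
array and no separate count of removed entries is needed.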