Re: [PATCH] blame-tree: add library and tests via "test-tool blame-tree"

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Wed, 8 Mar 2023 10:30:43 -0500

On 3/7/2023 8:56 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Feb 10 2023, Derrick Stolee wrote:

>> All this is to say, that I'd like to see this API start with the smallest
>> possible surface area and with the simplest implementation, and then I'd
>> be happy to contribute those algorithms within the API boundary while the
>> CLI is handled independently.
> 
> I hear your concern about leaving this open for optimization, and in
> general I'd vehemently agree with it, except for needing to eventually
> feed a command-line to setup_revisions().

The most-correct way to build this, with full optimizations, does not
involve revision.c at all, so this "eventually" is incorrect. It's
only something to do for the "first" implementation, as a reference.

In order to do the single-walk approach for every path simultaneously,
we _must_ have full control of the commit walk. At one point we tried
a single-walk approach that let the revision machinery walk all commits
that changed the base tree and then looked for changes to the contained
paths. However, that algorithm produced _incorrect results_: commits
that would normally be ignored by the simplified history walk for
"<dir>/<entry>" were not ignored by the simplified history walk for
"<dir>/".
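
To make the failure mode concrete, here is a toy model in Python. It is
not Git's code: the commit graph, the helper names and the simplified
TREESAME rule are all made up for illustration, and it only approximates
default history simplification (skip a commit that matches a parent on
the limited paths, and follow only the first such parent).

```python
# Toy model only: commit names, trees and helpers are hypothetical.
# Each commit is (parents, tree), where tree maps path -> content.
COMMITS = {
    "B": ([], {"dir/a": 1, "dir/b": 1}),
    "X": (["B"], {"dir/a": 2, "dir/b": 1}),       # changes dir/a
    "Y": (["B"], {"dir/a": 2, "dir/b": 1}),       # same change, other side
    "Z": (["Y"], {"dir/a": 2, "dir/b": 2}),       # changes dir/b
    "M": (["X", "Z"], {"dir/a": 2, "dir/b": 2}),  # merge of X and Z
}


def restrict(commit, prefix):
    """Return the part of a commit's tree at or below prefix."""
    _, tree = COMMITS[commit]
    return {p: v for p, v in tree.items() if p.startswith(prefix)}


def simplified_walk(start, prefix):
    """Yield commits that survive a simplified walk limited to prefix."""
    queue, seen = [start], set()
    while queue:
        commit = queue.pop(0)
        if commit in seen:
            continue
        seen.add(commit)
        parents, _ = COMMITS[commit]
        treesame = [p for p in parents
                    if restrict(p, prefix) == restrict(commit, prefix)]
        if treesame:
            # Simplified away: follow only the first TREESAME parent.
            queue.append(treesame[0])
            continue
        if parents or restrict(commit, prefix):
            yield commit
        queue.extend(parents)


def blame_per_path(start, paths):
    """Reference answer: one simplified walk per path."""
    return {p: next(simplified_walk(start, p)) for p in paths}


def blame_single_walk(start, prefix):
    """Naive approach: one walk limited to the directory."""
    result = {}
    for commit in simplified_walk(start, prefix):
        parents, _ = COMMITS[commit]
        for path, value in restrict(commit, prefix).items():
            changed = all(COMMITS[p][1].get(path) != value for p in parents)
            if changed and path not in result:
                result[path] = commit
    return result


per_path = blame_per_path("M", ["dir/a", "dir/b"])
single_walk = blame_single_walk("M", "dir/")
```

In this model the per-path walks attribute "dir/a" to X, because M
matches its first parent X on that path and the other side is never
visited. The walk limited to "dir/" instead matches M to Z, follows only
that side, and attributes "dir/a" to Y, a commit the walk for "dir/a"
would have ignored.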

For that reason, doing a single walk that outputs the blame-tree
results for each path must have full control over which commits are
walked and which paths could emit a change for those commits. To keep
that control, we cannot use revision.c as the base.

> Ideally the revision API would make what you're describing easy, but the
> way it's currently implemented (and changing it would be a much larger
> project) someone who'd like to pass structured options in the way you'd
> describe will end up having to re-implement bug-for-bug compatible
> versions of some subset of the option parsing in revision.c.

The subset of option parsing is "a starting revision" and "a base tree"
and _perhaps_ "is the diff recursive or not?" (and this last one isn't
even in revision.c yet). For a subset that small, using revision.c's
parsing does not seem helpful at all.
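
For a sense of scale, the whole parsing surface described here could be
as small as the following Python sketch. The "--recursive" flag name,
the positional order and every identifier are hypothetical; this is not
an existing or proposed Git interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlameTreeOptions:
    """Hypothetical container for the only inputs the walk needs."""
    revision: str    # a starting revision, e.g. "HEAD"
    base_tree: str   # a base tree, e.g. "dir/"
    recursive: bool  # is the diff recursive or not?


def parse_blame_tree_args(argv):
    """Parse [--recursive] <revision> <base-tree>; reject anything else."""
    recursive = False
    positional = []
    for arg in argv:
        if arg == "--recursive":  # hypothetical flag name
            recursive = True
        elif arg.startswith("-"):
            # Unknown options are refused so the walk can rely on
            # exactly the three inputs above.
            raise ValueError(f"unsupported option: {arg}")
        else:
            positional.append(arg)
    if len(positional) != 2:
        raise ValueError("expected <revision> <base-tree>")
    return BlameTreeOptions(positional[0], positional[1], recursive)
```

Refusing every other option is the point of the sketch: the
implementation behind the parser stays free to change its walk without
having to honor arbitrary revision options.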

> Isn't a way to get the best of both worlds to have a small snippet of
> code that inspects the "struct rev_info" before & after
> setup_revisions(), and which would only implement certain optimizations
> if certain known options are provided, but not if any unknown ones are?
> 
> That way those who'd like the faster happy path could use that subset of
> options, while the general API would allow any revision options. We'd
> then error() or BUG() out only if we fail to map our expected paths to
> OIDs.

This option requires examining the long and ever-growing list of options
in struct rev_info, which will take much more work than parsing a starting
ref and a path from the command line.

> I think those are all good ways forward here, and I'd much prefer those
> to having to re-implement or pull out subsets of the current option
> parsing logic in revision.c. What do you think?

I think you are glossing over the difficult part about upstreaming the
blame-tree command, which is the biggest reason we have not done it in
the past. The way it is implemented in our fork started with this "just
parse args using revision.c" because that's the easiest way to write
the naive implementation, but we were able to make optimizations on top
only because we had full control over the callers not using any other
options. We would not have been able to make the assumptions that allowed
those performance enhancements without that control. Actually building the
interface in a way that guarantees the behavior will be stable and
understood is not easy, but is worth doing well.

Thanks,
-Stolee