On 2/5/2023 3:47 PM, Ævar Arnfjörð Bjarmason wrote: I've been meaning to get to this all week, but today was my first chance. > The "blame-tree" library allows for finding the most recent > modification to paths in a tree. It does so by expanding the tree at a > given commit, taking note of the current state of each path, and then > walking backwards through history looking for commits where each path > changed into its final sha1. > > The "git blame-tree" command was first noted on the ML 2011[1], and > over the years there have been mentions of it[2][3], and whether it > could be upstreamed. The sources have been available at [4]'s > "jk/blame-tree-wip" and "jk/faster-blame-tree-wip" branches. > > This change attempts to start upstreaming this command, but rather > than adding a command whose interface & semantics may be controversial > starts by exposing & testing the core of its library through a new > test helper. One thing that I think is important for this work and the future we build on top of it is that we need to reconsider every fact about the existing contribution and make sure the new thing is built for the actual needs. In that vein, I think your use of a test-tool is a really good one. It can help us define an API boundary that could then be reflected in a CLI in independent review from improved algorithms underneath the API. > +struct blame_tree_options { > + struct object_id oid; > + const char *prefix; > + unsigned int recursive; > + struct strvec args; > +}; > +#define BLAME_TREE_OPTIONS_INIT(...) { \ > + .args = STRVEC_INIT, \ > + __VA_ARGS__ \ > +} > +void blame_tree_opts_release(struct blame_tree_options *bto); > + > +struct blame_tree { > + struct string_list paths; > + struct rev_info rev; > +}; > +#define BLAME_TREE_INIT { \ > + .paths = STRING_LIST_INIT_DUP, \ > + .rev = REV_INFO_INIT, \ > +} > + > +void blame_tree_init(struct blame_tree *bt, > + const struct blame_tree_options *opts); > +void blame_tree_release(struct blame_tree *); > + > +typedef void (*blame_tree_fn)(const char *path, const struct commit *commit, > + void *data); > +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data); However, this API is too leaky. Specifically, it allows passing arbitrary 'args' instead of structured options. Before I get into what I think needs to change here, let's look at the initial implementation: > +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata) > +{ > + struct blame_tree_callback_data data; > + > + data.paths = &bt->paths; > + data.num_interesting = bt->paths.nr; > + data.callback = cb; > + data.callback_data = cbdata; > + > + bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK; > + bt->rev.diffopt.format_callback = blame_diff; > + bt->rev.diffopt.format_callback_data = &data; > + > + prepare_revision_walk(&bt->rev); > + > + while (data.num_interesting) { > + data.commit = get_revision(&bt->rev); > + if (!data.commit) > + break; > + > + if (data.commit->object.flags & BOUNDARY) { > + diff_tree_oid(the_hash_algo->empty_tree, > + &data.commit->object.oid, > + "", &bt->rev.diffopt); > + diff_flush(&bt->rev.diffopt); > + } else { > + log_tree_commit(&bt->rev, data.commit); > + } > + } > + > + return 0; > +} When I think of 'blame-tree', I think of the following output: For each path (within a pathspec, recursive or not), output the first commit that changed that path according to simplified file history. This history walk begins at a single commit, but may be terminated by a negative commit range, so the output could indicate that we reached the boundary. With that definition, the most-obvious first implementation would be to first generate the path list, then run `git rev-list -n 1 <revisions> -- <path>` for each path in the pathspec. This could be done by N prepare_revision_walk()/get_revision() executions within a loop instead of actually executing subcommands. The implementation in this patch is simultaneously smarter than that basic approach, but also not smart enough for the best performance. In order to get the optimal performance, the following parts of my output definition are critical: 1. We use simplified file history. 2. The history walk begins at a single commit. If either of these conditions are broken, then the current-best algorithm cannot be used (and even more: our proactive caching mechanism cannot be used). The API as currently defined does not allow us to enforce that restriction because the arbitrary arguments could specify `--full-history`, or worse `--simplify-merge` (and also `--first-parent`) to modify the history mode or even `--all` to modify the number of starting commits. The version we use in our custom fork has the historical baggage of this open-ended argument-parsing, but because we have full control of the callers to the CLI, we have enforced that those arguments are not being used. While we are not currently reviewing the CLI of a builtin, the API layer is essentially providing a pass-through of arbitrary revision options. I am happy to see that the 'struct blame_tree_options' includes a callback for the output, as that is valuable for both reporting results to stdout or to collect results into a list for caching purposes, in the future where we have that ability. --- Recommended API --- Using 'struct blame_tree_options' instead of many parameters creates a good future-proof method prototype, so we can always _add_ options when explicitly understood as important to callers of one kind or another. However, to drop the arbitrary 'args' we need to instead make it very explicit which commits are being used in the history walk. struct blame_tree_options { struct object_id oid; const char *prefix; unsigned int recursive; struct commit *start; struct commit_list *uninteresting; }; This definition of the struct should be enough to demonstrate the behavior we are describing. However, for the v1 of the API, we may even want to completely remove the 'uninteresting' choice, and instead focus only on a single starting position ('start'). If we decide that 'uninteresting' is valuable, then it can be added again later, hopefully after the primary use of this command is established. Again, thinking about the lifetime of this command in our infrastructure, the commit range behavior was used at one point as a performance improvement where a cache was given for a specific starting commit B, then we dynamically computed the blame-tree for the range B..A when given a new commit A, and merged the two results together. However, this was not always correct due to complexities around parallel changes. A different caching mechanism was built into the low-level algorithm itself which resolves these complications, but also _that cache cannot be used when given range queries_. Further, I recommend building the simplest implementation first, since it is actually closer to the very fast implementation in that it has two parts: 1. Compute the list of paths at the tip. 2. Perform history walks for those paths. The fast implementation does a single history walk that essentially walks the simplified history for all of the paths simultaneously, but it is critical to have that list of paths at the start of that walk. In that way it's actually easier to adapt the simpler algorithm to the current-best algorithm than it is to adapt the smart algorithm from this patch to the current-best algorithm. All this is to say, that I'd like to see this API start with the smallest possible surface area and with the simplest implementation, and then I'd be happy to contribute those algorithms within the API boundary while the CLI is handled independently. Thanks, -Stolee