Re: [PATCH v2 7/8] diff: add ability to insert additional headers for paths

Johannes Altmanninger <aclopte@xxxxxxxxx> · Wed, 29 Dec 2021 01:16:47 +0100

On Tue, Dec 28, 2021 at 01:09:57PM -0800, Elijah Newren wrote:
> On Tue, Dec 28, 2021 at 2:57 AM Johannes Altmanninger <aclopte@xxxxxxxxx> wrote:
> >
> > On Sat, Dec 25, 2021 at 07:59:18AM +0000, Elijah Newren via GitGitGadget wrote:
> > > +     for (i = 0; i < q->nr; i++) {
> > > +             struct diff_filepair *p = q->queue[i];
> > > +             char *path = p->one->path ? p->one->path : p->two->path;
> > > +
> > > +             if (strmap_contains(o->additional_path_headers, path))
> > > +                     strset_add(&present, path);
> > > +     }
> > > +
> > > +     /*
> > > +      * Loop over paths in additional_path_headers; for each NOT already
> > > +      * in diff_queued_diff, create a synthetic filepair and insert that
> > > +      * into diff_queued_diff.
> > > +      */
> > > +     strmap_for_each_entry(o->additional_path_headers, &iter, e) {
> > > +             if (!strset_contains(&present, e->key)) {
> > > +                     struct diff_filespec *one, *two;
> > > +                     struct diff_filepair *p;
> > > +
> > > +                     one = alloc_filespec(e->key);
> > > +                     two = alloc_filespec(e->key);
> > > +                     fill_filespec(one, null_oid(), 0, 0);
> > > +                     fill_filespec(two, null_oid(), 0, 0);
> > > +                     p = diff_queue(q, one, two);
> > > +                     p->status = DIFF_STATUS_MODIFIED;
> > > +             }
> > > +     }
> >
> > All these string hash-maps are not really typical for a C program. I'm sure
> > they are the best choice for an advanced merge algorithm
> 
> Agreed up to here.
> 
> > but they are not
> > really necessary for computing/printing a diff.
> 
> Technically agree that it _could_ be solved a different way, but the
> strmaps are a much more natural solution to this problem in this
> particular case; more on this below.

Oh yeah, I agree that strmaps are the more intuitive solution.

> 
> > It feels like this is an
> > implementation detail from merge-ort that's leaking into other components.
> 
> And I disagree here, on _both_ the explicit point and the underlying
> suggestion that you seem to be making that strmap should be avoided
> outside of merging.  The strmap.[ch] type was originally a suggestion
> from Peff for areas of git completely unrelated to merging (see the
> beginning of https://lore.kernel.org/git/20200821194857.GD1165@xxxxxxxxxxxxxxxxxxxxxxx/,
> and the first link in that email).  It's a new datatype for git, much
> like strbuf or string_list or whatever before it, that is there to be
> used when it's a natural fit for the problem at hand.  The lack of
> strmap previously led folks to abuse other existing data structures
> (and in a way that often led to poor performance to boot).

Right, all those rename-detection performance fixes were pretty dazzling

> 
> > What we want to do is
> >
> >         for file_pair in additional_headers:
> >                 if not already_queued(file_pair):
> >                         queue(file_pair)
> 
> Yes, precisely.
> 
> > to do that, you use a temporary has-set ("present") that records everything
> > that's already queued (already_queued() is a lookup in that set).
> >
> > Let's assume both the queue and additional_headers are sorted arrays.
> 
> That's a bad assumption; we can't rely on *either* being sorted.  I

OK, I hadn't checked if the queue is sorted

> actually started my implementation by trying exactly what you mention
> first; I too thought it'd be more natural and clearer to do this.  Of
> course, before implementing it, I had to verify whether
> diff_queued_diff was sorted.  So, I added some code that would check
> the order and fail if the queue wasn't sorted.  7 of the test files in
> the regression testsuite had one or more failing tests.
> 
> I think the queue was intended to be sorted (see
> diffcore_fix_diff_index()), but in practice it's not.  And I'm worried
> that if I find the current cases where it fails to be sorted and "fix"
> them (though I don't actually know if this was intentional or not so I
> don't know if that's really a fix or a break), that I'd end up with
> additional cases in the future where they fail to be sorted anyway.
> So, no matter what, relying on diff_queued_diff being sorted seems
> ill-advised.
> 
> Also...
> 
> > Then we could efficiently merge them (like a merge-sort algorithm)
> > without ever allocating a temporary hash map.
> >
> > I haven't checked if this is practical (better wait for feedback).
> > We'd probably need to convert the strmap additional_path_headers into an
> > array and sort it (I guess our hash map does not guarantee any ordering?)
> 
> Right, strmap has no ordering either.  I was willing to stick those
> into a string_list and sort them, but making temporary copies of both
> the strmap and the diff_queued_diff just to sort them so that I can

But you already sort diff_queued_diff at the end of
create_filepairs_for_header_only_notifications(), so sorting a bit earlier
in that function, before enqueueing the new entries won't change the final
result, and allows us to work with a sorted queue; no need for a temporary
copy (we'd only need to copy the strmap).

> reasonably cheaply ask "are items from this thing present in this
> other thing?" seems to be stretching things a bit too far.
> maps/hashes provide a very nice "is this item present" lookup and are
> a natural way to ask that.  Since that is exactly the question I am
> asking, I think they are the better data structure here.

Yeah that makes sense. In theory if we ask
"What is the union of the queued pairs and the extra pairs induced by
conflict messages?"  we could abstract away the "is this item present"
lookup but in practice that's hard.

> So, this was not at all a leak of merge-ort datastructures, but rather a
> picking of the appropriate data structures for the problem at hand.

I think we have two viable solutions to this problem
1. use a temporary strset to figure out which pairs to add
2. use a temporary array, sort it, and "merge" the two arrays

I agree that 1 is more intuitive and natural for humans, and it's probably the
way to go. But it is a bit less elegant because it adds a strmap entry for
each pair in the queue, whereas 2 only needs to add an array element for
each pair with non-content conflicts, which are much fewer. (Okay that's a
minor detail.)  With the right abstractions 2 is pretty simple as well:

	j = 0
	extra_headers = sorted((key, val) for key, val in additional_headers)
	for i in 0..len(queue):
		while j < len(extra_headers) && compare(extra_headers[j].key, queue[i]) <= 0:
			if compare(extra_headers[j].key, queue[i]) < 0:
				enqueue(file_pair_for(extra_headers[j]))
			j++

where

	def compare(key: str, pair: diff_filepair) -> int:
		other = pair.one ? pair.one.path : pair.two.path # Mimic diffnamecmp
		return strcmp(key, other)