Re: ordered string-list considered harmful, was Re: [PATCH v3] Allow aliases that include other aliases

Jeff King <peff@xxxxxxxx> · Thu, 6 Sep 2018 23:24:02 -0400

On Thu, Sep 06, 2018 at 04:50:33PM -0700, Jonathan Nieder wrote:

> Hi,
> 
> Jeff King wrote:
> 
> > But what I think is harmful is a _sorted_ list, because of the
> > "accidentally quadratic" nature, and because it's easy to call its
> > functions on an unsorted list.
> 
> I agree --- in general, it tends to be better to build an unsorted
> string list and then sort it.
> 
> Once I've done so, what is your advice about getting fast lookups
> in the result?  Should I build an auxiliary hashmap as well?  Or
> is this an argument for the 'sorted' flag + BUG approach you
> already mentioned?

I don't see any point in generating a sorted list and _then_ making an
auxiliary hashmap. My idea was that if you're using a sorted string-list
for lookup, then you can replace the whole thing with a hash (inserting
as you go, rather than sorting at the end).

I.e., imagine there are three use cases for string lists:

  1. Build a list via append(), sort at the end, then do a series of
     efficient queries.

  2. Build a list via insert(), which is always sorted. Possibly query
     while building, or after finished building.

  3. Build an unsorted list and iterate over it.

We know that (2) is potentially quadratic during the build-and-sort
step. It would be nice to turn it into (1), but it's not always possible
if we query it while still building. Turning these into a hashmap is an
easy fix with no real downsides.

Case (1) actually isn't a problem for run-time. We'd like to get rid of
it only because the string_list functions are a bit confusing in terms
of which ones expect us to be sorted or not (and if ever forget to sort
before querying, we'd see all kinds of subtle bugs).

Converting these cases to use a hashmap is one way we might get rid of
the confusing string-list functions. And it doesn't carry any big-O
downside, though it may be slower in practice (e.g., hashmaps tend to be
a bit malloc-heavy).

Case (3) is the one we'd like to preserve as the "right" use of
string-list, since it's hard to get wrong (after we remove the sorting
functions).

I think Stefan pointed out a "case 4" in the other part of the thread:
ones where we really care not just about fast lookup, but actual
iteration order.  These ones may need some special care, but I don't
think there are many of them.

So that's _one_ way to make the world better.

Another way is to try to make the functions harder to misuse. E.g.,
maybe putting "sorted" into the name of string_list_has_string(), so
it's on an equal footing with the unsorted variant. Or the sorted flag I
mentioned. Those can help the misuse problem, but they don't help with
the case (2) quadratic ones. It's probably less work, though.

I think I like the hashmap way, if the conversion isn't too painful. As
I said to Stefan in the other part of the thread, I'm mostly at the
"does this seem like a terrible idea" stage. I'd have to start
conversions to see how many of each case we have (and if there are ones
that don't fit into my taxonomy above).

-Peff