On Thu, Sep 06, 2018 at 04:50:33PM -0700, Jonathan Nieder wrote: > Hi, > > Jeff King wrote: > > > But what I think is harmful is a _sorted_ list, because of the > > "accidentally quadratic" nature, and because it's easy to call its > > functions on an unsorted list. > > I agree --- in general, it tends to be better to build an unsorted > string list and then sort it. > > Once I've done so, what is your advice about getting fast lookups > in the result? Should I build an auxiliary hashmap as well? Or > is this an argument for the 'sorted' flag + BUG approach you > already mentioned? I don't see any point in generating a sorted list and _then_ making an auxiliary hashmap. My idea was that if you're using a sorted string-list for lookup, then you can replace the whole thing with a hash (inserting as you go, rather than sorting at the end). I.e., imagine there are three use cases for string lists: 1. Build a list via append(), sort at the end, then do a series of efficient queries. 2. Build a list via insert(), which is always sorted. Possibly query while building, or after finished building. 3. Build an unsorted list and iterate over it. We know that (2) is potentially quadratic during the build-and-sort step. It would be nice to turn it into (1), but it's not always possible if we query it while still building. Turning these into a hashmap is an easy fix with no real downsides. Case (1) actually isn't a problem for run-time. We'd like to get rid of it only because the string_list functions are a bit confusing in terms of which ones expect us to be sorted or not (and if ever forget to sort before querying, we'd see all kinds of subtle bugs). Converting these cases to use a hashmap is one way we might get rid of the confusing string-list functions. And it doesn't carry any big-O downside, though it may be slower in practice (e.g., hashmaps tend to be a bit malloc-heavy). Case (3) is the one we'd like to preserve as the "right" use of string-list, since it's hard to get wrong (after we remove the sorting functions). I think Stefan pointed out a "case 4" in the other part of the thread: ones where we really care not just about fast lookup, but actual iteration order. These ones may need some special care, but I don't think there are many of them. So that's _one_ way to make the world better. Another way is to try to make the functions harder to misuse. E.g., maybe putting "sorted" into the name of string_list_has_string(), so it's on an equal footing with the unsorted variant. Or the sorted flag I mentioned. Those can help the misuse problem, but they don't help with the case (2) quadratic ones. It's probably less work, though. I think I like the hashmap way, if the conversion isn't too painful. As I said to Stefan in the other part of the thread, I'm mostly at the "does this seem like a terrible idea" stage. I'd have to start conversions to see how many of each case we have (and if there are ones that don't fit into my taxonomy above). -Peff