Hi Jeff,

On Thu, 18 Apr 2019, Jeff Hostetler wrote:

> On 4/10/2019 1:37 PM, Slavica Djukic via GitGitGadget wrote:
> > From: Slavica Djukic <slawica92@xxxxxxxxxxx>
> >
> > In the `git add -i` command, we show unique prefixes of the commands and
> > files, to give an indication what prefix would select them.
> >
> > Naturally, the C implementation looks a lot different than the Perl
> > implementation: in Perl, a trie is much easier implemented, while we
> > already have a pretty neat hashmap implementation in C that we use for
> > the purpose of storing (not necessarily unique) prefixes.
> >
> > The idea: for each item that we add, we generate prefixes starting with
> > the first letter, then the first two letters, then three, etc, until we
> > find a prefix that is unique (or until the prefix length would be
> > longer than we want). If we encounter a previously-unique prefix on the
> > way, we adjust that item's prefix to make it unique again (or we mark it
> > as having no unique prefix if we failed to find one). These partial
> > prefixes are stored in a hash map (for quick lookup times).
> >
> > To make sure that this function works as expected, we add a test using a
> > special-purpose test helper that was added for that purpose.
> >
> > Note: We expect the list of prefix items to be passed in as a list of
> > pointers rather than as regular list to avoid having to copy information
> > (the actual items will most likely contain more information than just
> > the name and the length of the unique prefix, but passing in `struct
> > prefix_item *` would not allow for that).
> >
> > Signed-off-by: Slavica Djukic <slawica92@xxxxxxxxxxx>
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
> > ---
> > diff --git a/prefix-map.c b/prefix-map.c
> > new file mode 100644
> > index 0000000000..3c5ae4ae0a
> > --- /dev/null
> > +++ b/prefix-map.c
> > @@ -0,0 +1,111 @@
> > +#include "cache.h"
> > +#include "prefix-map.h"
> > +
> > +static int map_cmp(const void *unused_cmp_data,
> > +		   const void *entry,
> > +		   const void *entry_or_key,
> > +		   const void *unused_keydata)
> > +{
> > +	const struct prefix_map_entry *a = entry;
> > +	const struct prefix_map_entry *b = entry_or_key;
> > +
> > +	return a->prefix_length != b->prefix_length ||
> > +		strncmp(a->name, b->name, a->prefix_length);
> > +}
> > +
> > +static void add_prefix_entry(struct hashmap *map, const char *name,
> > +			     size_t prefix_length, struct prefix_item *item)
> > +{
> > +	struct prefix_map_entry *result = xmalloc(sizeof(*result));
> > +	result->name = name;
> > +	result->prefix_length = prefix_length;
> > +	result->item = item;
> > +	hashmap_entry_init(result, memhash(name, prefix_length));
> > +	hashmap_add(map, result);
> > +}
> > +
> > +static void init_prefix_map(struct prefix_map *prefix_map,
> > +			    int min_prefix_length, int max_prefix_length)
> > +{
> > +	hashmap_init(&prefix_map->map, map_cmp, NULL, 0);
> > +	prefix_map->min_length = min_prefix_length;
> > +	prefix_map->max_length = max_prefix_length;
> > +}
> > +
> > +static void add_prefix_item(struct prefix_map *prefix_map,
> > +			    struct prefix_item *item)
> > +{
> > +	struct prefix_map_entry *e = xmalloc(sizeof(*e)), *e2;
> > +	int j;
> > +
> > +	e->item = item;
> > +	e->name = e->item->name;
> > +
> > +	for (j = prefix_map->min_length; j <= prefix_map->max_length; j++) {
> > +		if (!isascii(e->name[j])) {
>
> This feels odd, if I understand the intent.
>
> First, why "isascii()" rather than just non-zero?

That's to imitate `git-add--interactive.perl`'s

	if (ord($letters[0]) > 127 || ($soft_limit && $j + 1 > $soft_limit))

See https://github.com/git/git/blob/v2.21.0/git-add--interactive.perl#L410
for more complete context.

I think the main benefit here is that we avoid running into the trap of
using incomplete UTF-8 multi-byte sequences in prefixes.

I guess we could throw in an extra safety on the C side by excluding
control characters, too. But that would be a deviation from Perl, and I
actually do not even feel strongly about excluding, say, a HT (horizontal
tab) from the prefixes.

> But mainly, can we walk off the end of the array and read
> potentially uninitialized memory?  Shouldn't we have something
> at the top of the function like:
>
>     len = strlen(item->name);
>     if (len < prefix_map->min_length)
>         return;

Ooops, you're right. But I would not use `strlen()` here; we can easily
just add `&& e->name[j]` to the loop condition.

> (And maybe avoid the xmalloc() too?)

Hmm. At first, I thought: no, we use `*e` *both* for the lookup and for
adding a new item once we did not find any existing entry for the current
prefix length.

But it does indeed become a lot clearer when I separate those. It's not
even a performance- or memory-critical code path.

> And maybe do " j <= min(len, max_length) " in the loop?
> But I see you're modifying "j" down in the body of the loop,
> so I'll wait on suggesting that.
>
> > +			free(e);
> > +			break;
> > +		}
> > +
> > +		e->prefix_length = j;
> > +		hashmap_entry_init(e, memhash(e->name, j));
> > +		e2 = hashmap_get(&prefix_map->map, e, NULL);
> > +		if (!e2) {
> > +			/* prefix is unique so far */
> > +			e->item->prefix_length = j;
> > +			hashmap_add(&prefix_map->map, e);
> > +			break;
> > +		}
> > +
> > +		if (!e2->item)
> > +			continue; /* non-unique prefix */
> > +
> > +		if (j != e2->item->prefix_length)
> > +			BUG("unexpected prefix length: %d != %d",
> > +			    (int)j, (int)e2->item->prefix_length);
>
> IIUC, this assurance comes directly from map_cmp(), right?
> We could strengthen this to
>     (j != e2->item->prefix_length || strncmp(...))
> if we wanted to, right?

Right, I'll actually go for `memcmp()` here, but the idea is the same.

> > +
> > +		/* skip common prefix */
> > +		for (; j < prefix_map->max_length && e->name[j]; j++) {
> > +			if (e->item->name[j] != e2->item->name[j])
> > +				break;
>
> Same comment here about walking off of the defined end of both arrays.

Actually, no, not here, as I already test for `e->name[j]` in the loop
condition. If we reach the end of `e2->item->name`, the inner condition
will break out of the loop.

> I'm going to stop here. I'm getting confused.

Oh no ;-)

Thank you for your helpful comments!

Ciao,
Dscho
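
For illustration, the bounds handling discussed above (stop at the NUL
terminator, stop at the first non-ASCII byte, never exceed the maximum
prefix length) could be sketched roughly as follows. This is a standalone
sketch and not code from the patch: `usable_prefix_length()` is a
hypothetical helper, and it scans from index 0 (an assumption made here)
so that names shorter than the minimum prefix length are covered as well.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): returns how many leading
 * bytes of `name` could serve as a prefix. The scan stops at the NUL
 * terminator, at the first byte above 127 (so that no partial UTF-8
 * multi-byte sequence ends up in a prefix), or at `max_length`,
 * whichever comes first.
 */
size_t usable_prefix_length(const char *name, size_t max_length)
{
	size_t j;

	assert(name != NULL);

	/* Testing name[j] in the loop condition keeps us inside the string. */
	for (j = 0; j < max_length && name[j]; j++)
		if ((unsigned char)name[j] > 127)
			break;

	return j;
}
```

A caller could then skip any item for which the returned value is smaller
than the minimum prefix length, without needing a separate `strlen()` call.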