On 09/01/2015 04:13 AM, David Turner wrote: > Instead of a linear search over common_list to check whether > a path is common, use a trie. The trie search operates on > path prefixes, and handles excludes. Here I am, coming late to the discussion as usual. Sorry for that. I dug into this code yesterday and got all nerd-tingly and started refactoring and optimizing it. But after I slept on it I realized that I'm a bit hazy on its justification. A trie is a beautiful data structure and all, but have there been any benchmarks showing (1) that this lookup is a bottleneck and (2) that the trie is an improvement on something simpler, like, say, a sorted string_list? Or is there a realistic hope that the trie might be generally useful for other purposes? The latter seems unlikely, at least with the current implementation, because it is very wasteful of space. It allocates one or two (usually two) `struct trie` for every single string that is added to it. Each `struct trie` contains an array of 256 pointers and usually needs two malloc calls to instantiate. So *each entry* stored in the trie costs something like 4 kilobytes on a 64-bit system, plus usually 4 calls to malloc. The large memory footprint, in turn, will cause access non-locality and might impact the lookup performance. So it is pretty clear that the current code would be unusable for a large number of strings. For this particular application, where we only have 19 strings to store, I suppose we could tolerate the use of approximately 64k of RAM to store 174 characters worth of strings *if* it would bring us big time savings. But I think we need some evidence of the time savings. If this lookup is really a bottleneck, I bet there are other alternatives that are just as fast as this trie and use less code, especially given that there are only 19 strings that need checking. With respect to the implementation, it looks correct to me. I would make the following three suggestions: * Please document that the `contents` field of `struct trie` is not NUL-terminated, because that would otherwise be a common assumption. (It is clearly not NUL-terminated because of the way it is initialized using xmalloc and memcpy in make_trie_node() and because of the way its length is adjusted in add_to_trie().) * But in add_to_trie(), you set `contents` using xstrdup(), which *does* NUL-terminate the string. It would be more consistent to set it using xmalloc and memcpy here. * Please use size_t instead of int for indexing into strings, at least in the trie_find() function, where the input is likely to be under the control of users. If we expected to use the trie for other purposes, then I would suggest a raft of other improvements. Ask me if you are interested. Michael -- Michael Haggerty mhagger@xxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html