Re: [PATCH v5 2/3] path: optimize common dir checking

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 09/01/2015 04:13 AM, David Turner wrote:
> Instead of a linear search over common_list to check whether
> a path is common, use a trie.  The trie search operates on
> path prefixes, and handles excludes.

Here I am, coming late to the discussion as usual. Sorry for that.

I dug into this code yesterday and got all nerd-tingly and started
refactoring and optimizing it. But after I slept on it I realized that
I'm a bit hazy on its justification. A trie is a beautiful data
structure and all, but have there been any benchmarks showing (1) that
this lookup is a bottleneck and (2) that the trie is an improvement on
something simpler, like, say, a sorted string_list? Or is there a
realistic hope that the trie might be generally useful for other purposes?

The latter seems unlikely, at least with the current implementation,
because it is very wasteful of space. It allocates one or two (usually
two) `struct trie` for every single string that is added to it. Each
`struct trie` contains an array of 256 pointers and usually needs two
malloc calls to instantiate. So *each entry* stored in the trie costs
something like 4 kilobytes on a 64-bit system, plus usually 4 calls to
malloc. The large memory footprint, in turn, will cause access
non-locality and might impact the lookup performance. So it is pretty
clear that the current code would be unusable for a large number of strings.

For this particular application, where we only have 19 strings to store,
I suppose we could tolerate the use of approximately 64k of RAM to store
174 characters worth of strings *if* it would bring us big time savings.
But I think we need some evidence of the time savings.

If this lookup is really a bottleneck, I bet there are other
alternatives that are just as fast as this trie and use less code,
especially given that there are only 19 strings that need checking.

With respect to the implementation, it looks correct to me. I would make
the following three suggestions:

* Please document that the `contents` field of `struct trie` is not
NUL-terminated, because that would otherwise be a common assumption. (It
is clearly not NUL-terminated because of the way it is initialized using
xmalloc and memcpy in make_trie_node() and because of the way its length
is adjusted in add_to_trie().)
* But in add_to_trie(), you set `contents` using xstrdup(), which *does*
NUL-terminate the string. It would be more consistent to set it using
xmalloc and memcpy here.
* Please use size_t instead of int for indexing into strings, at least
in the trie_find() function, where the input is likely to be under the
control of users.

If we expected to use the trie for other purposes, then I would suggest
a raft of other improvements. Ask me if you are interested.

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]