On Thu, Jun 1, 2017 at 5:47 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote: > On Thu, Jun 01, 2017 at 02:17:21PM +0200, Peter Zijlstra wrote: >> On Thu, Jun 01, 2017 at 06:58:20AM -0500, Josh Poimboeuf wrote: >> > > Being able to generate more optimal code in the hottest code paths of the kernel >> > > is the _real_, primary upstream kernel benefit of a different debuginfo method - >> > > which has to be weighed against the pain of introducing a new unwinder. But this >> > > submission does not talk about that aspect at all, which should be fixed I think. >> > >> > Actually I devoted an entire one-sentence paragraph to performance in >> > the documentation: >> > >> > The simpler debuginfo format also enables the unwinder to be relatively >> > fast, which is important for perf and lockdep. >> > >> > But I'll try to highlight that a little more. >> >> That's relative to a DWARF unwinder. > > Yes. > >> It doesn't appear to be possible to get anywhere near a frame-pointer >> unwinder due to having to do this log(n) lookup for every single >> frame. > > Hm, is there something faster, yet not substantially bigger? Hash? > Trie? You have, roughly, a set of (key_start, value) pairs where, for any given key, you want to find the (key_start, value) with the largest key_start that doesn't exceed key. Binary search gives you log_2(n) queries, but its locality of reference sucks. Here are two suggestions for improving it: 1. Change the data layout. Instead of having an array of undwarf entries, have two parallel arrays, one with the ip addresses and one with everything else. This has no effect on the amount of space used, but it makes the part used during search more compact. 2. Your key space is fairly small and your table entries should be reasonably uniformly distributed. Let the first IP you have unwind data for be IP0. Make an array mapping (IP - IP0) / B to the index of the unwind entry for that IP for some suitable block size B. Then, to look up an IP, you'd find the indices of the unwind entries for (IP - IP0) / B and (IP - IP0) / B + 1 and binary search between them. With constant B, this gives you O(1) performance instead of O(log(n)). With B = 1, it's very fast, but the table is huge. With B = 64k or so, maybe you'd get a nice tradeoff of speedup vs size. (With modules, you'd presumably first search an rbtree to find which instance of this data structure you're using and then do the lookup.) 3. Use a B-tree. B-trees are simple if you don't need to deal with insertion and deletion. Presumably you'd choose your internal node size so each internal node is exactly 64 or 128 bytes for good cache performance. This is still O(log(n)) and it uses more comparisons than binary search, but you touch many fewer cache lines. I expect that, if you do #1 and #2, you'd get excellent performance at very little cost to the complexity of the code. -- To unsubscribe from this list: send the line "unsubscribe live-patching" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html