So I spent some time the last few days looking at kernel performance on a load that I personally see all the time: "make" on a tree that is pretty much fully built. So we don't pay the actual cost of compilation, just the cost of _checking_ what needs to be compiled.

It turns out that "make" is pretty piggish (what else is new), but this is actually a fairly kernel-intensive load: there's a fair amount of time spent in filename lookup. And we're doing very well in the filename lookup, except now that I can enable SELinux and not lose the advantage of RCU lookup, I see selinux - and particularly AVC - sucking dead donkey fetuses through a straw.

However, the top offender in the top function (avc_has_perm_noaudit()) was actually our generic hlist software prefetching. On both my Atom laptop (which I was compiling kernels on because I was off getting my master diver certification) and on my Core i5-670 workstation, the profiles clearly showed the prefetch instruction to be the top offender by far.

And it's not because it takes all the cache misses. Removing the prefetch actually IMPROVES PERFORMANCE. It turns out that the prefetch apparently stalls the pipeline when it doesn't hit in the L1 dTLB. And that pretty much *always* happens, for a very simple reason: the hlist walk will eventually encounter the final NULL. In fact, since hlists are used for hash tables, it usually encounters that NULL almost immediately.

In fact, from what I can tell, at least in the avc_has_perm_noaudit() case the NULL is the *common* case: the hash chains are usually short, and if a chain has a single entry (which is common), the *only* prefetch we do is for that NULL pointer. I'm not kidding. In hlist_for_each[_entry](), we don't actually prefetch the first entry ("head->first"), but we *will* prefetch the NULL ("pos->next").

That's just incredible crap. It's incredible crap that costs us 0.5% on the simple "make a fully built kernel" benchmark on some of the most common modern hardware out there. Seriously.

Now, Ingo did some more testing, and on his hardware it looks like even if you special-case NULL, prefetching at all still just hurts, because it causes more L1 dcache activity. For a hash chain, if you actually hit the entry you wanted, you do NOT want to prefetch the next one - even if it isn't NULL. So quite apart from the fact that "prefetch(NULL)" is apparently a bad thing on some x86 microarchitectures, it really looks like the prefetching is simply bad - full stop.

So I've got a few choices:

 (a) Just stop using prefetch on x86, because the hardware implementation of the software prefetch is clearly broken on common hardware, and the hardware prefetchers do a better job *without* us messing around. Plus it gets rid of stupid cache pollution.

 (b) Make the x86 prefetch conditional and avoid prefetching NULL. But see above: it improves things, but not as much as not prefetching at all.

 (c) Just stop the insane prefetching in the hlist cases.

And quite frankly, right now I think (c) is the right choice. The fact is, with a hash table it might make some sense to prefetch the first entry - but prefetching every entry *except* the first one, and then the terminating NULL on top of that? That's just crazy.

I'm not going to do (b). Making the prefetch conditional is stupid: the conditional isn't going to predict well (think hash chains again - even when they aren't empty or single-entry, we're talking *short* chains, or you have bigger problems).
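Just to spell out the single-entry case, here's a sketch of what the pre-patch hlist_for_each (quoted from the diff below) actually does on a one-entry hash chain:

	/* Pre-patch iterator, as removed by the diff below: */
	#define hlist_for_each(pos, head) \
		for (pos = (head)->first; pos && ({ prefetch(pos->next); 1; }); \
		     pos = pos->next)

	/*
	 * Walking a one-entry chain: head->first = E, E->next = NULL.
	 *
	 *   1st test: pos = E, non-NULL -> prefetch(E->next) == prefetch(NULL)
	 *   2nd test: pos = NULL        -> loop exits, nothing prefetched
	 *
	 * So the only prefetch ever issued is for NULL, while head->first -
	 * the one pointer we are guaranteed to dereference - is never
	 * prefetched at all.
	 */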
And while I could just tell the x86 people to stop generating the prefetch instruction at all, I do feel that that makes it totally pointless to ever have software prefetch hints in the kernel source code, if the major architecture is known to simply ignore them because they are useless. But before I commit this, I'll give the architecture people a heads-up.

Now, notice that right now I'm *only* talking about removing it for the "hlist" cases (patch attached). I suspect we should do the same thing for all the list helpers. If some particular piece of code actually knows its lists are long and the prefetch is worth it, maybe we could do it explicitly, on a case-by-case basis (see the sketch after the patch for what that might look like). But this "prefetch the next entry in the hlist loop" default just seems like utter crap.

Comments?

Linus
 include/linux/list.h |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 3a54266a1e85..9ac11148e037 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -664,8 +664,7 @@ static inline void hlist_move_list(struct hlist_head *old,
 #define hlist_entry(ptr, type, member) container_of(ptr,type,member)
 
 #define hlist_for_each(pos, head) \
-	for (pos = (head)->first; pos && ({ prefetch(pos->next); 1; }); \
-	     pos = pos->next)
+	for (pos = (head)->first; pos ; pos = pos->next)
 
 #define hlist_for_each_safe(pos, n, head) \
 	for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
@@ -680,7 +679,7 @@ static inline void hlist_move_list(struct hlist_head *old,
  */
 #define hlist_for_each_entry(tpos, pos, head, member)			 \
 	for (pos = (head)->first;					 \
-	     pos && ({ prefetch(pos->next); 1;}) &&			 \
+	     pos &&							 \
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
@@ -692,7 +691,7 @@ static inline void hlist_move_list(struct hlist_head *old,
  */
 #define hlist_for_each_entry_continue(tpos, pos, member)		 \
 	for (pos = (pos)->next;						 \
-	     pos && ({ prefetch(pos->next); 1;}) &&			 \
+	     pos &&							 \
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
@@ -703,7 +702,7 @@ static inline void hlist_move_list(struct hlist_head *old,
  * @member: the name of the hlist_node within the struct.
  */
 #define hlist_for_each_entry_from(tpos, pos, member)			 \
-	for (; pos && ({ prefetch(pos->next); 1;}) &&			 \
+	for (; pos &&							 \
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
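For completeness, here's what the case-by-case opt-in mentioned above might look like once the iterator macros stop prefetching. This is a hypothetical caller - struct my_entry, my_lookup() and the bucket layout are all invented for illustration, not actual kernel code:

	#include <linux/list.h>
	#include <linux/prefetch.h>
	#include <linux/types.h>

	/* Hypothetical entry type hanging off a hash bucket. */
	struct my_entry {
		struct hlist_node node;
		u32 key;
	};

	/*
	 * A caller that somehow knows its chains are long could opt in to
	 * prefetching explicitly, now that the macro no longer does it
	 * behind our back. Note that this still issues the pointless
	 * prefetch(NULL) on the last iteration - which is exactly why it
	 * only makes sense for genuinely long chains, if at all.
	 */
	static struct my_entry *my_lookup(struct hlist_head *bucket, u32 key)
	{
		struct my_entry *e;
		struct hlist_node *pos;

		hlist_for_each_entry(e, pos, bucket, node) {
			prefetch(pos->next);	/* explicit, opt-in prefetch */
			if (e->key == key)
				return e;
		}
		return NULL;
	}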