On 02/28/2019 07:09 PM, H.J. Lu wrote:
> On Thu, Feb 28, 2019 at 9:58 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>> On 02/28/2019 05:25 PM, H.J. Lu wrote:
>>> On Thu, Feb 28, 2019 at 8:18 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>>> On 02/28/2019 01:53 PM, H.J. Lu wrote:
>>>>> On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>>>>>> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
>>>>>>> Commit-ID: ce02ef06fcf7a399a6276adb83f37373d10cbbe1
>>>>>>> Gitweb: https://git.kernel.org/tip/ce02ef06fcf7a399a6276adb83f37373d10cbbe1
>>>>>>> Author: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
>>>>>>> AuthorDate: Thu, 21 Feb 2019 23:19:41 +0100
>>>>>>> Committer: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>>>>>> CommitDate: Thu, 28 Feb 2019 12:10:31 +0100
>>>>>>>
>>>>>>> x86, retpolines: Raise limit for generating indirect calls from switch-case
>>>>>>>
>>>>>>> From networking side, there are numerous attempts to get rid of indirect
>>>>>>> calls in fast-path wherever feasible in order to avoid the cost of
>>>>>>> retpolines, for example, just to name a few:
>>>>>>>
>>>>>>> * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
>>>>>>> * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
>>>>>>> * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
>>>>>>> * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
>>>>>>> * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
>>>>>>> * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
>>>>>>> [...]
>>>>>>>
>>>>>>> Recent work on XDP from Björn and Magnus additionally found that manually
>>>>>>> transforming the XDP return code switch statement with more than 5 cases
>>>>>>> into if-else combination would result in a considerable speedup in XDP
>>>>>>> layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
>>>>>>> builds.
>>>>>>
>>>>>> +HJL
>>>>>>
>>>>>> This is a GCC bug, surely? It should know how expensive each
>>>>>> instruction is, and choose which to use accordingly. That should be
>>>>>> true even when the indirect branch "instruction" is a retpoline, and
>>>>>> thus enormously expensive.
>>>>>>
>>>>>> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
>>>>>> please at least reference that bug, and be prepared to turn this hack
>>>>>> off when GCC is fixed.
>>>>>
>>>>> We couldn't find a testcase to show jump table with indirect branch
>>>>> is slower than direct branches.
>>>>
>>>> Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
>>>> with the below on top.
>>>>
>>>>  Makefile | 6 +++---
>>>>  switch.c | 2 +-
>>>>  test.c   | 6 ++++--
>>>>  3 files changed, 8 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/Makefile b/Makefile
>>>> index bd83233..ea81520 100644
>>>> --- a/Makefile
>>>> +++ b/Makefile
>>>> @@ -1,16 +1,16 @@
>>>>  CC=gcc
>>>>  CFLAGS=-g -I.
>>>> -CFLAGS+=-O2 -mindirect-branch=thunk
>>>> +CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
>>>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Does slowdown show up only with -mindirect-branch=thunk-inline?
>>
>> Not really, numbers are in similar range / outcome. Additionally, I also tried
>> on a bit bigger machine (Xeon Gold 5120 this time). First is thunk-inline, second
>> is thunk, and third is w/o raising limit for comparison; first test (from last
>> mail) on that machine:
>
> Please re-open:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952
>
> with new info.

Yeah will do, thanks!
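
For readers following along, here is a minimal sketch of the kind of manual
transformation the quoted commit message describes: a switch with more than 5
cases rewritten as an if-else chain. It is not the XDP code and not the
microbenchmark from the repository above; the enum, the function names and the
return values are all made up for illustration. Whether GCC actually lowers
handle_switch() to a jump table (and hence to an indirect branch, or a
retpoline thunk under -mindirect-branch=thunk) depends on compiler version and
options, so please check the generated assembly rather than taking it on faith.

```c
#include <assert.h>

/* Hypothetical action codes, only loosely modelled on XDP return codes. */
enum action {
	ACT_ABORTED,
	ACT_DROP,
	ACT_PASS,
	ACT_TX,
	ACT_REDIRECT,
	ACT_OTHER,
	ACT_MAX
};

/*
 * Dense switch with six cases. A compiler may choose to lower this to a
 * jump table, i.e. a single indirect jump through a table of addresses.
 */
int handle_switch(enum action a)
{
	switch (a) {
	case ACT_ABORTED:  return 10;
	case ACT_DROP:     return 20;
	case ACT_PASS:     return 30;
	case ACT_TX:       return 40;
	case ACT_REDIRECT: return 50;
	case ACT_OTHER:    return 60;
	default:           return -1;
	}
}

/*
 * Same mapping written as an if-else chain. This form is normally compiled
 * to compare-and-branch sequences, i.e. direct branches only.
 */
int handle_if_else(enum action a)
{
	if (a == ACT_ABORTED)
		return 10;
	else if (a == ACT_DROP)
		return 20;
	else if (a == ACT_PASS)
		return 30;
	else if (a == ACT_TX)
		return 40;
	else if (a == ACT_REDIRECT)
		return 50;
	else if (a == ACT_OTHER)
		return 60;
	return -1;
}

/* Sanity check: both variants must agree for every input, including ACT_MAX. */
void self_check(void)
{
	for (int a = ACT_ABORTED; a <= ACT_MAX; a++)
		assert(handle_switch((enum action)a) ==
		       handle_if_else((enum action)a));
}
```

Building this with "gcc -O2 -S" plus the CFLAGS variants from the Makefile
diff above and comparing the two functions in the assembly output should show
whether the switch version goes through an indirect branch on a given
compiler.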