Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Thu, 28 Feb 2019 19:13:03 +0100

On 02/28/2019 07:09 PM, H.J. Lu wrote:
> On Thu, Feb 28, 2019 at 9:58 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>> On 02/28/2019 05:25 PM, H.J. Lu wrote:
>>> On Thu, Feb 28, 2019 at 8:18 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>>>> On 02/28/2019 01:53 PM, H.J. Lu wrote:
>>>>> On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>>>>>> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
>>>>>>> Commit-ID:  ce02ef06fcf7a399a6276adb83f37373d10cbbe1
>>>>>>> Gitweb:     https://git.kernel.org/tip/ce02ef06fcf7a399a6276adb83f37373d10cbbe1
>>>>>>> Author:     Daniel Borkmann <daniel@xxxxxxxxxxxxx>
>>>>>>> AuthorDate: Thu, 21 Feb 2019 23:19:41 +0100
>>>>>>> Committer:  Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>>>>>> CommitDate: Thu, 28 Feb 2019 12:10:31 +0100
>>>>>>>
>>>>>>> x86, retpolines: Raise limit for generating indirect calls from switch-case
>>>>>>>
>>>>>>> From networking side, there are numerous attempts to get rid of indirect
>>>>>>> calls in fast-path wherever feasible in order to avoid the cost of
>>>>>>> retpolines, for example, just to name a few:
>>>>>>>
>>>>>>>   * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
>>>>>>>   * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
>>>>>>>   * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
>>>>>>>   * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
>>>>>>>   * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
>>>>>>>   * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
>>>>>>>   [...]
>>>>>>>
>>>>>>> Recent work on XDP from Björn and Magnus additionally found that manually
>>>>>>> transforming the XDP return code switch statement with more than 5 cases
>>>>>>> into if-else combination would result in a considerable speedup in XDP
>>>>>>> layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
>>>>>>> builds.
>>>>>>
>>>>>> +HJL
>>>>>>
>>>>>> This is a GCC bug, surely? It should know how expensive each
>>>>>> instruction is, and choose which to use accordingly. That should be
>>>>>> true even when the indirect branch "instruction" is a retpoline, and
>>>>>> thus enormously expensive.
>>>>>>
>>>>>> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
>>>>>> please at least reference that bug, and be prepared to turn this hack
>>>>>> off when GCC is fixed.
>>>>>
>>>>> We couldn't find a testcase to show jump table with indirect branch
>>>>> is slower than direct branches.
>>>>
>>>> Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
>>>> with the below on top.
>>>>
>>>>  Makefile | 6 +++---
>>>>  switch.c | 2 +-
>>>>  test.c   | 6 ++++--
>>>>  3 files changed, 8 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/Makefile b/Makefile
>>>> index bd83233..ea81520 100644
>>>> --- a/Makefile
>>>> +++ b/Makefile
>>>> @@ -1,16 +1,16 @@
>>>>  CC=gcc
>>>>  CFLAGS=-g -I.
>>>> -CFLAGS+=-O2 -mindirect-branch=thunk
>>>> +CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Does slowdown show up only with -mindirect-branch=thunk-inline?
>>
>> Not really, numbers are in similar range / outcome. Additionally, I also tried
>> on a bit bigger machine (Xeon Gold 5120 this time). First is thunk-inline, second
>> is thunk, and third is w/o raising limit for comparison; first test (from last
>> mail) on that machine:
> 
> Please re-open:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952
> 
> with new info.

Yeah will do, thanks!
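
For reference, the kind of manual transformation the commit message
describes (a switch statement with more than 5 cases rewritten as an
if-else combination) can be sketched roughly as below. This is only an
illustration: the enum, its values, and both function names are
hypothetical and are not taken from the XDP code or from the
microbenchmark. Whether a compiler actually emits a jump table (and thus
an indirect branch) for the switch variant depends on the compiler, the
target, and the flags in use.

```c
/* Hypothetical action codes, for illustration only. */
enum action { ACT_A, ACT_B, ACT_C, ACT_D, ACT_E, ACT_F };

/*
 * Dense switch with six cases. A compiler may lower this to a jump
 * table, i.e. an indirect branch, which is what turns into a retpoline
 * in retpoline-enabled builds.
 */
static int handle_switch(enum action a)
{
	switch (a) {
	case ACT_A: return 10;
	case ACT_B: return 20;
	case ACT_C: return 30;
	case ACT_D: return 40;
	case ACT_E: return 50;
	case ACT_F: return 60;
	default:    return -1;
	}
}

/*
 * Same mapping written as an if-else chain. This form is expressed as
 * compare-and-branch sequences rather than a table lookup.
 */
static int handle_if_else(enum action a)
{
	if (a == ACT_A)
		return 10;
	else if (a == ACT_B)
		return 20;
	else if (a == ACT_C)
		return 30;
	else if (a == ACT_D)
		return 40;
	else if (a == ACT_E)
		return 50;
	else if (a == ACT_F)
		return 60;
	return -1;
}
```

Both functions return the same value for every input, so the only
difference is how the compiler may choose to lower the control flow.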