Re: [PATCH v5 17/17] powerpc64/bpf: Add support for bpf trampolines

Michael Ellerman <mpe@xxxxxxxxxxxxxx> · Mon, 28 Oct 2024 16:46:40 +1100

Hari Bathini <hbathini@xxxxxxxxxxxxx> writes:
> On 10/10/24 3:09 pm, Hari Bathini wrote:
>> On 10/10/24 5:48 am, Michael Ellerman wrote:
>>> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
>>>> On Tue, Oct 1, 2024 at 12:18 AM Hari Bathini <hbathini@xxxxxxxxxxxxx> wrote:
>>>>> On 30/09/24 6:25 pm, Alexei Starovoitov wrote:
>>>>>> On Sun, Sep 29, 2024 at 10:33 PM Hari Bathini <hbathini@xxxxxxxxxxxxx> wrote:
>>>>>>> On 17/09/24 1:20 pm, Alexei Starovoitov wrote:
>>>>>>>> On Sun, Sep 15, 2024 at 10:58 PM Hari Bathini <hbathini@xxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +       /*
>>>>>>>>> +        * Generated stack layout:
>>>>>>>>> +        *
>>>>>>>>> +        * func prev back chain         [ back chain        ]
>>>>>>>>> +        *                              [                   ]
>>>>>>>>> +        * bpf prog redzone/tailcallcnt [ ...               ] 64 bytes (64-bit powerpc)
>>>>>>>>> +        *                              [                   ] --
>>>>>>>> ...
>>>>>>>>> +
>>>>>>>>> +       /* Dummy frame size for proper unwind - includes 64-bytes red zone for 64-bit powerpc */
>>>>>>>>> +       bpf_dummy_frame_size = STACK_FRAME_MIN_SIZE + 64;
>>>>>>>>
>>>>>>>> What is the goal of such a large "red zone" ?
>>>>>>>> The kernel stack is a limited resource.
>>>>>>>> Why reserve 64 bytes ?
>>>>>>>> tail call cnt can probably be optional as well.
>>>>>>>
>>>>>>> Hi Alexei, thanks for reviewing.
>>>>>>> FWIW, the redzone on ppc64 is 288 bytes. BPF JIT for ppc64 was using
>>>>>>> a redzone of 80 bytes since tailcall support was introduced [1].
>>>>>>> It came down to 64 bytes thanks to [2]. The red zone is being used
>>>>>>> to save NVRs and tail call count when a stack is not set up. I do
>>>>>>> agree that we should look at optimizing it further. Do you think
>>>>>>> the optimization should go as part of PPC64 trampoline enablement
>>>>>>> being done here or should that be taken up as a separate item, maybe?
>>>>>>
>>>>>> The follow up is fine.
>>>>>> It's just odd to me that we currently have:
>>>>>>
>>>>>> [   unused red zone ] 208 bytes protected
>>>>>>
>>>>>> I simply don't understand why we need to waste this much stack space.
>>>>>> Why can't it be zero today ?
>>>>>
>>>>> The ABI for ppc64 has a redzone of 288 bytes below the current
>>>>> stack pointer that can be used as a scratch area until a new
>>>>> stack frame is created. So, no wastage of stack space as such.
>>>>> It is just red zone that can be used before a new stack frame
>>>>> is created. The comment there is only to show how redzone is
>>>>> being used in ppc64 BPF JIT. I think the confusion is with the
>>>>> mention of "208 bytes" as protected. As not all of that scratch
>>>>> area is used, it mentions the remaining as unused. Essentially
>>>>> 288 bytes below current stack pointer is protected from debuggers
>>>>> and interrupt code (red zone). Note that it should be 224 bytes
>>>>> of unused red zone instead of 208 bytes as red zone usage in
>>>>> ppc64 BPF JIT came down from 80 bytes to 64 bytes since [2].
>>>>> Hope that clears up the misunderstanding.
>>>>
>>>> I see. That makes sense. So it's similar to amd64 red zone,
>>>> but there we have an issue with irqs, hence the kernel is
>>>> compiled with -mno-red-zone.
>>>
>>> I assume that issue is that the interrupt entry unconditionally writes
>>> some data below the stack pointer, disregarding the red zone?
>>>
>>>> I guess ppc always has a different interrupt stack and
>>>> it's not an issue?
>>>
>>> No, the interrupt entry allocates a frame that is big enough to cover
>>> the red zone as well as the space it needs to save registers.
>>>
>>> See STACK_INT_FRAME_SIZE which includes KERNEL_REDZONE_SIZE:
>>>
>>>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/include/asm/ptrace.h?commit=8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b#n165
>>>
>>> Which is renamed to INT_FRAME_SIZE in asm-offsets.c and then is used in
>>> the interrupt entry here:
>>>
>>>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/exceptions-64s.S?commit=8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b#n497
>> 
>> Thanks for clarifying that, Michael.
>> Only async interrupt handlers use different interrupt stacks, right?
>
> ... and separate emergency stack for some special cases...

There isn't a neat rule like sync/async.

Most interrupts use the normal kernel stack, whether sync or async.

External interrupts switch to a separate hard interrupt stack
(hardirq_ctx) in call_do_irq(), but only after coming in on the kernel
stack first.

Some interrupts use the emergency stack (in some cases), e.g. HMI, soft
NMI (fake), TM bad thing (program check). Others use their own stack:
system reset (nmi_emergency_sp), machine check (mc_emergency_sp).

cheers