Re: Idea for "function meta"

Menglong Dong <menglong8.dong@xxxxxxxxx> · Fri, 7 Feb 2025 16:16:34 +0800

On Sat, Jan 4, 2025 at 3:28 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Dec 24, 2024 at 7:25 PM Menglong Dong <menglong8.dong@xxxxxxxxx> wrote:
> >
> > On Fri, Dec 20, 2024 at 10:01 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Dec 20, 2024 at 09:57:22PM +0800, Menglong Dong wrote:
> > >
> > > > However, the other 5-bytes will be consumed if CFI_CLANG is
> > > > enabled, and the space is not enough anymore in this case, and
> > > > the insn will be like this:
> > > >
> > > > __cfi_do_test:
> > > > mov (5byte)
> > > > nop nop (2 bytes)
> > > > sarq (9 bytes)
> > > > do_test:
> > > > xxx
> > > >
> > >
> > > FineIBT will fully consume those 16 bytes.
> > >
> > > Also, text is ROX, you cannot easily write there. Furthermore, writing
> > > non-instructions there will destroy disassemblers ability to make sense
> > > of the memory.
> >
> > Thanks for the reply. Your words make sense, and it
> > seems to be dangerous too.
>
> Raw bytes are indeed dangerous in the text section, but
> I think we can make it work.
>
> We can prepend 5 byte mov %eax, 0x12345678
> or 10 byte mov %rax, 0x12345678aabbccdd
> instructions before function entry and before FineIBT/kcfi preamble.
>
> Ideally 5 byte insn and use 4 byte as an offset within 4Gb region
> for this per-function metadata that we will allocate on demand.
> We can prototype with 10 byte insn and full 8 byte pointer to metadata.
> Without mitigations it will be
> -fpatchable-function-entry=10
> with FineIBT
> -fpatchable-function-entry=26
>
> but we have to measure the impact on I-cache iTLB first.
>
> Menglong,
> could you do performance benchmarking for no-mitigation kernel
> with extra 5 and extra 10 bytes of padding ?

Hi Alexei,

(Sorry for the late reply, I was celebrating the Spring Festival a few
days ago :/ )

I did some performance benchmarking recently with sysbench.
The only case I ran is sysbench's threads test:

  sysbench --time=60 threads run

I disabled mitigations and compiled a kernel with 5-byte padding
using the following changes:

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2485,10 +2485,10 @@ config FUNCTION_PADDING_CFI
 config FUNCTION_PADDING_BYTES
        int
        default FUNCTION_PADDING_CFI if CFI_CLANG
-       default FUNCTION_ALIGNMENT
+       default 5

 config CALL_PADDING
-       def_bool n
+       def_bool y
        depends on CC_HAS_ENTRY_PADDING && OBJTOOL
        select FUNCTION_ALIGNMENT_16B

I ran the tests in a KVM guest with the following steps:

1. Isolate 4 cores on the host by adding the following param to
   the cmdline:
   isolcpus=0-3
2. Start a KVM guest, and isolate 2 cores inside it by adding
   "isolcpus=0,1" to the guest's cmdline.
3. Pin the guest's vCPU threads to the CPUs that were isolated
   on the host.
4. Run the following command in the guest to perform the
   benchmark:
   taskset -c 0 sysbench --time=60 threads run
   and collect the statistics with perf at the same time:
   perf stat -C 0 -- sleep 10
I tested not only 5-byte padding, but also 1, 3, 4, 6, 7, 8 and
10 bytes of padding, with 16-byte padding as the baseline. The
results (three runs per padding size) are as follows:

| PADDING (BYTES) | RESULT | cycles    | IPC (insns per cycle) | stalled cycles per insn | stalled-cycles-frontend |
| --------------- | ------ | --------- | --------------------- | ----------------------- | ----------------------- |
| 1               | 120577 | 4.790 GHz | 3.05                  | 0.04                    | 12.18%                  |
| 1               | 120657 | 4.815 GHz | 3.05                  | 0.04                    | 12.04%                  |
| 1               | 120172 | 4.789 GHz | 3.05                  | 0.04                    | 12.25%                  |
| 3               | 117454 | 4.804 GHz | 2.98                  | 0.04                    | 12.97%                  |
| 3               | 117418 | 4.815 GHz | 2.98                  | 0.04                    | 13.12%                  |
| 3               | 117864 | 4.815 GHz | 2.98                  | 0.04                    | 13.06%                  |
| 4               | 120825 | 4.767 GHz | 3.08                  | 0.04                    | 11.02%                  |
| 4               | 121361 | 4.816 GHz | 3.08                  | 0.04                    | 11.00%                  |
| 4               | 121227 | 4.792 GHz | 3.08                  | 0.04                    | 11.04%                  |
| 5               | 120214 | 4.804 GHz | 3.05                  | 0.04                    | 10.91%                  |
| 5               | 120295 | 4.772 GHz | 3.07                  | 0.04                    | 10.99%                  |
| 5               | 120980 | 4.798 GHz | 3.07                  | 0.04                    | 11.00%                  |
| 6               | 120151 | 4.776 GHz | 3.05                  | 0.04                    | 11.73%                  |
| 6               | 119700 | 4.803 GHz | 3.04                  | 0.04                    | 11.77%                  |
| 6               | 120030 | 4.789 GHz | 3.05                  | 0.04                    | 11.88%                  |
| 7               | 115081 | 4.789 GHz | 2.93                  | 0.05                    | 13.77%                  |
| 7               | 115681 | 4.795 GHz | 2.94                  | 0.05                    | 13.37%                  |
| 7               | 115954 | 4.817 GHz | 2.95                  | 0.05                    | 13.46%                  |
| 8               | 119675 | 4.768 GHz | 3.04                  | 0.04                    | 12.10%                  |
| 8               | 120442 | 4.824 GHz | 3.05                  | 0.04                    | 12.06%                  |
| 8               | 120260 | 4.793 GHz | 3.04                  | 0.04                    | 12.21%                  |
| 10              | 116292 | 4.788 GHz | 2.97                  | 0.04                    | 12.69%                  |
| 10              | 116543 | 4.815 GHz | 2.97                  | 0.04                    | 12.74%                  |
| 10              | 116654 | 4.794 GHz | 2.97                  | 0.04                    | 12.81%                  |
| 16              | 120051 | 4.786 GHz | 3.05                  | 0.04                    | 11.21%                  |
| 16              | 120450 | 4.808 GHz | 3.05                  | 0.04                    | 11.19%                  |
| 16              | 120562 | 4.831 GHz | 3.05                  | 0.04                    | 11.22%                  |

I haven't found a pattern in how the padding size affects the
result, but we can see that the performance is OK for 1, 4, 5, 6 and
8 bytes of padding, meaning it is the same as with 16-byte padding.
It is not OK for 3, 7 and 10 bytes of padding.
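
To put numbers on "ok" and "not ok", the RESULT column can be averaged
per padding size and compared with the 16-byte baseline. This is just a
sketch for reading the table: the values are copied from the RESULT
column above, and the summarize() helper is a name made up here.

```python
from statistics import mean

# sysbench RESULT values copied from the table (three runs per padding size).
results = {
    1: [120577, 120657, 120172],
    3: [117454, 117418, 117864],
    4: [120825, 121361, 121227],
    5: [120214, 120295, 120980],
    6: [120151, 119700, 120030],
    7: [115081, 115681, 115954],
    8: [119675, 120442, 120260],
    10: [116292, 116543, 116654],
    16: [120051, 120450, 120562],
}


def summarize(results, baseline=16):
    """Return {padding: (mean_result, percent_change_vs_baseline)}."""
    base = mean(results[baseline])
    return {
        pad: (mean(runs), 100.0 * (mean(runs) - base) / base)
        for pad, runs in results.items()
    }


if __name__ == "__main__":
    for pad, (avg, delta) in sorted(summarize(results).items()):
        print(f"{pad:>2} bytes: mean {avg:9.1f}  ({delta:+.2f}% vs 16 bytes)")
```

With these numbers, 3, 7 and 10 bytes come out about 2.3%, 4.0% and
3.2% below the 16-byte mean, while 1, 4, 5, 6 and 8 bytes stay within
about 0.7% of it. Three runs per size is a small sample, so this is
only a rough guide.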

I didn't test every possible padding size, as it is time-consuming
:/

So it seems that we can add an extra 5 bytes to the padding without
any performance loss. But I'm not sure whether it has any other
impact.

So we have two ways to implement this:
1. Add an extra 5 bytes of padding when necessary. This keeps
   vmlinux as small as possible.
2. Make FUNCTION_ALIGNMENT 32 bytes, which makes vmlinux ~5%
   larger.

BTW, we don't need to do anything if CFI_CLANG is not enabled,
as there are 7 bytes of spare padding in that case, which is enough
for us.

What do you think?

Thanks!
Menglong Dong

>
> Since we have:
> select FUNCTION_ALIGNMENT_16B           if X86_64 || X86_ALIGNMENT_16
>
> the functions are aligned to 16 all the time,
> so there is some gap between them.
> Extra -fpatchable-function-entry=5 might be in the noise
> from performance point of view,
> but the ability to provide such per function metadata block
> will be very useful for all kinds of use cases.