On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
>
> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
> [...]
> > It turned out there are indeed a bunch of redundant shifts
> > when u32 or s32 is passed into "r" asm constraint.
> >
> > Strangely the shifts are there when compiled with -mcpu=v3 or v4
> > and no shifts with -mcpu=v1 and v2.
> >
> > Also weird that u8 and u16 are passed into "r" without redundant shifts.
> > Hence I found a "workaround": cast u32 into u16 while passing.
> > The truncation of u32 doesn't happen and shifts to zero upper 32-bit
> > are gone as well.
> >
> > https://godbolt.org/z/Kqszr6q3v
>
> Regarding unnecessary shifts.
> Sorry, a long email about a minor feature/defect.
>
> So, currently the following C program
> (and its variations with implicit casts):
>
>     extern unsigned long bar(void);
>     void foo(void) {
>       asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>     }
>
> is translated to the following BPF:
>
>     $ clang -mcpu=v3 -O2 --target=bpf -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>
>     <stdin>:    file format elf64-bpf
>
>     Disassembly of section .text:
>
>     0000000000000000 <foo>:
>            0:    call -0x1
>            1:    r0 <<= 0x20
>            2:    r0 >>= 0x20
>            3:    r0 += 0x1
>            4:    exit
>
> Note: no additional shifts are generated when the "w" (32-bit register)
> constraint is used instead of "r".
>
> First, is this right or wrong?
> ------------------------------
>
> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
> the following:
>
>     If the value of the expression is represented with greater range or
>     precision than required by the type named by the cast (6.3.1.8),
>     then the cast specifies a conversion even if the type of the
>     expression is the same as the named type and removes any extra range
>     and precision.                               ^^^^^^^^^^^^^^^^^^^^^^^
>     ^^^^^^^^^^^^^
>
> What do other LLVM backends do in such situations?
> Consider the following program translated to amd64 [2] and aarch64 [3]:
>
>     void foo(void) {
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>     }
>
> - for amd64 a register of proper size is selected for `reg`;
> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>   and a 64-bit register is used w/o generating any additional instructions.
>
> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>  registers for all three points).
>
> So, it looks like something of this sort should be done:
> - either extra precision should be removed via additional instructions;
> - or a 32-bit register should be picked for `reg`;
> - or a warning should be emitted as in the aarch64 case.
>
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
> [2] https://godbolt.org/z/9nKxaMc5j
> [3] https://godbolt.org/z/1zxEr5b3f
>
>
> Second, what to do?
> -------------------
>
> I think that the following steps are needed:
> - Investigation described in the next section shows that currently the two
>   shifts are generated accidentally, w/o real intent to shed precision.
>   I have a patch [6] that removes the shift generation; it should be applied.
> - When a 32-bit value is passed to the "r" constraint:
>   - for cpu v3/v4 a 32-bit register should be selected;
>   - for cpu v1/v2 a warning should be reported.

Thank you for the detailed analysis.

Agree that the llvm fix [6] is a necessary step; then using 'w' in v3/v4
and warning on v1/v2 makes sense too, but we have this macro:

  #define barrier_var(var) asm volatile("" : "+r"(var))

that is used in various bpf production programs.
tetragon is also a heavy user of inline asm.
Right now a full 64-bit register is allocated for it,
so switching to 'w' might cause unexpected behavior,
including rejection by the verifier.
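To make that concern concrete, here is a made-up sketch (the function and
buffer names are invented, this is not from any real program) of the kind of
pattern that leans on barrier_var() with a u32 today:

  /* made-up sketch: barrier_var() keeps the compiler from folding or
   * reordering the bounds check, so the value the verifier sees being
   * checked is the one actually used as the index */
  #define barrier_var(var) asm volatile("" : "+r"(var))

  extern char buf[64];

  static inline int read_at(unsigned int off)
  {
          barrier_var(off);       /* today this operates on a full 64-bit reg */
          if (off > sizeof(buf) - sizeof(int))
                  return 0;
          return *(int *)(buf + off);
  }

If "+r" starts picking a 32-bit register on v3/v4, the instruction mix and
the registers the verifier tracks in such code change, and it's hard to
guarantee that every existing program keeps loading.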
I think it makes sense to align the bpf backend with arm64 and x86,
but we need to broadcast this change widely.
Also need to align with GCC. (Jose cc-ed)

And, most importantly, we need a way to go back to the old behavior, since

  u32 var; asm("...":: "r"(var));

will now allocate a "w" register or warn.

What should be the workaround? I've tried:

  u32 var; asm("...":: "r"((u64)var));

https://godbolt.org/z/n4ejvWj7v

and x86/arm64 generate a 32-bit truncation.
Sounds like the bpf backend has to do it as well.
We should be doing 'wX=wX' in such a case (just like x86)
instead of <<=32 >>=32.
I think this should be done as a separate diff.
Our current pattern of using shifts is inefficient and guaranteed to screw up
verifier range analysis, while wX=wX is faster and more verifier-friendly.

Yes, it's still not going to be 1-1 with the old (our current) behavior.
We probably need some macro (like we did for __BPF_CPU_VERSION__) to identify
such a fixed llvm, so existing users with the '(short)' workaround and other
tricks can detect new vs old compiler.
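Roughly something like this on the program side, perhaps (the feature macro
and helper names below are made up purely to illustrate; the real macro, if
we add one, would be predefined by the compiler in the spirit of
__BPF_CPU_VERSION__):

  #ifdef __BPF_FEATURE_ASM_R_NO_SHIFTS           /* made-up macro name */
  /* fixed compiler: a u32 passed to "r" no longer emits <<=32 >>=32 */
  #define ASM_ARG_U32(x) (x)
  #else
  /* old compiler: keep the (unsigned short) trick -- no actual truncation
   * is emitted, but the redundant shifts go away */
  #define ASM_ARG_U32(x) ((unsigned short)(x))
  #endif

  static inline void touch(unsigned int v)
  {
          asm volatile ("%[reg] += 1" :: [reg]"r"(ASM_ARG_U32(v)));
  }

With something like that, programs could drop the cast on a fixed compiler
and keep the old trick everywhere else.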
Looks like we opened a can of worms.
Aligning with x86/arm64 makes sense,
but the cost of doing the right thing is hard to estimate.

>
> Third, why two shifts are generated?
> ------------------------------------
>
> (Details here might be interesting to Yonghong, regular reader could
>  skip this section).
>
> The two shifts are result of interaction between two IR constructs
> `trunc` and `asm`. The C program above looks as follows in LLVM IR
> before machine code generation:
>
>     declare dso_local i64 @bar()
>     define dso_local void @foo(i32 %p) {
>     entry:
>       %call = call i64 @bar()
>       %v32 = trunc i64 %call to i32
>       tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>       ret void
>     }
>
> Initial selection DAG:
>
>     $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>     SelectionDAG has 21 nodes:
>       ...
>         t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
> !         t11: i32 = truncate t10
> !       t15: i64 = zero_extend t11
>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>       ...
>
> Note values t11 and t15 marked with (!).
>
> Optimized lowered selection DAG for this fragment:
>
>         t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
> !       t22: i64 = and t10, Constant:i64<4294967295>
>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>
> DAG after instruction selection:
>
>         t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
> !         t25: i64 = SLL_ri t10, TargetConstant:i64<32>
> !       t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>       t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
> This happens because of the following pattern from BPFInstrInfo.td:
>
>     // 0xffffFFFF doesn't fit into simm32, optimize common case
>     def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>               (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>
> So, the two shift instructions are result of translation of (zext (trunc ...)).
> However, closer examination shows that zext DAG node was generated
> almost by accident. Here is the backtrace for when this node was created:
>
>     Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>         at .../SelectionDAG.cpp:5605
>     (gdb) bt
>     #0  llvm::SelectionDAG::getNode (...)
>         at ...SelectionDAG.cpp:5605
>     #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>         at .../SelectionDAGBuilder.cpp:537
>     #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>         at .../SelectionDAGBuilder.cpp:958
>     #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>         at .../SelectionDAGBuilder.cpp:9640
>     ...
>
> The stack frame #2 is interesting, here is the code for it [4]:
>
>     void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>                                      const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>                                      const Value *V,
>                                      ISD::NodeType PreferredExtendType) const {
>                                                    ^
>                                                    '-- this is ANY_EXTEND
>       ...
>       for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>         ...
>              .-- this returns true
>              v
>         if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>           ExtendKind = ISD::ZERO_EXTEND;
>
>                           .-- this is ZERO_EXTEND
>                           v
>         getCopyToParts(..., ExtendKind);
>         Part += NumParts;
>       }
>       ...
>     }
>
> The getCopyToRegs() function was called with ANY_EXTEND preference,
> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
> true for any 32 to 64-bit conversion [5].
> However, in this case this is clearly a mistake, as zero extension of
> (zext i64 (truncate i32 ...)) costs two instructions.
>
> The isZExtFree() behavior could be changed to report false for such
> situations, as in my patch [6]. This in turn removes zext =>
> removes two shifts from final asm.
> Here is how DAG/asm look after patch [6]:
>
> Initial selection DAG:
>
>       ...
>         t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
> !         t11: i32 = truncate t10
>       t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>       t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>       ...
>
> Final asm:
>
>     ...
>     # %bb.0:
>             call bar
>     #APP
>             r0 += 1
>     #NO_APP
>             exit
>     ...
>
> Note that [6] is a very minor change, it does not affect code
> generation for selftests at all and I was unable to conjure examples
> where it has effect aside from inline asm parameters.
>
> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>