Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro

>> On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
>>>
>>> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
>>> [...]
>>> > It turned out there are indeed a bunch of redundant shifts
>>> > when u32 or s32 is passed into "r" asm constraint.
>>> >
>>> > Strangely the shifts are there when compiled with -mcpu=v3 or v4
>>> > and no shifts with -mcpu=v1 and v2.
>>> >
>>> > Also weird that u8 and u16 are passed into "r" without redundant shifts.
>>> > Hence I found a "workaround": cast u32 into u16 while passing.
>>> > The truncation of the u32 doesn't happen, and the shifts that zero the
>>> > upper 32 bits are gone as well.
>>> >
>>> > https://godbolt.org/z/Kqszr6q3v
>>>
>>> Regarding unnecessary shifts.
>>> Sorry, a long email about a minor feature/defect.
>>>
>>> So, currently the following C program
>>> (and its variations with implicit casts):
>>>
>>>     extern unsigned long bar(void);
>>>     void foo(void) {
>>>       asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>>>     }
>>>
>>> Is translated to the following BPF:
>>>
>>>     $ clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>>>
>>>     <stdin>:    file format elf64-bpf
>>>
>>>     Disassembly of section .text:
>>>
>>>     0000000000000000 <foo>:
>>>            0:   call -0x1
>>>            1:   r0 <<= 0x20
>>>            2:   r0 >>= 0x20
>>>            3:   r0 += 0x1
>>>            4:   exit
>>>
>>> Note: no additional shifts are generated when "w" (32-bit register)
>>>       constraint is used instead of "r".
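>>>
>>> For illustration, here is the same program with "w" in place of "r"
>>> (a sketch; the exact register allocation may of course differ):
>>>
>>>     extern unsigned long bar(void);
>>>     void foo(void) {
>>>       asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
>>>     }
>>>
>>> which with -mcpu=v3 compiles to something like:
>>>
>>>            0:   call -0x1
>>>            1:   w0 += 0x1
>>>            2:   exit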
>>>
>>> First, is this right or wrong?
>>> ------------------------------
>>>
>>> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
>>> the following:
>>>
>>>   If the value of the expression is represented with greater range or
>>>   precision than required by the type named by the cast (6.3.1.8),
>>>   then the cast specifies a conversion even if the type of the
>>>   expression is the same as the named type and removes any extra range
>>>   and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>   ^^^^^^^^^^^^^
>>>
>>> What do other LLVM backends do in such situations?
>>> Consider the following program translated to amd64 [2] and aarch64 [3]:
>>>
>>>     void foo(void) {
>>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>>>     }
>>>
>>> - for amd64 a register of the proper size is selected for `reg`;
>>> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>>>   and a 64-bit register is used w/o generating any additional instructions.
>>>
>>> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>>>  registers for all three points).
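>>>
>>> (For reference, a "proper size" register on amd64 means the operand is
>>>  rendered as a 64-bit, 32-bit and 16-bit register respectively,
>>>  e.g. something like:
>>>
>>>      mov %rax,%rax   # (1) unsigned long
>>>      mov %eax,%eax   # (2) unsigned int
>>>      mov %ax,%ax     # (3) unsigned short
>>>
>>>  the exact registers depend on allocation, see [2].)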
>>>
>>> So, it looks like something of this sort should be done:
>>> - either the extra precision should be removed via additional instructions;
>>> - or a 32-bit register should be picked for `reg`;
>>> - or a warning should be emitted, as in the aarch64 case.
>>>
>>> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
>>> [2] https://godbolt.org/z/9nKxaMc5j
>>> [3] https://godbolt.org/z/1zxEr5b3f
>>>
>>>
>>> Second, what to do?
>>> -------------------
>>>
>>> I think that the following steps are needed:
>>> - Investigation described in the next section shows that currently two
>>>   shifts are generated accidentally w/o real intent to shed precision.
>>>   I have a patch [6] that removes shifts generation, it should be applied.
>>> - When a 32-bit value is passed to the "r" constraint:
>>>   - for cpu v3/v4 a 32-bit register should be selected;
>>>   - for cpu v1/v2 a warning should be reported.
>>
>> Thank you for the detailed analysis.
>>
>> Agree that the llvm fix [6] is a necessary step; then
>> using 'w' in v3/v4 and warning on v1/v2 makes sense too,
>> but we have this macro:
>> #define barrier_var(var) asm volatile("" : "+r"(var))
>> that is used in various bpf production programs.
>> tetragon is also a heavy user of inline asm.
>>
>> Right now a full 64-bit register is allocated,
>> so switching to 'w' might cause unexpected behavior
>> including rejection by the verifier.
>> I think it makes sense to align the bpf backend with arm64 and x86,
>> but we need to broadcast this change widely.
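>>
>> For context, a typical barrier_var() use in such programs looks roughly
>> like this (simplified sketch; MAP_SIZE, values and lookup are made up):
>>
>>     #define MAP_SIZE 16                 /* made-up bound                  */
>>     static __u32 values[MAP_SIZE];
>>
>>     static __u32 lookup(__u32 idx)
>>     {
>>         barrier_var(idx);               /* hide idx from the optimizer    */
>>         if (idx >= MAP_SIZE)            /* bounds check the verifier sees */
>>             return 0;
>>         return values[idx];
>>     }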
>>
>> Also need to align with GCC. (Jose cc-ed)
>
> GCC doesn't have an integrated assembler, so with -masm=pseudoc it just
> compiles the program above to:
>
>   foo:
>   	call bar
>   	r0 += 1
> 	exit
>
> Also, at the moment we don't support a "w" constraint, because the
> assembly-like assembly syntax we started with implies different
> instructions that interpret the values stored in the BPF 64-bit
> registers as 32-bit or 64-bit values, i.e.
>
>   mov %r1, 1
>   mov32 %r1, 1
>
> But then the pseudo-c assembly syntax (that we also support) translates
> some of the semantics of the instructions to the register names,
> creating the notion that BPF actually has both 32-bit registers and
> 64-bit registers, i.e.
>
>   r1 += 1
>   w1 += 1
>
> In GCC we support both assembly syntaxes and currently we lack the
> ability to emit 32-bit variants in templates like "%[reg] += 1", so I
> suppose we can introduce a "w" constraint to:
>
> 1. When assembly-like assembly syntax is used, expect a 32-bit mode to
>    match the operand and warn about operand size overflow whenever
>    necessary.  Always emit "%r" as the register name.
>
> 2. When pseudo-c assembly syntax is used, expect a 32-bit mode to match
>    the operand and warn about operand size overflow whenever necessary,
>    and then emit "w" instead of "r" as the register name.
>
>> And, most importantly, we need a way to go back to the old behavior,
>> since u32 var; asm("...":: "r"(var)); will now
>> allocate a "w" register or warn.
>
> Is it really necessary to change the meaning of "r"?  You can write
> templates like the one triggering this problem like:
>
>   asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
>
> Then the checks above will be performed, driven by the particular
> constraint explicitly specified by the user, not driven by the type of
> the value passed as the operand.
>
> Or am I misunderstanding?

[I have just added a proposal for an agenda item to this week's BPF
 Office Hours so we can discuss BPF sub-registers and compiler
 constraints, to complement this thread.]

>
>> What should be the workaround?
>>
>> I've tried:
>> u32 var; asm("...":: "r"((u64)var));
>>
>> https://godbolt.org/z/n4ejvWj7v
>>
>> and x86/arm64 generate a 32-bit truncation.
>> Sounds like the bpf backend has to do it as well.
>> We should be doing 'wX=wX' in such case (just like x86)
>> instead of <<=32 >>=32.
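>>
>> I.e. for zero-extending r0 that is:
>>
>>     r0 <<= 32
>>     r0 >>= 32
>>
>> (the current idiom, two shifts) vs the single instruction:
>>
>>     w0 = w0
>>
>> which on v3+ also clears the upper 32 bits of r0.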
>>
>> I think this should be done as a separate diff.
>> Our current pattern of using shifts is inefficient and guaranteed
>> to screw up verifier range analysis, while wX=wX is faster
>> and more verifier-friendly.
>> Yes it's still not going to be 1-1 to old (our current) behavior.
>>
>> We probably need some macro (like we did for __BPF_CPU_VERSION__)
>> to identify such a fixed llvm, so existing users with the '(short)'
>> workaround and other tricks can detect new vs old compiler.
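>>
>> Something along these lines, given some u32 var (the feature macro name
>> below is just a placeholder for whatever we end up defining):
>>
>>     #ifdef __BPF_FEATURE_ASM_V32        /* hypothetical detection macro  */
>>         /* fixed compiler: pass the u32 as-is, no shifts expected        */
>>         asm volatile("%[v] += 1" : : [v]"r"(var));
>>     #else
>>         /* old compiler: keep the '(short)' cast workaround              */
>>         asm volatile("%[v] += 1" : : [v]"r"((unsigned short)var));
>>     #endif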
>>
>> Looks like we opened a can of worms.
>> Aligning with x86/arm64 makes sense, but the cost of doing
>> the right thing is hard to estimate.
>>
>>>
>>> Third, why are two shifts generated?
>>> ------------------------------------
>>>
>>> (Details here might be interesting to Yonghong; a regular reader can
>>>  skip this section).
>>>
>>> The two shifts are the result of an interaction between two IR constructs:
>>> `trunc` and `asm`. The C program above looks as follows in LLVM IR
>>> before machine code generation:
>>>
>>>     declare dso_local i64 @bar()
>>>     define dso_local void @foo(i32 %p) {
>>>     entry:
>>>       %call = call i64 @bar()
>>>       %v32 = trunc i64 %call to i32
>>>       tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>>>       ret void
>>>     }
>>>
>>> Initial selection DAG:
>>>
>>>     $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>>>     SelectionDAG has 21 nodes:
>>>       ...
>>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>    !     t11: i32 = truncate t10
>>>    !    t15: i64 = zero_extend t11
>>>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>>>         t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>       ...
>>>
>>> Note values t11 and t15 marked with (!).
>>>
>>> Optimized lowered selection DAG for this fragment:
>>>
>>>     t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>   !   t22: i64 = and t10, Constant:i64<4294967295>
>>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>
>>> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>>>
>>> DAG after instruction selection:
>>>
>>>     t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>>>   !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>>>   !   t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>>       t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>
>>> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
>>> This happens because of the following pattern from BPFInstrInfo.td:
>>>
>>>     // 0xffffFFFF doesn't fit into simm32, optimize common case
>>>     def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>>>               (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>>>
>>> So, the two shift instructions are the result of translating (zext (trunc ...)).
>>> However, closer examination shows that the zext DAG node was generated
>>> almost by accident. Here is the backtrace for when this node was created:
>>>
>>>     Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>>>         at .../SelectionDAG.cpp:5605
>>>     (gdb) bt
>>>     #0  llvm::SelectionDAG::getNode (...)
>>>         at ...SelectionDAG.cpp:5605
>>>     #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>>>         at .../SelectionDAGBuilder.cpp:537
>>>     #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>>>         at .../SelectionDAGBuilder.cpp:958
>>>     #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>>>         at .../SelectionDAGBuilder.cpp:9640
>>>         ...
>>>
>>> Stack frame #2 is interesting; here is the code for it [4]:
>>>
>>>     void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>>>                                      const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>>>                                      const Value *V,
>>>                                      ISD::NodeType PreferredExtendType) const {
>>>                                                    ^
>>>                                                    '-- this is ANY_EXTEND
>>>       ...
>>>       for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>>>         ...
>>>                                                    .-- this returns true
>>>                                                    v
>>>         if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>>>           ExtendKind = ISD::ZERO_EXTEND;
>>>
>>>                                .-- this is ZERO_EXTEND
>>>                                v
>>>         getCopyToParts(..., ExtendKind);
>>>         Part += NumParts;
>>>       }
>>>       ...
>>>     }
>>>
>>> The getCopyToRegs() function was called with ANY_EXTEND preference,
>>> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
>>> true for any 32 to 64-bit conversion [5].
>>> However, in this case this is clearly a mistake, as the zero extension
>>> (zext i64 (truncate i32 ...)) costs two instructions.
>>>
>>> The isZExtFree() behavior could be changed to report false for such
>>> situations, as in my patch [6]. This in turn removes the zext and
>>> thus the two shifts from the final asm.
>>> Here is how the DAG/asm look after patch [6]:
>>>
>>>     Initial selection DAG:
>>>       ...
>>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>   !   t11: i32 = truncate t10
>>>       t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>>>         t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>>>       ...
>>>
>>> Final asm:
>>>
>>>     ...
>>>     # %bb.0:
>>>         call bar
>>>         #APP
>>>         r0 += 1
>>>         #NO_APP
>>>         exit
>>>     ...
>>>
>>> Note that [6] is a very minor change: it does not affect code
>>> generation for selftests at all, and I was unable to conjure examples
>>> where it has an effect aside from inline asm parameters.
>>>
>>> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
>>> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
>>> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>>>




