On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
>
> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
> [...]
> > It turned out there are indeed a bunch of redundant shifts
> > when u32 or s32 is passed into "r" asm constraint.
> >
> > Strangely the shifts are there when compiled with -mcpu=v3 or v4
> > and no shifts with -mcpu=v1 and v2.
> >
> > Also weird that u8 and u16 are passed into "r" without redundant shifts.
> > Hence I found a "workaround": cast u32 into u16 while passing.
> > The truncation of u32 doesn't happen and shifts to zero upper 32-bit
> > are gone as well.
> >
> > https://godbolt.org/z/Kqszr6q3v
>
> Regarding unnecessary shifts.
> Sorry, a long email about a minor feature/defect.
>
> So, currently the following C program
> (and its variations with implicit casts):
>
>     extern unsigned long bar(void);
>     void foo(void) {
>       asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>     }
>
> is translated to the following BPF:
>
>     $ clang -mcpu=v3 -O2 --target=bpf -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>
>     <stdin>:    file format elf64-bpf
>
>     Disassembly of section .text:
>
>     0000000000000000 <foo>:
>            0:    call -0x1
>            1:    r0 <<= 0x20
>            2:    r0 >>= 0x20
>            3:    r0 += 0x1
>            4:    exit
>
> Note: no additional shifts are generated when the "w" (32-bit register)
> constraint is used instead of "r".
>
> First, is this right or wrong?
> ------------------------------
>
> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
> the following:
>
>     If the value of the expression is represented with greater range or
>     precision than required by the type named by the cast (6.3.1.8),
>     then the cast specifies a conversion even if the type of the
>     expression is the same as the named type and removes any extra range
>     and precision.                               ^^^^^^^^^^^^^^^^^^^^^^^
>     ^^^^^^^^^^^^^
>
> What do other LLVM backends do in such situations?
> Consider the following program translated to amd64 [2] and aarch64 [3]:
>
>     void foo(void) {
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>     }
>
> - for amd64 a register of proper size is selected for `reg`;
> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>   and a 64-bit register is used w/o generating any additional instructions.
>
> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>  registers for all three points).
>
> So, it looks like something of this sort should be done:
> - either extra precision should be removed via additional instructions;
> - or a 32-bit register should be picked for `reg`;
> - or a warning should be emitted as in the aarch64 case.
>
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
> [2] https://godbolt.org/z/9nKxaMc5j
> [3] https://godbolt.org/z/1zxEr5b3f
>
>
> Second, what to do?
> -------------------
>
> I think that the following steps are needed:
> - Investigation described in the next section shows that currently the two
>   shifts are generated accidentally, w/o real intent to shed precision.
>   I have a patch [6] that removes the shift generation; it should be applied.
> - When a 32-bit value is passed to the "r" constraint:
>   - for cpu v3/v4 a 32-bit register should be selected;
>   - for cpu v1/v2 a warning should be reported.

Thank you for the detailed analysis.

Agree that the llvm fix [6] is a necessary step; then using 'w' in v3/v4
and warning on v1/v2 makes sense too, but we have this macro:

  #define barrier_var(var) asm volatile("" : "+r"(var))

that is used in various bpf production programs.
tetragon is also a heavy user of inline asm.
Right now a full 64-bit register is allocated for it,
so switching to 'w' might cause unexpected behavior,
including rejection by the verifier.
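To make that concern concrete, here is a made-up sketch (the function and
buffer names are invented, this is not from any real program) of the kind of
pattern that leans on barrier_var() with a u32 today:

  /* made-up sketch: barrier_var() keeps the compiler from folding or
   * reordering the bounds check, so the value the verifier sees being
   * checked is the one actually used as the index */
  #define barrier_var(var) asm volatile("" : "+r"(var))

  extern char buf[64];

  static inline int read_at(unsigned int off)
  {
          barrier_var(off);       /* today this operates on a full 64-bit reg */
          if (off > sizeof(buf) - sizeof(int))
                  return 0;
          return *(int *)(buf + off);
  }

If "+r" starts picking a 32-bit register on v3/v4, the instruction mix and
the registers the verifier tracks in such code change, and it's hard to
guarantee that every existing program keeps loading.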
I think it makes sense to align the bpf backend with arm64 and x86,
but we need to broadcast this change widely.
Also need to align with GCC. (Jose cc-ed)

And, most importantly, we need a way to go back to the old behavior, since

  u32 var; asm("...":: "r"(var));

will now allocate a "w" register or warn.

What should be the workaround? I've tried:

  u32 var; asm("...":: "r"((u64)var));

https://godbolt.org/z/n4ejvWj7v

and x86/arm64 generate a 32-bit truncation.
Sounds like the bpf backend has to do it as well.
We should be doing 'wX=wX' in such a case (just like x86)
instead of <<=32 >>=32.
I think this should be done as a separate diff.
Our current pattern of using shifts is inefficient and guaranteed to screw up
verifier range analysis, while wX=wX is faster and more verifier-friendly.

Yes, it's still not going to be 1-1 with the old (our current) behavior.
We probably need some macro (like we did for __BPF_CPU_VERSION__) to identify
such a fixed llvm, so existing users with the '(short)' workaround and other
tricks can detect new vs old compiler.
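Roughly something like this on the program side, perhaps (the feature macro
and helper names below are made up purely to illustrate; the real macro, if
we add one, would be predefined by the compiler in the spirit of
__BPF_CPU_VERSION__):

  #ifdef __BPF_FEATURE_ASM_R_NO_SHIFTS           /* made-up macro name */
  /* fixed compiler: a u32 passed to "r" no longer emits <<=32 >>=32 */
  #define ASM_ARG_U32(x) (x)
  #else
  /* old compiler: keep the (unsigned short) trick -- no actual truncation
   * is emitted, but the redundant shifts go away */
  #define ASM_ARG_U32(x) ((unsigned short)(x))
  #endif

  static inline void touch(unsigned int v)
  {
          asm volatile ("%[reg] += 1" :: [reg]"r"(ASM_ARG_U32(v)));
  }

With something like that, programs could drop the cast on a fixed compiler
and keep the old trick everywhere else.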
Looks like we opened a can of worms.
Aligning with x86/arm64 makes sense,
but the cost of doing the right thing is hard to estimate.

>
> Third, why two shifts are generated?
> ------------------------------------
>
> (Details here might be interesting to Yonghong, regular reader could
>  skip this section).
>
> The two shifts are result of interaction between two IR constructs
> `trunc` and `asm`. The C program above looks as follows in LLVM IR
> before machine code generation:
>
>     declare dso_local i64 @bar()
>     define dso_local void @foo(i32 %p) {
>     entry:
>       %call = call i64 @bar()
>       %v32 = trunc i64 %call to i32
>       tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>       ret void
>     }
>
> Initial selection DAG:
>
>     $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>     SelectionDAG has 21 nodes:
>       ...
>         t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
> !         t11: i32 = truncate t10
> !       t15: i64 = zero_extend t11
>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>       ...
>
> Note values t11 and t15 marked with (!).
>
> Optimized lowered selection DAG for this fragment:
>
>         t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
> !       t22: i64 = and t10, Constant:i64<4294967295>
>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>
> DAG after instruction selection:
>
>         t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
> !         t25: i64 = SLL_ri t10, TargetConstant:i64<32>
> !       t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>       t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
> This happens because of the following pattern from BPFInstrInfo.td:
>
>     // 0xffffFFFF doesn't fit into simm32, optimize common case
>     def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>               (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>
> So, the two shift instructions are result of translation of (zext (trunc ...)).
> However, closer examination shows that zext DAG node was generated
> almost by accident. Here is the backtrace for when this node was created:
>
>     Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>         at .../SelectionDAG.cpp:5605
>     (gdb) bt
>     #0  llvm::SelectionDAG::getNode (...)
>         at ...SelectionDAG.cpp:5605
>     #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>         at .../SelectionDAGBuilder.cpp:537
>     #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>         at .../SelectionDAGBuilder.cpp:958
>     #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>         at .../SelectionDAGBuilder.cpp:9640
>     ...
>
> The stack frame #2 is interesting, here is the code for it [4]:
>
>     void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>                                      const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>                                      const Value *V,
>                                      ISD::NodeType PreferredExtendType) const {
>                                                    ^
>                                                    '-- this is ANY_EXTEND
>       ...
>       for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>         ...
>              .-- this returns true
>              v
>         if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>           ExtendKind = ISD::ZERO_EXTEND;
>
>                           .-- this is ZERO_EXTEND
>                           v
>         getCopyToParts(..., ExtendKind);
>         Part += NumParts;
>       }
>       ...
>     }
>
> The getCopyToRegs() function was called with ANY_EXTEND preference,
> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
> true for any 32 to 64-bit conversion [5].
> However, in this case this is clearly a mistake, as zero extension of
> (zext i64 (truncate i32 ...)) costs two instructions.
>
> The isZExtFree() behavior could be changed to report false for such
> situations, as in my patch [6]. This in turn removes zext =>
> removes two shifts from final asm.
> Here is how DAG/asm look after patch [6]:
>
> Initial selection DAG:
>
>       ...
>         t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
> !         t11: i32 = truncate t10
>       t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>       t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>       ...
>
> Final asm:
>
>     ...
>     # %bb.0:
>             call bar
>     #APP
>             r0 += 1
>     #NO_APP
>             exit
>     ...
>
> Note that [6] is a very minor change, it does not affect code
> generation for selftests at all and I was unable to conjure examples
> where it has effect aside from inline asm parameters.
>
> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>