>> On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
>>>
>>> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
>>> [...]
>>> > It turned out there are indeed a bunch of redundant shifts
>>> > when u32 or s32 is passed into the "r" asm constraint.
>>> >
>>> > Strangely, the shifts are there when compiled with -mcpu=v3 or v4,
>>> > and no shifts with -mcpu=v1 and v2.
>>> >
>>> > Also weird that u8 and u16 are passed into "r" without redundant shifts.
>>> > Hence I found a "workaround": cast u32 into u16 while passing.
>>> > The truncation of u32 doesn't happen, and the shifts that zero the
>>> > upper 32 bits are gone as well.
>>> >
>>> > https://godbolt.org/z/Kqszr6q3v
>>>
>>> Regarding the unnecessary shifts.
>>> Sorry, a long email about a minor feature/defect.
>>>
>>> So, currently the following C program
>>> (and its variations with implicit casts):
>>>
>>>   extern unsigned long bar(void);
>>>   void foo(void) {
>>>     asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>>>   }
>>>
>>> is translated to the following BPF:
>>>
>>>   $ clang -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>>>
>>>   <stdin>:  file format elf64-bpf
>>>
>>>   Disassembly of section .text:
>>>
>>>   0000000000000000 <foo>:
>>>          0: call -0x1
>>>          1: r0 <<= 0x20
>>>          2: r0 >>= 0x20
>>>          3: r0 += 0x1
>>>          4: exit
>>>
>>> Note: no additional shifts are generated when the "w" (32-bit register)
>>> constraint is used instead of "r".
>>>
>>> First, is this right or wrong?
>>> ------------------------------
>>>
>>> The C language spec [1], paragraph 6.5.4.6 (Cast operators -> Semantics),
>>> says the following:
>>>
>>>   If the value of the expression is represented with greater range or
>>>   precision than required by the type named by the cast (6.3.1.8),
>>>   then the cast specifies a conversion even if the type of the
>>>   expression is the same as the named type and removes any extra range
>>>                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>   and precision.
>>>   ^^^^^^^^^^^^^
>>>
>>> What do other LLVM backends do in such situations?
>>> Consider the following program translated to amd64 [2] and aarch64 [3]:
>>>
>>>   void foo(void) {
>>>     asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>>>     asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>>>     asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>>>   }
>>>
>>> - for amd64 a register of the proper size is selected for `reg`;
>>> - for aarch64 warnings about wrong operand size are emitted at (2) and (3),
>>>   and a 64-bit register is used w/o generating any additional instructions.
>>>
>>> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>>> registers at all three points.)
>>>
>>> So, it looks like something of this sort should be done:
>>> - either the extra precision should be removed via additional instructions;
>>> - or a 32-bit register should be picked for `reg`;
>>> - or a warning should be emitted, as in the aarch64 case.
>>>
>>> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
>>> [2] https://godbolt.org/z/9nKxaMc5j
>>> [3] https://godbolt.org/z/1zxEr5b3f
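
[As an illustration of the "w" alternative Eduard mentions above: this is
how I would expect the same program to look with the "w" constraint under
-mcpu=v3. The disassembly is sketched from the description in this thread,
not copied from an actual run.]

  extern unsigned long bar(void);
  void foo(void) {
    /* "w" asks for a 32-bit sub-register (alu32, cpu v3/v4),
       so no truncating shifts are needed. */
    asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
  }

  0000000000000000 <foo>:
         0: call -0x1
         1: w0 += 0x1
         2: exit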
>>>
>>>
>>> Second, what to do?
>>> -------------------
>>>
>>> I think the following steps are needed:
>>> - The investigation described in the next section shows that currently
>>>   the two shifts are generated accidentally, w/o real intent to shed
>>>   precision. I have a patch [6] that removes the shift generation; it
>>>   should be applied.
>>> - When a 32-bit value is passed to the "r" constraint:
>>>   - for cpu v3/v4 a 32-bit register should be selected;
>>>   - for cpu v1/v2 a warning should be reported.
>>
>> Thank you for the detailed analysis.
>>
>> Agree that the llvm fix [6] is a necessary step, and then
>> using 'w' on v3/v4 and warning on v1/v2 makes sense too,
>> but we have this macro:
>>
>>   #define barrier_var(var) asm volatile("" : "+r"(var))
>>
>> that is used in various bpf production programs.
>> tetragon is also a heavy user of inline asm.
>>
>> Right now a full 64-bit register is allocated,
>> so switching to 'w' might cause unexpected behavior,
>> including rejection by the verifier.
>> I think it makes sense to align the bpf backend with arm64 and x86,
>> but we need to broadcast this change widely.
>>
>> We also need to align with GCC. (Jose cc-ed)
>
> GCC doesn't have an integrated assembler, so using -masm=pseudoc it just
> compiles the program above to:
>
>   foo:
>     call bar
>     r0 += 1
>     exit
>
> Also, at the moment we don't support a "w" constraint, because the
> assembly-like assembly syntax we started with implies different
> instructions that interpret the values stored in the 64-bit BPF
> registers as 32-bit or 64-bit values, i.e.
>
>   mov %r1, 1
>   mov32 %r1, 1
>
> But then the pseudo-c assembly syntax (that we also support) translates
> some of the semantics of the instructions to the register names,
> creating the notion that BPF actually has both 32-bit registers and
> 64-bit registers, i.e.
>
>   r1 += 1
>   w1 += 1
>
> In GCC we support both assembly syntaxes, and currently we lack the
> ability to emit 32-bit variants in templates like "%[reg] += 1", so I
> suppose we can introduce a "w" constraint to:
>
> 1. When the assembly-like assembly syntax is used, expect a 32-bit
>    mode to match the operand and warn about operand size overflow
>    whenever necessary. Always emit "%r" as the register name.
>
> 2. When the pseudo-c assembly syntax is used, expect a 32-bit mode to
>    match the operand and warn about operand size overflow whenever
>    necessary, and then emit "w" instead of "r" as the register name.
>
>> And, most importantly, we need a way to go back to the old behavior,
>> since u32 var; asm("...":: "r"(var)); will now
>> allocate a "w" register or warn.
>
> Is it really necessary to change the meaning of "r"? You can write
> templates like the one triggering this problem like:
>
>   asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
>
> Then the checks above will be performed, driven by the particular
> constraint explicitly specified by the user, not by the type of the
> value passed as the operand.
>
> Or am I misunderstanding?

[I have just added a proposal for an agenda item to this week's BPF
Office Hours so we can discuss BPF sub-registers and compiler
constraints, to complement this thread.]

>
>> What should be the workaround?
>>
>> I've tried:
>>
>>   u32 var; asm("...":: "r"((u64)var));
>>
>> https://godbolt.org/z/n4ejvWj7v
>>
>> and x86/arm64 generate a 32-bit truncation.
>> Sounds like the bpf backend has to do it as well.
>> We should be doing 'wX = wX' in such a case (just like x86)
>> instead of <<= 32 >>= 32.
>>
>> I think this should be done as a separate diff.
>> Our current pattern of using shifts is inefficient and guaranteed
>> to screw up verifier range analysis, while wX = wX is faster
>> and more verifier friendly.
>> Yes, it's still not going to be 1-1 with the old (our current) behavior.
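
[For the 'wX = wX' idea: a minimal sketch of what this could look like in
BPFInstrInfo.td, replacing the shift-pair pattern quoted later in this
thread. It assumes the existing MOV_32_64 (zero-extending 32-to-64 move)
and would have to be limited to alu32-capable cpus; an illustration, not
a tested patch.]

  // Lower (and reg, 0xffffFFFF) as a 32-bit move ("w0 = w0"), which
  // zeroes the upper 32 bits in a single, verifier-friendly insn,
  // instead of the (SRL_ri (SLL_ri ...)) shift pair.
  def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
            (MOV_32_64 (EXTRACT_SUBREG (i64 GPR:$src), sub_32))>;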
>>
>> We probably need some macro (like we did for __BPF_CPU_VERSION__)
>> to identify such a fixed llvm, so existing users with the '(short)'
>> workaround and other tricks can detect new vs old compilers.
>>
>> Looks like we opened a can of worms.
>> Aligning with x86/arm64 makes sense, but the cost of doing
>> the right thing is hard to estimate.
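
[If such a detection macro materializes, program-side usage could look
like the sketch below; __BPF_FEATURE_ASM_R_FIXED is a name made up for
illustration, not an existing define.]

  /* Pass a u32 to an "r" asm operand without redundant shifts: a fixed
     compiler needs no trick, an old one gets the (unsigned short) cast
     workaround mentioned at the top of this thread. */
  #ifdef __BPF_FEATURE_ASM_R_FIXED   /* hypothetical feature macro */
  # define ASM_U32(v) ((unsigned)(v))
  #else
  # define ASM_U32(v) ((unsigned short)(v))
  #endif

  extern unsigned long bar(void);
  void foo(void) {
    asm volatile ("%[reg] += 1"::[reg]"r"(ASM_U32(bar())));
  }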
>>>
>>>
>>> Third, why two shifts are generated?
>>> ------------------------------------
>>>
>>> (Details here might be interesting to Yonghong; a regular reader could
>>> skip this section.)
>>>
>>> The two shifts are the result of an interaction between two IR
>>> constructs, `trunc` and `asm`. The C program above looks as follows in
>>> LLVM IR before machine code generation:
>>>
>>>   declare dso_local i64 @bar()
>>>   define dso_local void @foo(i32 %p) {
>>>   entry:
>>>     %call = call i64 @bar()
>>>     %v32 = trunc i64 %call to i32
>>>     tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>>>     ret void
>>>   }
>>>
>>> Initial selection DAG:
>>>
>>>   $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>>>   SelectionDAG has 21 nodes:
>>>     ...
>>>     t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>   ! t11: i32 = truncate t10
>>>   ! t15: i64 = zero_extend t11
>>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>>>     t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>            TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>     ...
>>>
>>> Note the values t11 and t15, marked with (!).
>>>
>>> Optimized lowered selection DAG for this fragment:
>>>
>>>     t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>   ! t22: i64 = and t10, Constant:i64<4294967295>
>>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>>     t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>            TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>
>>> Note: (zext (truncate ...)) was converted to (and ... 0xffff_ffff).
>>>
>>> DAG after instruction selection:
>>>
>>>     t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>>>   ! t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>>>   ! t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>>     t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>            TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>
>>> Note: (and ... 0xffff_ffff) was converted to (SRL_ri (SLL_ri ...)).
>>> This happens because of the following pattern from BPFInstrInfo.td:
>>>
>>>   // 0xffffFFFF doesn't fit into simm32, optimize common case
>>>   def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>>>             (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>>>
>>> So, the two shift instructions are the result of translating
>>> (zext (trunc ...)). However, closer examination shows that the zext
>>> DAG node was generated almost by accident. Here is the backtrace for
>>> when this node was created:
>>>
>>>   Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is the opcode for ZERO_EXTEND
>>>       at .../SelectionDAG.cpp:5605
>>>   (gdb) bt
>>>   #0  llvm::SelectionDAG::getNode (...)
>>>       at .../SelectionDAG.cpp:5605
>>>   #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>>>       at .../SelectionDAGBuilder.cpp:537
>>>   #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>>>       at .../SelectionDAGBuilder.cpp:958
>>>   #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>>>       at .../SelectionDAGBuilder.cpp:9640
>>>   ...
>>>
>>> The stack frame #2 is interesting; here is the code for it [4]:
>>>
>>>   void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>>>                                    const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>>>                                    const Value *V,
>>>                                    ISD::NodeType PreferredExtendType) const {
>>>                                                  ^
>>>                                                  '-- this is ANY_EXTEND
>>>     ...
>>>     for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>>>       ...
>>>                                            .-- this returns true
>>>                                            v
>>>       if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>>>         ExtendKind = ISD::ZERO_EXTEND;
>>>
>>>                           .-- this is ZERO_EXTEND
>>>                           v
>>>       getCopyToParts(..., ExtendKind);
>>>       Part += NumParts;
>>>     }
>>>     ...
>>>   }
>>>
>>> The getCopyToRegs() function was called with an ANY_EXTEND preference,
>>> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
>>> true for any 32 to 64-bit conversion [5].
>>> However, in this case that is clearly a mistake, as materializing
>>> (zext i64 (truncate i32 ...)) costs two instructions.
>>>
>>> The isZExtFree() behavior could be changed to report false in such
>>> situations, as in my patch [6]. This in turn removes the zext and,
>>> with it, the two shifts from the final asm.
>>> Here is how the DAG/asm look after patch [6]:
>>>
>>> Initial selection DAG:
>>>
>>>     ...
>>>     t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>   ! t11: i32 = truncate t10
>>>     t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>>>     t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>            TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>>>     ...
>>>
>>> Final asm:
>>>
>>>   ...
>>>   # %bb.0:
>>>     call bar
>>>     #APP
>>>     r0 += 1
>>>     #NO_APP
>>>     exit
>>>   ...
>>>
>>> Note that [6] is a very minor change: it does not affect code
>>> generation for the selftests at all, and I was unable to conjure
>>> examples where it has an effect aside from inline asm parameters.
>>>
>>> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
>>> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
>>> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
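
[To make the isZExtFree() point concrete: a minimal sketch of one way to
implement it. I have not checked the exact contents of [6], so treat this
as an illustration of the idea rather than the patch itself.]

  // BPFISelLowering.cpp (sketch): override the SDValue flavour of
  // isZExtFree() so that a value which is itself a truncate is not
  // reported as free to zero-extend -- materializing (zext (trunc x))
  // on BPF costs the two shift instructions discussed above.
  bool BPFTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
    if (Val.getOpcode() == ISD::TRUNCATE)
      return false;
    // Otherwise keep the type-based answer (a plain 32->64 zext is free).
    return TargetLowering::isZExtFree(Val, VT2);
  }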