Re: [PATCH bpf-next] bpf: abstract loop unrolling pragmas in BPF selftests

Yonghong Song <yonghong.song@xxxxxxxxx> · Thu, 8 Feb 2024 11:44:57 -0800

On 2/8/24 11:03 AM, Jose E. Marchesi wrote:
On 2/8/24 10:04 AM, Yonghong Song wrote:
On 2/8/24 8:51 AM, Jose E. Marchesi wrote:
On Thu, 2024-02-08 at 16:35 +0100, Jose E. Marchesi wrote:
[...]

If the compiler generates assembly code the same code for
profile2.c for
before and after, that means that the loop does _not_ get
unrolled when
profiler.inc.h is built with -O2 but without #pragma unroll.

But what if #pragma unroll is used?  If it unrolls then, that
would mean
that the pragma does something more than -funroll-loops/-O2.

Sorry if I am not making sense.  Stuff like this confuses me to no end
;)
Sorry, I messed up while switching branches :(
Here are the correct stats:

| File            | insn # | insn # |
|                 | before |  after |
|-----------------+--------+--------|
| profiler1.bpf.o |  16716 |   4813 |
This means:

- With both `#pragma unroll' and -O2 we get 16716 instructions.
- Without `#pragma unroll' and with -O2 we get 4813 instructions.

Weird.
Thanks for the analysis. I can reproduce with vs. without '#pragma
unroll' at -O2
level, the number of generated insns is indeed different, quite
dramatically
as the above numbers. I will do some checking in compiler.
Okay, a quick checking compiler found that
   - with "#pragma unroll" means no profitability test and do full
    unroll as instructed

I don't think clang's `#pragma unroll' does full unroll.

On one side, AFAIK `pragma unroll' is supposed to be equivalent to
`pragma clang loop(enable)', which is different to `pragma clang loop
unroll(full)'.

On the other, if you replace `pragma unroll' with `pragma clang loop
unroll(full)' in the BPF selftests you will get branch instruction
overflows.

You are correct. I did series of examples, and find with "#pragma unroll",
clang may do:
  - full unroll (smaller body, less trip count), or
    Loop Unroll: F[kprobe__proc_sys_write] Loop %for.body.i
      Loop Size = 40
      Exiting block %if.then23.i: TripCount=0, TripMultiple=1, BreakoutTrip=1
      Exiting block %for.inc.i: TripCount=10, TripMultiple=0, BreakoutTrip=0
    COMPLETELY UNROLLING loop %for.body.i with trip count 10!
  - partial unroll (for a loop with trip count 10000000), or
    Loop Unroll: F[foo] Loop %for.body
      Loop Size = 16
      partially unrolling with count: 1000
      Exiting block %for.body: TripCount=10000000, TripMultiple=0, BreakoutTrip=0
    UNROLLING loop %for.body by 1000!
  - no unrolling (for a loop with huge body) and issue warning like
    t.c:2:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation
    might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

With '#pragma unroll', the compiler will do (more strict?) profitability analysis and looks like
by default will not do partial inlining:

  Loop Unroll: F[kprobe__proc_sys_write] Loop %for.body.i
    Loop Size = 40
  Starting LoopUnroll profitability analysis...
   Analyzing iteration 0
  Adding cost of instruction (iteration 0):   call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %__r15.i) #7, !dbg !30496
  Adding cost of instruction (iteration 0):   call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %__t16.i) #7, !dbg !30497
  Adding cost of instruction (iteration 0):   %67 = call ptr @llvm.bpf.passthrough.p0.p0(i32 38, ptr %66)
  Adding cost of instruction (iteration 0):   %66 = getelementptr i8, ptr %17, i64 %65
  Adding cost of instruction (iteration 0):   %65 = load i64, ptr @"llvm.task_struct:0:4656$0:171", align 8
  Can't analyze cost of loop with call
    will not try to unroll partially because -unroll-allow-partial not given

What criteria `pragma unroll' in clang uses in order to determine how
much it unrolls the loop, compared to -O2|-funroll-loops, I don't know.

There are some heuristics in the compiler. I do not know exact algorithm
either.

This makes me wonder, asking from ignorance: what is the benefit/point
for BPF programs to partially unroll a loop?  I would have said either
we unroll them completely in order to avoid verification problems, or we
don't unroll them because the verifier is supposed to handle it the way
it is written...

In early days, partial unrolling probably for better performance but
complete unrolling may exceed insn limit 4K. Later on the insn limit
is increased....

Anyway, I think you patch looks good based on current discussion.
I will ack it.

   - without "#pragma unroll" mean compiler will do profitability for full unroll,
     if compiler thinks full unroll is not profitable, there will be no unrolling.

So for gcc, even users saying '#pragma unroll', gcc still do
profitability test?
GCC doesn't support `#pragma unroll'.

Hence in my original patch the macro __pragma_unroll expands to nothing
with GCC.  That will lead to the compiler perhaps not unrolling the loop
even with -O2|-funroll-loops.

| profiler2.bpf.o |   2088 |   2050 |
- Without `#pragma unroll' and with -O2 we get 2088 instructions.
- With `#pragma loop unroll(disable)' and with -O2 we get 2050
    instructions.

Also surprising.

| profiler3.bpf.o |   4465 |   1690 |