On Tue, Mar 04, 2025 at 08:52:52PM +0000, David Laight wrote:
> On Tue, 4 Mar 2025 04:32:23 +0000
> David Laight <david.laight.linux@xxxxxxxxx> wrote:
> 
> ....
> 
> > > For reference, GCC does much better with code gen, but only with the builtin:
> > > 
> > > .L39:
> > > 	crc32q	(%rax), %rbx	# MEM[(long unsigned int *)p_40], tmp120
> > > 	addq	$8, %rax	#, p
> > > 	cmpq	%rcx, %rax	# _37, p
> > > 	jne	.L39	#,
> > 
> > That looks reasonable; if Clang's 8-way unrolled crc32q is faster per byte
> > then you either need to unroll once (no point doing any more) or use
> > the loop that does negative offsets from the end.
> 
> Thinking about it while properly awake, the 1% difference isn't going to be
> the difference between the above and Clang's unrolled loop.
> Clang's loop will do 8 bytes every three clocks; if the above is slower
> it'll be doing 8 bytes in 4 clocks (ok, you can get 3.5 - but unlikely),
> which would be either 25% or 33% depending which way you measure it.
> 
> ...
> 
> I'll find the code loop I use - machine isn't powered on at the moment.
> 
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <linux/perf_event.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> 
> static int pmc_id;
> 
> static void init_pmc(void)
> {
> 	static struct perf_event_attr perf_attr = {
> 		.type = PERF_TYPE_HARDWARE,
> 		.config = PERF_COUNT_HW_CPU_CYCLES,
> 		.pinned = 1,
> 	};
> 	struct perf_event_mmap_page *pc;
> 	int perf_fd;
> 
> 	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
> 	if (perf_fd < 0) {
> 		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
> 		exit(1);
> 	}
> 	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
> 	if (pc == MAP_FAILED) {
> 		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
> 		exit(1);
> 	}
> 	pmc_id = pc->index - 1;
> }
> 
> static inline unsigned int rdpmc(unsigned int id)
> {
> 	unsigned int low, high;
> 
> 	// You need something to force the instruction pipeline to finish.
> 	// lfence might be enough.
> #ifndef NOFENCE
> 	asm volatile("mfence");
> #endif
> 	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
> #ifndef NOFENCE
> 	asm volatile("mfence");
> #endif
> 
> 	// Return the low bits; the counter might be 32 or 40 bits wide.
> 	return low;
> }
> 
> The test code is then something like:
> 
> 	#define PASSES 10
> 	unsigned int ticks[PASSES];
> 	unsigned int tick;
> 	unsigned int i;
> 
> 	for (i = 0; i < PASSES; i++) {
> 		tick = rdpmc(pmc_id);
> 		test_fn(buf, len);
> 		ticks[i] = rdpmc(pmc_id) - tick;
> 	}
> 
> 	for (i = 0; i < PASSES; i++)
> 		printf(" %5d", ticks[i]);
> 
> Make sure the data is in the L1 cache (or that it dominates).
> The values output for passes 2-10 are likely to be the same to within
> a clock or two.
> I probably tried to subtract an offset for an empty test_fn().
> But you can easily work out the 'clocks per loop iteration' (which is
> what you are trying to measure) by measuring two separate loop lengths.
> 
> I did find that sometimes running the program gave slow results,
> but it is usually very consistent.
> It needs to be run as root.
> Clearly a hardware interrupt will generate a very big number,
> but they don't happen.
> 
> The copy I found was used for measuring ip checksum algorithms.
> Seems to output:
> 
> $ sudo ./ipcsum
>                     0    0   160  160  160  160  160  160  160  160  160  160  overhead
> 3637b4f0b942c3c4 682f   316   25   26   26   26   26   26   26   26   26  csum_partial
> 3637b4f0b942c3c4 682f   124   79   43   25   25   25   24   26   25   24  csum_partial_1
> 3637b4f0b942c3c4 682f   166   43   25   25   24   24   24   24   24   24  csum_new adc pair
> 3637b4f0b942c3c4 682f   115   21   21   21   21   21   21   21   21   21  adc_dec_2
> 3637b4f0b942c3c4 682f    97   34   31   23   24   24   24   24   24   23  adc_dec_4
> 3637b4f0b942c3c4 682f    39   33   34   21   21   21   21   21   21   21  adc_dec_8
> 3637b4f0b942c3c4 682f    81   52   49   52   49   26   25   27   25   26  adc_jcxz_2
> 3637b4f0b942c3c4 682f    62   46   24   24   24   24   24   24   24   24  adc_jcxz_4
> 3637b4f0b942c3c4 682f   224   40   21   21   23   23   23   23   23   23  adc_2_pair
> 3637b4f0b942c3c4 682f    42   36   37   22   22   22   22   22   22   22  adc_4_pair_old
> 3637b4f0b942c3c4 682f    42   37   34   41   23   23   23   23   23   23  adc_4_pair
> 3637b4f0b942c3c4 682f   122   19   20   19   18   19   18   19   18   19  adcx_adox
> bef7a78a9        682f   104   51   30   30   30   30   30   30   30   30  add_c_16
> bef7a78a9        682f   143   50   50   27   27   27   27   27   27   27  add_c_32
> 6ef7a78ae        682f   103   91   45   34   34   34   35   34   34   34  add_c_high
> 
> I don't think the current one is in there - IIRC it is as fast as the
> adcx_adox one but more portable.

I guess this thread has turned into one where everyone has to weigh in :-)

Just to summarize my thoughts on the whole thread:

- IMO we should not use the crc32 intrinsics yet, as there are too many
  issues: no stability guarantee for the builtins (or else having to figure
  out how to include immintrin.h in the kernel to get the stable functions),
  having to set the crc32 target with the correct scope, dealing with old
  compiler versions that don't support crc32, and unhelpful loop unrolling.
  (A rough sketch of what the builtin route entails is at the end of this
  mail.)

- https://lore.kernel.org/r/20250210210741.471725-1-ebiggers@xxxxxxxxxx
  already fixed the spilling to the stack with clang. It does result in a
  separate mov from memory instead of taking advantage of the memory operand
  support, but that should not make much of a difference. (The second sketch
  below shows the two inline asm forms.)

- crc_kunit already includes a benchmark. I recommend using that for
  benchmarking the kernel's CRC code. Sure, one can do a more precise
  analysis with performance counters, but IMO it's generally unnecessary.

- The 4-2-1 step-down is a good idea, and in fact crc32c-3way.S (which
  handles lengths >= 512 bytes) already does exactly that for tail handling.
  I sent out
  https://lore.kernel.org/r/20250304213216.108925-1-ebiggers@xxxxxxxxxx
  which adds it to the C code (which handles lengths < 512 bytes) too.
  (The last sketch below shows the general shape of that step-down.)

- Moving all this to assembly is still attractive, especially considering
  that lengths >= 512 bytes are already handled in assembly in crc32c-3way.S,
  and essentially the exact code we want is already in that file (it's used
  to handle anything left over from the 3-way processing). But I think we'll
  keep the C (with inline asm for just the crc32 instructions) version too
  for now. It's a bit more approachable, and it's nice to avoid an extra
  function call to a .S file.
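
To make the first point concrete, this is roughly what the builtin route
would look like. It is only an illustration of the issues listed there
(target attribute scope, builtin availability, compiler-controlled
unrolling), not code from any of the patches; the function name and the
userspace-style headers are mine:

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Hypothetical example only.  __builtin_ia32_crc32di has no
	 * stability guarantee; the "stable" spelling is _mm_crc32_u64 from
	 * <immintrin.h>, which the kernel can't easily include.  The target
	 * attribute has to cover every function that uses the builtin, and
	 * compilers too old to know the builtin still need a fallback.
	 */
	__attribute__((target("sse4.2")))
	static uint32_t crc32c_builtin(uint32_t crc, const void *buf, size_t len)
	{
		const uint64_t *p = buf;	/* assume 8-byte alignment for brevity */
		uint64_t c = crc;

		/* The compiler decides how (or whether) to unroll this loop. */
		for (; len >= 8; len -= 8)
			c = __builtin_ia32_crc32di(c, *p++);
		/* the sub-8-byte tail is omitted here */
		return (uint32_t)c;
	}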
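
For the second point, the difference being discussed is roughly the one
between these two inline asm forms (again just a sketch with made-up
wrapper names, not the actual patch):

	#include <stdint.h>

	/* Register-only source: the compiler emits a separate mov to load it. */
	static inline uint64_t crc32c_u64_reg(uint64_t crc, uint64_t data)
	{
		asm("crc32q %1, %0" : "+r" (crc) : "r" (data));
		return crc;
	}

	/* "rm" lets crc32q take its source operand straight from memory. */
	static inline uint64_t crc32c_u64_rm(uint64_t crc, uint64_t data)
	{
		asm("crc32q %1, %0" : "+r" (crc) : "rm" (data));
		return crc;
	}

The "rm" form avoids the separate load, but it is the constraint that was
provoking the spills to the stack with clang, which is why the extra mov
is the lesser evil here.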
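
And for the 4-2-1 step-down, the tail handling has this general shape (a
minimal sketch with invented names, not the code in crc32c-3way.S or in
the patch):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* Fold the last 0-7 bytes into the CRC with at most three crc32 ops. */
	static inline uint32_t crc32c_tail(uint32_t crc, const uint8_t *p, size_t len)
	{
		if (len & 4) {
			uint32_t v;

			memcpy(&v, p, 4);
			asm("crc32l %1, %0" : "+r" (crc) : "rm" (v));
			p += 4;
		}
		if (len & 2) {
			uint16_t v;

			memcpy(&v, p, 2);
			asm("crc32w %1, %0" : "+r" (crc) : "rm" (v));
			p += 2;
		}
		if (len & 1)
			asm("crc32b %1, %0" : "+r" (crc) : "rm" (*p));
		return crc;
	}

One 4-byte, one 2-byte, and one 1-byte step at most, instead of a
byte-at-a-time loop over the remainder.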