On Tue, 4 Mar 2025 04:32:23 +0000 David Laight <david.laight.linux@xxxxxxxxx> wrote:

....
> > For reference, GCC does much better with code gen, but only with the builtin:
> >
> > .L39:
> >         crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
> >         addq    $8, %rax        #, p
> >         cmpq    %rcx, %rax      # _37, p
> >         jne     .L39            #,
>
> That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
> then you either need to unroll once (no point doing any more) or use
> the loop that does negative offsets from the end.

Thinking about it while properly awake: the 1% difference isn't going to come from the choice between the above and Clang's unrolled loop. Clang's loop will do 8 bytes every three clocks; if the above is slower it'll be doing 8 bytes in 4 clocks (OK, you can get 3.5, but that's unlikely), which would be either a 25% or a 33% difference depending on which way you measure it.

...
> I'll find the code loop I use - machine isn't powered on at the moment.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int pmc_id;

static void init_pmc(void)
{
        static struct perf_event_attr perf_attr = {
                .type = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .pinned = 1,
        };
        struct perf_event_mmap_page *pc;
        int perf_fd;

        perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
        if (perf_fd < 0) {
                fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
                exit(1);
        }
        pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
        if (pc == MAP_FAILED) {
                fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
                exit(1);
        }
        pmc_id = pc->index - 1;
}

static inline unsigned int rdpmc(unsigned int id)
{
        unsigned int low, high;

        // You need something to force the instruction pipeline to finish.
        // lfence might be enough.
#ifndef NOFENCE
        asm volatile("mfence");
#endif
        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
        asm volatile("mfence");
#endif
        // Return the low bits; the counter might be 32 or 40 bits wide.
        return low;
}

The test code is then something like:

        #define PASSES 10
        unsigned int ticks[PASSES];
        unsigned int tick;
        unsigned int i;

        for (i = 0; i < PASSES; i++) {
                tick = rdpmc(pmc_id);
                test_fn(buf, len);
                ticks[i] = rdpmc(pmc_id) - tick;
        }
        for (i = 0; i < PASSES; i++)
                printf(" %5d", ticks[i]);

Make sure the data is in the L1 cache, otherwise the cache misses dominate. The values output for passes 2-10 are likely to be the same to within a clock or two.

I probably tried to subtract an offset for an empty test_fn(). But you can easily work out the 'clocks per loop iteration' (which is what you are trying to measure) by measuring two separate loop lengths.

I did find that sometimes running the program gave slow results, but it is usually very consistent. It needs to be run as root. Clearly a hardware interrupt will generate a very big number, but they don't happen.

The copy I found was used for measuring IP checksum algorithms. It seems to output:

$ sudo ./ipcsum
               0    0   160 160 160 160 160 160 160 160 160 160  overhead
3637b4f0b942c3c4 682f   316  25  26  26  26  26  26  26  26  26  csum_partial
3637b4f0b942c3c4 682f   124  79  43  25  25  25  24  26  25  24  csum_partial_1
3637b4f0b942c3c4 682f   166  43  25  25  24  24  24  24  24  24  csum_new adc pair
3637b4f0b942c3c4 682f   115  21  21  21  21  21  21  21  21  21  adc_dec_2
3637b4f0b942c3c4 682f    97  34  31  23  24  24  24  24  24  23  adc_dec_4
3637b4f0b942c3c4 682f    39  33  34  21  21  21  21  21  21  21  adc_dec_8
3637b4f0b942c3c4 682f    81  52  49  52  49  26  25  27  25  26  adc_jcxz_2
3637b4f0b942c3c4 682f    62  46  24  24  24  24  24  24  24  24  adc_jcxz_4
3637b4f0b942c3c4 682f   224  40  21  21  23  23  23  23  23  23  adc_2_pair
3637b4f0b942c3c4 682f    42  36  37  22  22  22  22  22  22  22  adc_4_pair_old
3637b4f0b942c3c4 682f    42  37  34  41  23  23  23  23  23  23  adc_4_pair
3637b4f0b942c3c4 682f   122  19  20  19  18  19  18  19  18  19  adcx_adox
       bef7a78a9 682f   104  51  30  30  30  30  30  30  30  30  add_c_16
       bef7a78a9 682f   143  50  50  27  27  27  27  27  27  27  add_c_32
       6ef7a78ae 682f   103  91  45  34  34  34  35  34  34  34  add_c_high

I don't think the current one is in there - IIRC it
is as fast as the adcx_adox one but more portable.

	David