On Tue, 4 Mar 2025 04:32:23 +0000 David Laight <david.laight.linux@xxxxxxxxx> wrote:

....
> > For reference, GCC does much better with code gen, but only with the builtin:
> >
> > .L39:
> >         crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
> >         addq    $8, %rax        #, p
> >         cmpq    %rcx, %rax      # _37, p
> >         jne     .L39            #,
>
> That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
> then you either need to unroll once (no point doing any more) or use
> the loop that does negative offsets from the end.

Thinking about it while properly awake: the 1% difference isn't going to come from the choice between the above and Clang's unrolled loop. Clang's loop will do 8 bytes every three clocks; if the above is slower it'll be doing 8 bytes in 4 clocks (OK, you can get 3.5, but that's unlikely), which would be either a 25% or a 33% difference depending on which way you measure it.

...
> I'll find the code loop I use - machine isn't powered on at the moment.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int pmc_id;

static void init_pmc(void)
{
        static struct perf_event_attr perf_attr = {
                .type = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .pinned = 1,
        };
        struct perf_event_mmap_page *pc;
        int perf_fd;

        perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
        if (perf_fd < 0) {
                fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
                exit(1);
        }
        pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
        if (pc == MAP_FAILED) {
                fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
                exit(1);
        }
        pmc_id = pc->index - 1;
}

static inline unsigned int rdpmc(unsigned int id)
{
        unsigned int low, high;

        // You need something to force the instruction pipeline to finish.
        // lfence might be enough.
#ifndef NOFENCE
        asm volatile("mfence");
#endif
        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
        asm volatile("mfence");
#endif
        // Return the low bits; the counter might be 32 or 40 bits wide.
        return low;
}

The test code is then something like:

        #define PASSES 10
        unsigned int ticks[PASSES];
        unsigned int tick;
        unsigned int i;

        for (i = 0; i < PASSES; i++) {
                tick = rdpmc(pmc_id);
                test_fn(buf, len);
                ticks[i] = rdpmc(pmc_id) - tick;
        }
        for (i = 0; i < PASSES; i++)
                printf(" %5d", ticks[i]);

Make sure the data is in the L1 cache, otherwise the cache misses dominate. The values output for passes 2-10 are likely to be the same to within a clock or two.

I probably tried to subtract an offset for an empty test_fn(). But you can easily work out the 'clocks per loop iteration' (which is what you are trying to measure) by measuring two separate loop lengths.

I did find that sometimes running the program gave slow results, but it is usually very consistent. It needs to be run as root. Clearly a hardware interrupt will generate a very big number, but they don't happen.

The copy I found was used for measuring IP checksum algorithms. It seems to output:

$ sudo ./ipcsum
               0    0   160 160 160 160 160 160 160 160 160 160  overhead
3637b4f0b942c3c4 682f   316  25  26  26  26  26  26  26  26  26  csum_partial
3637b4f0b942c3c4 682f   124  79  43  25  25  25  24  26  25  24  csum_partial_1
3637b4f0b942c3c4 682f   166  43  25  25  24  24  24  24  24  24  csum_new adc pair
3637b4f0b942c3c4 682f   115  21  21  21  21  21  21  21  21  21  adc_dec_2
3637b4f0b942c3c4 682f    97  34  31  23  24  24  24  24  24  23  adc_dec_4
3637b4f0b942c3c4 682f    39  33  34  21  21  21  21  21  21  21  adc_dec_8
3637b4f0b942c3c4 682f    81  52  49  52  49  26  25  27  25  26  adc_jcxz_2
3637b4f0b942c3c4 682f    62  46  24  24  24  24  24  24  24  24  adc_jcxz_4
3637b4f0b942c3c4 682f   224  40  21  21  23  23  23  23  23  23  adc_2_pair
3637b4f0b942c3c4 682f    42  36  37  22  22  22  22  22  22  22  adc_4_pair_old
3637b4f0b942c3c4 682f    42  37  34  41  23  23  23  23  23  23  adc_4_pair
3637b4f0b942c3c4 682f   122  19  20  19  18  19  18  19  18  19  adcx_adox
       bef7a78a9 682f   104  51  30  30  30  30  30  30  30  30  add_c_16
       bef7a78a9 682f   143  50  50  27  27  27  27  27  27  27  add_c_32
       6ef7a78ae 682f   103  91  45  34  34  34  35  34  34  34  add_c_high

I don't think the current one is in there - IIRC it
is as fast as the adcx_adox one but more portable.

	David