Hi James,
On 03.10.2017 08:38, Marcin Nowakowski wrote:
The need for 64-bit signed length is unfortunate. Do you get decent
assembly and comparable/better performance on 32-bit if you just use len
and only decrement it in the loops? i.e.
-	while ((length -= sizeof(uXX)) >= 0) {
+	while (len >= sizeof(uXX)) {
 		register uXX value = get_unaligned_leXX(p);
 		CRC32(crc, value, XX);
 		p += sizeof(uXX);
+		len -= sizeof(uXX);
 	}
That would be more readable too IMHO.
or maybe just do some pointer arithmetic like
	const u8 *end = p + len;
	while ((end - p) >= sizeof(uXX)) {
		register uXX value = get_unaligned_leXX(p);
		CRC32(crc, value, XX);
		p += sizeof(uXX);
	}
Thank you both for these suggestions. All solutions are very similar in
terms of the assembly produced, although the original code is the
smallest of all:
original vs James':
function                 old     new   delta
crc32_mips_le_hw         104     132     +28
vermagic                  72      78      +6
chksumc_finup             40      44      +4
chksumc_digest            44      48      +4
chksum_finup              92      96      +4
chksum_digest            100     104      +4
original vs Jonas':
add/remove: 0/0 grow/shrink: 7/0 up/down: 90/0 (90)
function                 old     new   delta
crc32_mips_le_hw         104     148     +44
vermagic                  72      78      +6
chksumc_finup             40      44      +4
chksumc_digest            44      48      +4
chksum_finup              92      96      +4
chksum_digest            100     104      +4
However, the key part, the processing loop, is 6 instructions long in
all variants. Only the pre- and post-loop processing adds the extra
instructions, so all these solutions should perform roughly the same.
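For anyone who wants to convince themselves that the three loop forms
consume exactly the same words, here is a minimal userspace sketch. It is
not the patch itself: crc_step_u8() is a hypothetical software stand-in
for the hardware CRC32() macro, the word size is fixed at 4 bytes instead
of uXX, and all function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software stand-in for one hardware CRC step
 * (reflected polynomial 0xEDB88320); not the MIPS CRC32() macro. */
static uint32_t crc_step_u8(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	return crc;
}

/* Fold one 4-byte word, lowest address first. */
static uint32_t crc_step_u32(uint32_t crc, const uint8_t *p)
{
	for (size_t i = 0; i < sizeof(uint32_t); i++)
		crc = crc_step_u8(crc, p[i]);
	return crc;
}

/* Fold the bytes left over after the word loop. */
static uint32_t crc_tail(uint32_t crc, const uint8_t *p, const uint8_t *end)
{
	while (p < end)
		crc = crc_step_u8(crc, *p++);
	return crc;
}

/* Original form: signed 64-bit counter decremented in the condition. */
static uint32_t crc_signed_length(uint32_t crc, const uint8_t *p, unsigned int len)
{
	const uint8_t *end = p + len;
	int64_t length = len;

	while ((length -= (int64_t)sizeof(uint32_t)) >= 0) {
		crc = crc_step_u32(crc, p);
		p += sizeof(uint32_t);
	}
	return crc_tail(crc, p, end);
}

/* James' form: compare the unsigned len, decrement it in the body. */
static uint32_t crc_unsigned_len(uint32_t crc, const uint8_t *p, unsigned int len)
{
	const uint8_t *end = p + len;

	while (len >= sizeof(uint32_t)) {
		crc = crc_step_u32(crc, p);
		p += sizeof(uint32_t);
		len -= sizeof(uint32_t);
	}
	return crc_tail(crc, p, end);
}

/* Jonas' form: pointer arithmetic against an end pointer. */
static uint32_t crc_end_pointer(uint32_t crc, const uint8_t *p, unsigned int len)
{
	const uint8_t *end = p + len;

	while ((size_t)(end - p) >= sizeof(uint32_t)) {
		crc = crc_step_u32(crc, p);
		p += sizeof(uint32_t);
	}
	return crc_tail(crc, p, end);
}

/* Returns 1 if all three variants agree for every length 0..max_len. */
static int variants_agree(const uint8_t *buf, unsigned int max_len)
{
	for (unsigned int len = 0; len <= max_len; len++) {
		uint32_t a = crc_signed_length(~0u, buf, len);
		uint32_t b = crc_unsigned_len(~0u, buf, len);
		uint32_t c = crc_end_pointer(~0u, buf, len);

		if (a != b || b != c)
			return 0;
	}
	return 1;
}
```

Calling variants_agree() on any buffer checks every length from 0 up to
max_len, which covers all the tail sizes. It says nothing about code size
or speed; those still have to be read from the generated assembly.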
I find James' code a bit more readable so I'll go with it and post an
updated patch.
The comparisons above were for 64-bit, where the difference is
negligible. On 32-bit builds, however, the difference is more significant:
original vs James':
function                 old     new   delta
vermagic                  80      86      +6
crc32c_mips_le_hw        144     104     -40
crc32_mips_le_hw         144     104     -40
The main CRC loop is also down from 9 to 5 instructions, so it's a
significant reduction of the loop size.
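My reading of why 32-bit benefits most, as a sketch rather than a
measurement: the original loop needs a counter that is both signed (to
test >= 0) and wide enough to hold any unsigned int len, which means 64
bits, and a 32-bit CPU has to do that arithmetic on a register pair. The
helper names below are hypothetical and the word size is fixed at 8 bytes.

```c
#include <stdint.h>

/* Original-style counting: the counter must be signed and wider
 * than unsigned int, so it is 64-bit even on a 32-bit build. */
static unsigned int words_signed64(unsigned int len)
{
	int64_t length = len;
	unsigned int n = 0;

	while ((length -= 8) >= 0)
		n++;
	return n;
}

/* James-style counting: plain unsigned compare and decrement. */
static unsigned int words_unsigned(unsigned int len)
{
	unsigned int n = 0;

	while (len >= 8) {
		len -= 8;
		n++;
	}
	return n;
}

/* Shows why "decrement, then test" cannot use an unsigned counter:
 * returns 1 when len - 8 wraps around instead of going negative. */
static int unsigned_decrement_wraps(unsigned int len)
{
	unsigned int after = len - 8u;

	return after > len;
}
```

Both counting helpers visit the same number of words; only the second one
stays within native 32-bit arithmetic.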
Marcin