Re: [PATCH] Performance Improvement in CRC16 Calculations.

"Martin K. Petersen" <martin.petersen@xxxxxxxxxx> · Thu, 16 Aug 2018 23:20:05 -0400

> With regard to your comment about slice (table ?) size, that is
> partially addressed by a kernel build time option shown in the above
> patch. That could be taken a bit further with a sysfs knob (where ?)
> to reduce the effective table size from that which the kernel is built
> with. To increase the size of the table would imply fetching some more
> heap and having an algorithm that could generate the extra part of
> that table required.

I am not a big fan of punting the decision to whoever compiles the
kernel to pick a number between 1 and 11 ("this CRC calculation is one
louder"). I would prefer to find a reasonable compromise between
bandwidth and cache thrashing side effects instead of overwhelming
people with build time choices and runtime tunables.

Almost everyone is running either Tim's PCLMULQDQ version or using IP
checksum for DIX. The software T10 CRC table implementation is mainly
there as a reference. I don't know of any production environments using
the table-based T10 CRC.

I don't have a problem making the code genuinely useful so it can be
leveraged by processors without hardware CRC acceleration capability.
But there needs to be some solid data guiding this decision so I'm
looking forward to see what WDC has in store.

Our results definitely matched Christophe's in that larger slice-by-N
are not always a win. And "faster" isn't automatically "better" from an
application performance perspective. With the caveat that our
measurements were done about 10 years ago and I'm sure we've come a long
way with processors and caches since then. So the results should be
interesting...

-- 
Martin K. Petersen	Oracle Linux Engineering