On Tue, Aug 18, 2020 at 9:58 PM Nick Desaulniers <ndesaulniers@xxxxxxxxxx> wrote:
On Tue, Aug 18, 2020 at 12:25 PM Nick Desaulniers <ndesaulniers@xxxxxxxxxx> wrote:
>
> On Tue, Aug 18, 2020 at 12:19 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > And honestly, a compiler that uses 'bcmp' is just broken. WTH? It's
> > the year 2020, we don't use bcmp. It's that simple. Fix your damn
> > broken compiler and use memcmp. The argument that memcmp is more
> > expensive than bcmp is garbage legacy thinking from four decades ago.
> >
> > It's likely the other way around, where people have actually spent
> > time on memcmp, but not on bcmp.
> >
> > Linus
>
> You'll have to ask Clement about that. I'm not sure I ever saw the
> "faster bcmp than memcmp" implementation, but I was told "it exists"
> when I asked for a revert when all of our kernel builds went red.

It *is* possible to make bcmp much faster than memcmp. We have one such
implementation internally (it's scheduled to be released as part of
llvm-libc some time this year), but most libc implementations just alias
it to memcmp.

Below is a graph showing the impact of releasing this compiler
optimization with our optimized bcmp on the Google fleet (the cumulative
memcmp+bcmp usage of all programs running in Google datacenters,
including the kernel). Scale and dates have been redacted for obvious
reasons, but note that the graph starts at y=0, so the values can be
compared relative to each other. Note how, as memcmp is progressively
replaced by bcmp (more and more programs being recompiled with the
compiler patch), the cumulative usage of memory comparison drops
significantly.

https://drive.google.com/file/d/1p8z1ilw2xaAJEnx_5eu-vflp3tEOv0qY/view?usp=sharing

The reasons why bcmp can be faster are:
- Typical libc implementations use the hardware to its full capacity,
  e.g. for bcmp we can use vector loads and compares, which can process
  up to 64 bytes (avx512) in one instruction.
  It's harder to implement memcmp with these on little-endian
  architectures, as there is no vector bswap. Because the kernel only
  uses GPRs, I can see how that might not perfectly fit the kernel use
  case. But the kernel really is a special case: the compiler is
  written for most programs, not specifically for the kernel, and most
  programs should benefit from this optimization.
- bcmp() does not have to look at the bytes in order, e.g. it can look
  at the first and last bytes first. This is useful when comparing
  buffers that have common prefixes (as happens in mostly sorted
  containers, and we have data showing that this is quite a common
  case).

> Also, to Clement's credit, every patch I've ever seen from Clement is
> backed up by data; typically fleetwide profiles at Google. "we spend
> a lot of time in memcmp, particularly comparing the result against
> zero and no other value; hmm... how do we spend less time in
> memcmp... oh, well there's another library function with slightly
> different semantics we can call instead." I don't think anyone would
> consider the optimization batshit crazy given the number of cycles
> saved across the fleet. That an embedded project didn't provide an
> implementation is a footnote that can be fixed in the embedded
> project, either by using -ffreestanding or -fno-builtin-bcmp, which
> is what this series proposes to do.
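To make the out-of-order point concrete, here is a minimal portable
sketch (this is NOT the internal/llvm-libc implementation; the name
`bcmp_sketch` and the exact probing strategy are illustrative
assumptions): because bcmp only reports equality, it is free to probe
the first and last words before scanning the middle, which rejects
buffers with long common prefixes early.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of an out-of-order bcmp: returns 0 iff the two
 * buffers are equal, nonzero otherwise. Since only equality matters,
 * bytes may be examined in any order: probing both ends first quickly
 * rejects buffers that share a long common prefix (common in mostly
 * sorted containers). */
static int bcmp_sketch(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;
    if (n >= 16) {
        uint64_t wa, wb, xa, xb;
        /* Probe the first and last 8 bytes first; order of probes is
         * irrelevant for bcmp, unlike for ordered memcmp. */
        memcpy(&wa, p, 8);
        memcpy(&wb, q, 8);
        memcpy(&xa, p + n - 8, 8);
        memcpy(&xb, q + n - 8, 8);
        if ((wa ^ wb) | (xa ^ xb))
            return 1;
        p += 8;
        q += 8;
        n -= 16;
    }
    /* Remaining middle bytes: any order or chunking would do. */
    while (n--)
        if (*p++ != *q++)
            return 1;
    return 0;
}
```

A real implementation would use wider (vector) probes and chunked
middle comparisons, but the freedom to probe out of order is the same.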
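The "no vector bswap" point can be illustrated even with a plain 8-byte
GPR load (a hedged sketch; `bcmp_word` and `memcmp_word_le` are made-up
names, and `__builtin_bswap64` is a GCC/Clang builtin): equality needs
only an integer compare, while recovering memcmp's lexicographic order
on a little-endian machine requires reversing the bytes first, because
the first byte in memory lands in the integer's low-order position.

```c
#include <stdint.h>
#include <string.h>

/* Equality over one 8-byte word: a plain integer compare suffices,
 * no byte order concerns. */
static int bcmp_word(const void *a, const void *b)
{
    uint64_t wa, wb;
    memcpy(&wa, a, 8);
    memcpy(&wb, b, 8);
    return wa != wb;    /* nonzero iff the words differ */
}

/* Ordered compare over one 8-byte word on little-endian: the words
 * must be byte-swapped so the first in-memory byte becomes the most
 * significant, restoring lexicographic order for the integer compare.
 * Scalar bswap exists; an equivalent vector-width bswap does not,
 * which is what makes wide-load memcmp harder than wide-load bcmp. */
static int memcmp_word_le(const void *a, const void *b)
{
    uint64_t wa, wb;
    memcpy(&wa, a, 8);
    memcpy(&wb, b, 8);
    wa = __builtin_bswap64(wa);
    wb = __builtin_bswap64(wb);
    return (wa > wb) - (wa < wb);    /* -1, 0, or 1, like memcmp's sign */
}
```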