On Tue, Jan 17, 2023 at 12:10 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> That said, clang still generates more register pressure than gcc,
> causing the function prologue and epilogue to be rather bigger
> (pushing and popping six registers, as opposed to gcc that only needs
> three)

.. and at least part of that is the same thing with the bad byte mask
generation (see that "clang *still* messes up" link for details).

Basically, the byte mask is computed by

        mask = bytemask_from_count(tcount);

where we have

        #define bytemask_from_count(cnt)       (~(~0ul << (cnt)*8))

and clang tries very very hard to avoid that "multiply by 8", so
instead it keeps a shadow copy of that "(cnt)*8" value in the loop.

That is wrong for a couple of reasons:

 (a) it adds register pressure for no good reason

 (b) when you shift left by that value, only the low 6 bits of that
     value matter

And guess how that "tcount" is updated? It's this:

        tcount -= sizeof(unsigned long);

in the loop, and thus the update of that shadow value of "(cnt)*8" is
done as

        addl    $-64, %ecx

inside that loop.

This is truly stupid and wasted work, because the low 6 bits of the
value - remember, the only part that matters - DO NOT CHANGE when
you do that.

So clang has decided that it needs to

 (a) avoid the "expensive" multiply-by-8 at the end by turning it into
     a repeated "add $-64" inside the loop

 (b) added register pressure and one extra instruction inside the loop

 (c) not realized that that extra instruction doesn't actually *do*
     anything, because it only affects the bits that don't actually
     matter in the end.

which is all kind of silly, wouldn't you agree.

Every single step there was pointless.

But with my other simplifications, the fact that clang does these
extra things is no longer all that noticeable. It *used* to be a
horrible disaster because the extra register pressure ended up meaning
that you had spills and all kinds of nastiness.
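
A small sketch of the "low 6 bits" argument above, for anyone who wants
to check it. It is not the kernel code: it assumes a 64-bit word (so
uint64_t stands in for "unsigned long"), and effective_shift(),
mask_via_shadow() and shadow_update_is_noop() are hypothetical helper
names made up for the illustration. The "& 63" models what a 64-bit x86
shift does with its count; in plain C a shift by 64 or more is undefined,
so the masking is spelled out explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the macro quoted above, fixed to a 64-bit word (assumption). */
#define bytemask_from_count(cnt) (~(~UINT64_C(0) << (cnt) * 8))

/* Hypothetical helper: the part of the count a 64-bit x86 shift looks at. */
static inline unsigned effective_shift(unsigned count)
{
	return count & 63;
}

/* Mask computed from a "shadow" copy of cnt*8 instead of from cnt itself. */
static inline uint64_t mask_via_shadow(unsigned shadow_bits)
{
	return ~(~UINT64_C(0) << effective_shift(shadow_bits));
}

/*
 * Hypothetical check: for every tail length 0..7 and any number of whole
 * words still to go, the mask from the shadow value is the same before and
 * after an "add $-64" style update. Returns 1 if that always holds.
 */
static int shadow_update_is_noop(unsigned max_words)
{
	for (unsigned tail = 0; tail < 8; tail++) {
		for (unsigned words = 0; words <= max_words; words++) {
			unsigned shadow = (tail + 8 * words) * 8;

			if (mask_via_shadow(shadow) != bytemask_from_count(tail))
				return 0;
			/* Unsigned wrap-around below zero also keeps the low 6 bits. */
			if (mask_via_shadow(shadow - 64) != bytemask_from_count(tail))
				return 0;
		}
	}
	return 1;
}
```

Under those assumptions, shadow_update_is_noop(16) returns 1: adding or
subtracting 64 never touches bits 0..5, so the mask only ever depends on
the tail byte count.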

Now the function is simple enough that even with the extra register
pressure, there's no need for spills.

.. until you look at the 32-bit version, which still needs spills. Gcc
does too, but clang just makes it worse by having the extra pointless
shadow variable.

If I cared about 32-bit, I might write up a bugzilla entry. As it is,
it's just "clang tries to be clever, and in the process is actually
being stupid".

             Linus