On Sat, Mar 4, 2023 at 12:31 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> Good news: gcc provides a lot of control as to how it inlines string
> ops, most notably:
>        -mstringop-strategy=alg

Note that any static decision is always going to be crap somewhere.
You can make it do the "optimal" thing for any particular machine, but
I consider that to be just garbage.

What I would actually like to see is the compiler always generate an
out-of-line call for the "big enough to not just do inline trivially"
case, but do so with the "rep stosb/movsb" calling convention.

Then we'd just mark those with objdump, and patch it up dynamically to
either use the right out-of-line memset/memcpy function, *or* just
replace it entirely with 'rep stosb' inline.

Because the cores that do this right *do* exist, despite your hatred
of the rep string instructions. At least Borislav claims that the
modern AMD cores do better with 'rep stosb'.

In particular, see what we do for 'clear_user()', where we effectively
can do the above (because unlike memset, we control it entirely). See
commit 0db7058e8e23 ("x86/clear_user: Make it faster").

Once we'd have that kind of infrastructure, we could then control
exactly what 'memset()' does.

And I note that we should probably have added Borislav to the cc when
memset came up, exactly because he's been looking at it anyway. Even
if AMD seems to have slightly different optimization rules than Intel
cores probably do. But again, that only emphasizes the whole "we
should not have a static choice here".

                Linus