On Mon, Sep 13, 2021 at 1:35 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> > > These ended up getting rejected by Linus, so I'm going to hold off on
> > > this for now. If they're really out of lib/ then I'll take the C
> > > routines in arch/riscv, but either way it's an issue for the next
> > > release.
> >
> > Agree, we should take the C routine in arch/riscv for the common
> > implementation. If any vendor wants a custom implementation, they could
> > use the alternative framework in errata for string operations.
>
> I thought the asm ones were significantly faster because
> they were less affected by read latency.
>
> (But they were horribly broken for misaligned transfers.)

I can get the exact same performance (and very similar machine code) in C
with this on top of the C memset implementation:

--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -112,9 +112,12 @@ EXPORT_SYMBOL(__memmove);
 void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
 EXPORT_SYMBOL(memmove);
 
+#define BATCH 4
+
 void *__memset(void *s, int c, size_t count)
 {
 	union types dest = { .as_u8 = s };
+	int i;
 
 	if (count >= MIN_THRESHOLD) {
 		unsigned long cu = (unsigned long)c;
@@ -138,8 +141,12 @@ void *__memset(void *s, int c, size_t count)
 		}
 
 		/* Copy using the largest size allowed */
-		for (; count >= BYTES_LONG; count -= BYTES_LONG)
-			*dest.as_ulong++ = cu;
+		for (; count >= BYTES_LONG * BATCH; count -= BYTES_LONG * BATCH) {
+#pragma GCC unroll 4
+			for (i = 0; i < BATCH; i++)
+				dest.as_ulong[i] = cu;
+			dest.as_ulong += BATCH;
+		}
 	}

On the BeagleV, the memset speed with the different batch sizes is:

1 (stock): 267 Mb/s
2:         272 Mb/s
4:         276 Mb/s
8:         276 Mb/s

The drawback of a bigger batch size is that memset falls back to a
single-byte copy if the buffer is too small.

Regards,
-- 
per aspera ad upstream