On Tue, Jun 22, 2021 at 10:38 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Nick Kossifidis
> > Sent: 22 June 2021 02:08
> >
> > On 2021-06-17 18:27, Matteo Croce wrote:
> > > +
> > > +void *__memset(void *s, int c, size_t count)
> > > +{
> > > +	union types dest = { .u8 = s };
> > > +
> > > +	if (count >= MIN_THRESHOLD) {
> > > +		const int bytes_long = BITS_PER_LONG / 8;
> >
> > You could make 'const int bytes_long = BITS_PER_LONG / 8;'
>
> What is wrong with sizeof (long) ?
> ...

Nothing, I guess that BITS_PER_LONG is just (sizeof(long) * 8) anyway.

> > > +		unsigned long cu = (unsigned long)c;
> > > +
> > > +		/* Compose an ulong with 'c' repeated 4/8 times */
> > > +		cu |= cu << 8;
> > > +		cu |= cu << 16;
> > > +#if BITS_PER_LONG == 64
> > > +		cu |= cu << 32;
> > > +#endif
> > > +
> >
> > You don't have to create cu here, you'll fill dest buffer with 'c'
> > anyway so after filling up enough 'c's to be able to grab an aligned
> > word full of them from dest, you can just grab that word and keep
> > filling up dest with it.
>
> That will be a lot slower - especially if run on something like x86.
> A write-read of the same size is optimised by the store-load forwarder.
> But the byte write, word read will have to go via the cache.
>
> You can just write:
> 	cu = (unsigned long)c * 0x0101010101010101ull;
> and let the compiler sort out the best way to generate the constant.
>

Interesting. I see that most compilers do an integer multiplication; is it
faster than three shifts and three ORs?

clang on riscv generates even more instructions to create the immediate:

unsigned long repeat_shift(int c)
{
	unsigned long cu = (unsigned long)c;
	cu |= cu << 8;
	cu |= cu << 16;
	cu |= cu << 32;
	return cu;
}

unsigned long repeat_mul(int c)
{
	return (unsigned long)c * 0x0101010101010101ull;
}

repeat_shift:
	slli	a1, a0, 8
	or	a0, a0, a1
	slli	a1, a0, 16
	or	a0, a0, a1
	slli	a1, a0, 32
	or	a0, a0, a1
	ret

repeat_mul:
	lui	a1, 4112
	addiw	a1, a1, 257
	slli	a1, a1, 16
	addi	a1, a1, 257
	slli	a1, a1, 16
	addi	a1, a1, 257
	mul	a0, a0, a1
	ret

> >
> > > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > > +		/* Fill the buffer one byte at time until the destination
> > > +		 * is aligned on a 32/64 bit boundary.
> > > +		 */
> > > +		for (; count && dest.uptr % bytes_long; count--)
> >
> > You could reuse & mask here instead of % bytes_long.
> >
> > > +			*dest.u8++ = c;
> > > +#endif
> > I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> > memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> > and even if it did we are talking about a loop that will run just a few
> > times to reach the alignment boundary (worst case scenario it'll run 7
> > times), I don't think we gain much here, even for archs that have
> > efficient unaligned access.
>
> With CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS it probably isn't worth
> even checking the alignment.
> While aligning the copy will be quicker for an unaligned buffer they
> almost certainly don't happen often enough to worry about.
> In any case you'd want to do a misaligned word write to the start
> of the buffer - not separate byte writes.
> Provided the buffer is long enough you can also do a misaligned write
> to the end of the buffer before filling from the start.
>

I don't understand this one: a misaligned write here is ~30x slower than an
aligned one, because it gets trapped and emulated in SBI. How can this be
convenient?
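(Going back to the earlier point about the alignment loop: my reading of the
"& mask" suggestion is something like the sketch below. Untested, and 'mask'
is not in the hunk quoted above; I'm assuming it would simply be
bytes_long - 1.)

	const int mask = bytes_long - 1;

	/* bytes_long is a power of two, so '& mask' selects the same
	 * low address bits that '% bytes_long' tested.
	 */
	for (; count && (dest.uptr & mask); count--)
		*dest.u8++ = c;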
> I suspect you may need either barrier() or use a ptr to packed
> to avoid the perverted 'undefined behaviour' fubar.
>

Which UB are you referring to?

Regards,
--
per aspera ad upstream
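P.S. On the "ptr to packed" approach: the pattern I've seen elsewhere is
roughly the one below (only a sketch with made-up names, not code from this
patch). The packed, may_alias struct tells the compiler that the access may
be misaligned and may alias other types:

struct unaligned_ulong {
	unsigned long x;
} __attribute__((__packed__, __may_alias__));

static inline void store_ulong_unaligned(void *p, unsigned long val)
{
	((struct unaligned_ulong *)p)->x = val;
}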