On Tue, Jun 22, 2021 at 3:07 AM Nick Kossifidis <mick@xxxxxxxxxxxx> wrote: > > Στις 2021-06-17 18:27, Matteo Croce έγραψε: > > + > > +void *__memset(void *s, int c, size_t count) > > +{ > > + union types dest = { .u8 = s }; > > + > > + if (count >= MIN_THRESHOLD) { > > + const int bytes_long = BITS_PER_LONG / 8; > > You could make 'const int bytes_long = BITS_PER_LONG / 8;' and 'const > int mask = bytes_long - 1;' from your memcpy patch visible to memset as > well (static const...) and use them here (mask would make more sense to > be named as word_mask). > I'll do > > + unsigned long cu = (unsigned long)c; > > + > > + /* Compose an ulong with 'c' repeated 4/8 times */ > > + cu |= cu << 8; > > + cu |= cu << 16; > > +#if BITS_PER_LONG == 64 > > + cu |= cu << 32; > > +#endif > > + > > You don't have to create cu here, you'll fill dest buffer with 'c' > anyway so after filling up enough 'c's to be able to grab an aligned > word full of them from dest, you can just grab that word and keep > filling up dest with it. > I tried that, but this way I have to wait 8 bytes more before starting the memset. And, the machine code needed to generate 'cu' is just 6 instructions on riscv: slli a5,a0,8 or a5,a5,a0 slli a0,a5,16 or a0,a0,a5 slli a5,a0,32 or a0,a5,a0 so probably it's not worth it. > > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > > + /* Fill the buffer one byte at time until the destination > > + * is aligned on a 32/64 bit boundary. > > + */ > > + for (; count && dest.uptr % bytes_long; count--) > > You could reuse & mask here instead of % bytes_long. > Sure, even if the machine code will be the same. > > + *dest.u8++ = c; > > +#endif > > I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your > memcpy patch, is it worth it here ? To begin with riscv doesn't set it > and even if it did we are talking about a loop that will run just a few > times to reach the alignment boundary (worst case scenario it'll run 7 > times), I don't think we gain much here, even for archs that have > efficient unaligned access. It doesn't _now_, but maybe in the future we will have a CPU which handles unaligned accesses correctly! -- per aspera ad upstream