On Tue, Jun 22, 2021 at 10:38 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Nick Kossifidis
> > Sent: 22 June 2021 02:08
> >
> > On 2021-06-17 18:27, Matteo Croce wrote:
> > > +
> > > +void *__memset(void *s, int c, size_t count)
> > > +{
> > > +	union types dest = { .u8 = s };
> > > +
> > > +	if (count >= MIN_THRESHOLD) {
> > > +		const int bytes_long = BITS_PER_LONG / 8;
> >
> > You could make 'const int bytes_long = BITS_PER_LONG / 8;'
>
> What is wrong with sizeof (long) ?
> ...

Nothing, I guess that BITS_PER_LONG is just (sizeof(long) * 8) anyway.

> > > +		unsigned long cu = (unsigned long)c;
> > > +
> > > +		/* Compose an ulong with 'c' repeated 4/8 times */
> > > +		cu |= cu << 8;
> > > +		cu |= cu << 16;
> > > +#if BITS_PER_LONG == 64
> > > +		cu |= cu << 32;
> > > +#endif
> > > +
> >
> > You don't have to create cu here, you'll fill dest buffer with 'c'
> > anyway so after filling up enough 'c's to be able to grab an aligned
> > word full of them from dest, you can just grab that word and keep
> > filling up dest with it.
>
> That will be a lot slower - especially if run on something like x86.
> A write-read of the same size is optimised by the store-load forwarder.
> But the byte write, word read will have to go via the cache.
>
> You can just write:
> 	cu = (unsigned long)c * 0x0101010101010101ull;
> and let the compiler sort out the best way to generate the constant.
>

Interesting. I see that most compilers do an integer multiplication; is it
faster than three shifts and three ORs?

clang on riscv generates even more instructions to create the immediate:

unsigned long repeat_shift(int c)
{
	unsigned long cu = (unsigned long)c;
	cu |= cu << 8;
	cu |= cu << 16;
	cu |= cu << 32;
	return cu;
}

unsigned long repeat_mul(int c)
{
	return (unsigned long)c * 0x0101010101010101ull;
}

repeat_shift:
	slli	a1, a0, 8
	or	a0, a0, a1
	slli	a1, a0, 16
	or	a0, a0, a1
	slli	a1, a0, 32
	or	a0, a0, a1
	ret

repeat_mul:
	lui	a1, 4112
	addiw	a1, a1, 257
	slli	a1, a1, 16
	addi	a1, a1, 257
	slli	a1, a1, 16
	addi	a1, a1, 257
	mul	a0, a0, a1
	ret

> >
> > > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > > +		/* Fill the buffer one byte at time until the destination
> > > +		 * is aligned on a 32/64 bit boundary.
> > > +		 */
> > > +		for (; count && dest.uptr % bytes_long; count--)
> >
> > You could reuse & mask here instead of % bytes_long.
> >
> > > +			*dest.u8++ = c;
> > > +#endif
> > I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> > memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> > and even if it did we are talking about a loop that will run just a few
> > times to reach the alignment boundary (worst case scenario it'll run 7
> > times), I don't think we gain much here, even for archs that have
> > efficient unaligned access.
>
> With CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS it probably isn't worth
> even checking the alignment.
> While aligning the copy will be quicker for an unaligned buffer they
> almost certainly don't happen often enough to worry about.
> In any case you'd want to do a misaligned word write to the start
> of the buffer - not separate byte writes.
> Provided the buffer is long enough you can also do a misaligned write
> to the end of the buffer before filling from the start.
>

I don't understand this one: a misaligned write here is ~30x slower than an
aligned one, because it gets trapped and emulated in SBI. How can this be
convenient?
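(Going back to the earlier point about the alignment loop: my reading of the
"& mask" suggestion is something like the sketch below. Untested, and 'mask'
is not in the hunk quoted above; I'm assuming it would simply be
bytes_long - 1.)

	const int mask = bytes_long - 1;

	/* bytes_long is a power of two, so '& mask' selects the same
	 * low address bits that '% bytes_long' tested.
	 */
	for (; count && (dest.uptr & mask); count--)
		*dest.u8++ = c;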
> I suspect you may need either barrier() or use a ptr to packed
> to avoid the perverted 'undefined behaviour' fubar.
>

Which UB are you referring to?

Regards,
--
per aspera ad upstream
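P.S. On the "ptr to packed" approach: the pattern I've seen elsewhere is
roughly the one below (only a sketch with made-up names, not code from this
patch). The packed, may_alias struct tells the compiler that the access may
be misaligned and may alias other types:

struct unaligned_ulong {
	unsigned long x;
} __attribute__((__packed__, __may_alias__));

static inline void store_ulong_unaligned(void *p, unsigned long val)
{
	((struct unaligned_ulong *)p)->x = val;
}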