On Tue, Jun 15, 2021 at 4:57 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Matteo Croce
> > Sent: 15 June 2021 03:38
> >
> > Write a C version of memcpy() which uses the biggest data size allowed,
> > without generating unaligned accesses.
>
> I'm surprised that the C loop:
>
> > +	for (; count >= bytes_long; count -= bytes_long)
> > +		*d.ulong++ = *s.ulong++;
>
> ends up being faster than the ASM 'read lots' - 'write lots' loop.

I believe that's because the assembly version has some unaligned
access cases, which end up being trap-n-emulated in the OpenSBI
firmware, and that is a big overhead.

>
> Especially since there was an earlier patch to convert
> copy_to/from_user() to use the ASM 'read lots' - 'write lots' loop
> instead of a tight single register copy loop.
>
> I'd also guess that the performance needs to be measured on
> different classes of riscv cpu.
>
> A simple cpu will behave differently to one that can execute
> multiple instructions per clock.
> Any form of 'out of order' execution also changes things.
> The other big change is whether the cpu can to a memory
> read and write in the same clock.
>
> I'd guess that riscv exist with some/all of those features.

Regards,
Bin
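
P.S. To make the alignment point concrete, here is a minimal userspace
sketch of a copy routine that never issues an unaligned word access, so
nothing in it would need to be trapped and emulated by firmware. It is
not the code from the patch: the name sketch_memcpy and the word_t
typedef are made up for illustration, and when source and destination
do not share the same offset within a word it simply falls back to a
byte loop, which is an assumption made here to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * GCC/Clang extension: allow word_t accesses to alias other object types,
 * roughly what the kernel gets from building with -fno-strict-aliasing.
 */
typedef unsigned long __attribute__((__may_alias__)) word_t;

/* Hypothetical name, for illustration only. */
void *sketch_memcpy(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	const size_t mask = sizeof(word_t) - 1;

	/*
	 * Word copies are only attempted when both pointers have the same
	 * offset within a word; otherwise everything is copied byte by byte.
	 */
	if ((((uintptr_t)d ^ (uintptr_t)s) & mask) == 0) {
		/* Byte copy until dest (and therefore src) is word aligned. */
		for (; count && ((uintptr_t)d & mask); count--)
			*d++ = *s++;

		/* Aligned word-at-a-time loop, in the spirit of the quoted hunk. */
		for (; count >= sizeof(word_t); count -= sizeof(word_t)) {
			*(word_t *)d = *(const word_t *)s;
			d += sizeof(word_t);
			s += sizeof(word_t);
		}
	}

	/* Tail bytes, or the whole buffer in the mutually misaligned case. */
	while (count--)
		*d++ = *s++;

	return dest;
}
```

Whether a loop like this beats a 'read lots' - 'write lots' assembly
loop would still have to be measured per CPU class, as discussed above.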