Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

Evan Green <evan@xxxxxxxxxxxx> · Thu, 14 Sep 2023 08:01:01 -0700

On Thu, Sep 14, 2023 at 1:47 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Geert Uytterhoeven
> > Sent: 14 September 2023 08:33
> ...
> > > >     rzfive:
> > > >         cpu0: Ratio of byte access time to unaligned word access is
> > > > 1.05, unaligned accesses are fast
> > >
> > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > to 1. If you reboot a few times, what kind of variance do you get on
> > > this?
> >
> > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
>
> Would that match zero overhead unless the access crosses a
> cache line boundary?
> (I can't remember whether the test is using increasing addresses.)

Yes, the test does use increasing addresses, it copies across 4 pages.
We start with a warmup, so caching effects beyond L1 are largely not
taken into account.

>
> ...
> > > >     vexriscv/orangecrab:
> > > >
> > > >         cpu0: Ratio of byte access time to unaligned word access is
> > > > 0.00, unaligned accesses are slow
> >
> > cpu0: Ratio of byte access time to unaligned word access is 0.00417,
> > unaligned accesses are slow
> >
> > > > I am a bit surprised by the near-zero values.  Are these expected?
> > >
> > > This could be expected, if firmware is trapping the unaligned accesses
> > > and coming out >100x slower than a native access. If you're interested
> > > in getting a little more resolution, you could try to print a few more
> > > decimal places with something like (sorry gmail mangles the whitespace
> > > on this):
>
> I'd expect one of three possible values:
> - 1.0x: Basically zero cost except for cache line/page boundaries.
> - ~2: Hardware does two reads and merges the values.
> - >100: Trap fixed up in software.
>
> I'd think the '2' case could be considered fast.
> You only need to time one access to see if it was a fault.

We're comparing misaligned word accesses with byte accesses of the
same total size. So 1.0 means a misaligned load is basically no
different from 8 byte loads. The goal was to help people that are
forced to do odd loads and stores decide whether they are better off
moving by bytes or by misaligned words. (In contrast, the answer to
"should I do a misaligned word load or an aligned word load" is
generally always "do the aligned one if you can", so comparing those
two things didn't seem as useful).

We opted for 1.0 as a cutoff, since even at 1.05, you get a boost from
doing misaligned word loads over byte copies. I asked about the
variance because I don't want to see machines that change their mind
from boot to boot. I originally considered trying to create a "gray
zone" where the answer goes back to UNKNOWN, but in the end that just
moves the fiddly point rather than really eliminating it.

You're right that in theory we just need one perfect access to test,
but testing only once makes it susceptible to hiccups. We went with
doing it many times in a fixed period and taking the minimum to
hopefully remove noise like NMI-like things, branch prediction misses,
or cache eviction.

Geert,
Thanks for providing the numbers. Yes, we could add another digit to
the print. Though if you already know you're at least 100x slower,
maybe knowing exactly how much slower isn't super meaningful, just
very much avoid unaligned accesses on these systems :). Hopefully over
time the number of systems like this will dwindle.

-Evan