On Thu, Sep 14, 2023 at 1:47 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Geert Uytterhoeven
> > Sent: 14 September 2023 08:33
> ...
> > > > rzfive:
> > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > 1.05, unaligned accesses are fast
> > >
> > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > to 1. If you reboot a few times, what kind of variance do you get on
> > > this?
> >
> > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
>
> Would that match zero overhead unless the access crosses a
> cache line boundary?
> (I can't remember whether the test is using increasing addresses.)

Yes, the test does use increasing addresses, it copies across 4 pages.
We start with a warmup, so caching effects beyond L1 are largely not
taken into account.

>
> ...
> > > > vexriscv/orangecrab:
> > > >
> > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > 0.00, unaligned accesses are slow
> >
> > cpu0: Ratio of byte access time to unaligned word access is 0.00417,
> > unaligned accesses are slow
> >
> > > > I am a bit surprised by the near-zero values. Are these expected?
> > >
> > > This could be expected, if firmware is trapping the unaligned accesses
> > > and coming out >100x slower than a native access. If you're interested
> > > in getting a little more resolution, you could try to print a few more
> > > decimal places with something like (sorry gmail mangles the whitespace
> > > on this):
>
> I'd expect one of three possible values:
> - 1.0x: Basically zero cost except for cache line/page boundaries.
> - ~2: Hardware does two reads and merges the values.
> - >100: Trap fixed up in software.
>
> I'd think the '2' case could be considered fast.
> You only need to time one access to see if it was a fault.

We're comparing misaligned word accesses with byte accesses of the same
total size. So 1.0 means a misaligned load is basically no different
from 8 byte loads.
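
To make the comparison concrete, here is a rough userspace sketch of the
idea. It is not the kernel's probe code, and every name in it is
hypothetical. It uses a fixed number of rounds rather than a fixed time
period, treats a ratio of exactly 1.0 as slow (an assumption), and uses
memcpy() to express the possibly misaligned word access without undefined
behaviour, so on strict-alignment targets the compiler may lower it to
byte loads anyway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical names throughout; this is an illustration only. */
enum misaligned_speed { MISALIGNED_SLOW, MISALIGNED_FAST };

typedef void (*copy_fn)(unsigned char *dst, const unsigned char *src,
                        size_t len);

/* Copy one byte at a time. */
static void copy_bytes(unsigned char *dst, const unsigned char *src,
                       size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/*
 * Copy one word at a time. Pass a misaligned src/dst (e.g. buf + 1) to
 * exercise misaligned word accesses; leftover bytes are copied singly.
 */
static void copy_words(unsigned char *dst, const unsigned char *src,
                       size_t len)
{
    size_t i = 0;

    for (; i + sizeof(unsigned long) <= len; i += sizeof(unsigned long)) {
        unsigned long w;

        memcpy(&w, src + i, sizeof(w));
        memcpy(dst + i, &w, sizeof(w));
    }
    for (; i < len; i++)
        dst[i] = src[i];
}

/* Wall-clock time in nanoseconds (C11 timespec_get, not monotonic). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* One warmup pass, then the fastest of `rounds` timed passes. */
static uint64_t best_time_ns(copy_fn fn, unsigned char *dst,
                             const unsigned char *src, size_t len,
                             int rounds)
{
    uint64_t best = UINT64_MAX;

    assert(rounds > 0);
    fn(dst, src, len);
    for (int i = 0; i < rounds; i++) {
        uint64_t start = now_ns();

        fn(dst, src, len);
        uint64_t elapsed = now_ns() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Byte time over word time, in integer hundredths (105 means 1.05). */
static unsigned long ratio_x100(uint64_t byte_time, uint64_t word_time)
{
    assert(word_time > 0);
    return (unsigned long)(byte_time * 100 / word_time);
}

/* Cutoff at 1.0: fast only if the word copy beat the byte copy. */
static enum misaligned_speed classify(uint64_t byte_time,
                                      uint64_t word_time)
{
    return word_time < byte_time ? MISALIGNED_FAST : MISALIGNED_SLOW;
}
```

With byte and word times of 105 and 100, ratio_x100() gives 105 (i.e.
1.05) and classify() says fast. With 417 and 100000 the two-digit ratio
truncates to 0, which is why a trapped access can show up as 0.00.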

The goal was to help people who are forced to do odd loads and stores
decide whether they are better off moving by bytes or by misaligned
words. (In contrast, the answer to "should I do a misaligned word load
or an aligned word load" is generally "do the aligned one if you can",
so comparing those two things didn't seem as useful).

We opted for 1.0 as a cutoff, since even at 1.05, you get a boost from
doing misaligned word loads over byte copies. I asked about the
variance because I don't want to see machines that change their mind
from boot to boot. I originally considered trying to create a "gray
zone" where the answer goes back to UNKNOWN, but in the end that just
moves the fiddly point rather than really eliminating it.

You're right that in theory we just need one perfect access to test,
but testing only once makes it susceptible to hiccups. We went with
doing it many times in a fixed period and taking the minimum to
hopefully remove noise like NMI-like things, branch prediction misses,
or cache eviction.

Geert,
Thanks for providing the numbers. Yes, we could add another digit to
the print. Though if you already know you're at least 100x slower,
maybe knowing exactly how much slower isn't super meaningful; just very
much avoid unaligned accesses on these systems :). Hopefully over time
the number of systems like this will dwindle.

-Evan