RE: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

David Laight <David.Laight@xxxxxxxxxx> · Fri, 15 Sep 2023 07:57:22 +0000

From: Evan Green
> Sent: 14 September 2023 17:37
> 
> On Thu, Sep 14, 2023 at 8:55 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > From: Evan Green
> > > Sent: 14 September 2023 16:01
> > >
> > > On Thu, Sep 14, 2023 at 1:47 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > > >
> > > > From: Geert Uytterhoeven
> > > > > Sent: 14 September 2023 08:33
> > > > ...
> > > > > > >     rzfive:
> > > > > > >         cpu0: Ratio of byte access time to unaligned word access is
> > > > > > > 1.05, unaligned accesses are fast
> > > > > >
> > > > > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > > > > to 1. If you reboot a few times, what kind of variance do you get on
> > > > > > this?
> > > > >
> > > > > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
> > > >
> > > > Would that match zero overhead unless the access crosses a
> > > > cache line boundary?
> > > > (I can't remember whether the test is using increasing addresses.)
> > >
> > > Yes, the test does use increasing addresses, it copies across 4 pages.
> > > We start with a warmup, so caching effects beyond L1 are largely not
> > > taken into account.
> >
> > That seems entirely excessive.
> > If you want to avoid data cache issues (which probably do)
> > then just repeating a single access would almost certainly
> > suffice.
> > Repeatedly using a short buffer (say 256 bytes) won't add
> > much loop overhead.
> > Although you may want to do a test that avoids transfers
> > that cross cache line and especially page boundaries.
> > Either of those could easily be much slower than a read
> > that is entirely within a cache line.
> 
> We won't be faulting on any of these pages, and they should remain in
> the TLB, so I don't expect many page boundary specific effects. If
> there is a steep penalty for misaligned loads across a cache line,
> such that it's worse than doing byte accesses, I want the test results
> to be dinged for that.

That is an entirely different issue.

Are you absolutely certain that the reason 8 byte loads take
as long as a 64-bit mis-aligned load isn't because the entire
test is limited by L1 cache fills?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)