On Tue, Jan 16, 2024 at 02:51:21PM -0400, Jason Gunthorpe wrote:
> Hey Catalin,
>
> I'm just revising this and I'm wondering if you know why ARM64 has this:
>
>  #define __raw_writeq __raw_writeq
>  static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
>  {
>  	asm volatile("str %x0, [%1]" : : "rZ" (val), "r" (addr));
>  }
>
> Instead of
>
>  #define __raw_writeq __raw_writeq
>  static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
>  {
>  	asm volatile("str %x0, %1" : : "rZ" (val), "m" (*(volatile u64 *)addr));
>  }
>
> ?? Like x86 has.

I believe this is for the same reason as doing so in all of our other IO
accessors: we've deliberately ensured that our IO accessors use a single
base register with no offset, as this is the only form that HW can
represent in ESR_ELx.ISS.SRT when reporting a stage-2 abort, which a
hypervisor may use for emulating IO.

Mark.

> The codegen for a 64-byte unrolled copy loop is way better with "m" on
> gcc:
>
> "r" constraint (gcc 13.2.0):
>
> .L3:
>         ldr     x3, [x1]
>         str     x3, [x0]
>         ldr     x3, [x1, 8]
>         add     x4, x0, 8
>         str     x3, [x4]
>         ldr     x3, [x1, 16]
>         add     x4, x0, 16
>         str     x3, [x4]
>         ldr     x3, [x1, 24]
>         add     x4, x0, 24
>         str     x3, [x4]
>         ldr     x3, [x1, 32]
>         add     x4, x0, 32
>         str     x3, [x4]
>         ldr     x3, [x1, 40]
>         add     x4, x0, 40
>         str     x3, [x4]
>         ldr     x3, [x1, 48]
>         add     x4, x0, 48
>         str     x3, [x4]
>         ldr     x3, [x1, 56]
>         add     x4, x0, 56
>         str     x3, [x4]
>         add     x1, x1, 64
>         add     x0, x0, 64
>         cmp     x2, x1
>         bhi     .L3
>
> "m" constraint:
>
> .L3:
>         ldp     x10, x9, [x1]
>         ldp     x8, x7, [x1, 16]
>         ldp     x6, x5, [x1, 32]
>         ldp     x4, x3, [x1, 48]
>         str     x10, [x0]
>         str     x9, [x0, 8]
>         str     x8, [x0, 16]
>         str     x7, [x0, 24]
>         str     x6, [x0, 32]
>         str     x5, [x0, 40]
>         str     x4, [x0, 48]
>         str     x3, [x0, 56]
>         add     x1, x1, 64
>         add     x0, x0, 64
>         cmp     x2, x1
>         bhi     .L3
>
> clang 17 doesn't do any better either way; it doesn't seem to do
> anything with 'm', but I guess it could.
>
> clang 17 (either):
>
> .LBB0_2:                                // =>This Inner Loop Header: Depth=1
>         ldp     x9, x10, [x1]
>         add     x14, x0, #8
>         add     x18, x0, #40
>         ldp     x11, x12, [x1, #16]
>         add     x2, x0, #48
>         add     x3, x0, #56
>         ldp     x13, x15, [x1, #32]
>         ldp     x16, x17, [x1, #48]
>         str     x9, [x0]
>         str     x10, [x14]
>         add     x9, x0, #16
>         add     x10, x0, #24
>         add     x14, x0, #32
>         str     x11, [x9]
>         str     x12, [x10]
>         str     x13, [x14]
>         str     x15, [x18]
>         str     x16, [x2]
>         str     x17, [x3]
>         add     x1, x1, #64
>         add     x0, x0, #64
>         cmp     x1, x8
>         b.lo    .LBB0_2
>
> It doesn't matter for this series, but it seems like something ARM64
> might want to look at to improve.
>
> Jason
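
[Editor's note: for reference, a store loop of roughly the following shape
reproduces the codegen quoted above. This is a minimal sketch: the function
name copy_to_io_64() and the fixed eight-quadword unroll are illustrative
rather than the exact code that was compiled, and it assumes the kernel's
__raw_writeq() and __iomem annotations.]

#include <linux/io.h>
#include <linux/types.h>

/*
 * Copy 'count' u64s to MMIO space, 64 bytes per loop iteration;
 * 'count' is assumed to be a non-zero multiple of 8. With the "r"
 * constraint, every __raw_writeq() needs the full destination
 * address in a base register, so GCC emits a separate add for each
 * non-zero offset; with "m" it could instead fold the offsets into
 * the str addressing mode (and pair up the loads, as shown above).
 */
static void copy_to_io_64(u64 __iomem *to, const u64 *from, size_t count)
{
	const u64 *end = from + count;

	do {
		__raw_writeq(from[0], to + 0);
		__raw_writeq(from[1], to + 1);
		__raw_writeq(from[2], to + 2);
		__raw_writeq(from[3], to + 3);
		__raw_writeq(from[4], to + 4);
		__raw_writeq(from[5], to + 5);
		__raw_writeq(from[6], to + 6);
		__raw_writeq(from[7], to + 7);
		from += 8;
		to += 8;
	} while (from < end);
}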
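[Editor's note: to make the syndrome argument concrete: on a stage-2 abort
the hypervisor reads ESR_EL2, and can only emulate the access directly when
the ISV bit says the SRT/SAS/WnR fields describe the trapped instruction.
The sketch below shows that decode path; struct vcpu, vcpu_reg() and
mmio_write() are hypothetical stand-ins, while the bit positions are the
architected ISS encoding for Data Aborts.]

#include <linux/types.h>

#define ESR_ISS_ISV		(1ULL << 24)	/* syndrome fields valid */
#define ESR_ISS_SAS_SHIFT	22		/* access size, log2(bytes) */
#define ESR_ISS_SAS_MASK	0x3ULL
#define ESR_ISS_SRT_SHIFT	16		/* transfer register Xt */
#define ESR_ISS_SRT_MASK	0x1fULL
#define ESR_ISS_WNR		(1ULL << 6)	/* write, not read */

/* Hypothetical vCPU state; a real hypervisor has its own. */
struct vcpu {
	u64 regs[31];
};

static u64 vcpu_reg(struct vcpu *vcpu, unsigned int n)
{
	return n == 31 ? 0 : vcpu->regs[n];	/* SRT == 31 means XZR */
}

/* Hypothetical backend that performs the emulated device write. */
bool mmio_write(u64 ipa, u64 val, unsigned int bytes);

/*
 * Called for a guest access that faulted at stage 2 on an emulated
 * MMIO address (ipa). If ISV == 0, the hardware could not describe
 * the instruction in the syndrome, and the hypervisor would have to
 * fetch and decode the guest instruction itself instead.
 */
static bool emulate_guest_store(struct vcpu *vcpu, u64 esr, u64 ipa)
{
	unsigned int srt, bytes;

	if (!(esr & ESR_ISS_ISV) || !(esr & ESR_ISS_WNR))
		return false;

	srt = (esr >> ESR_ISS_SRT_SHIFT) & ESR_ISS_SRT_MASK;
	bytes = 1U << ((esr >> ESR_ISS_SAS_SHIFT) & ESR_ISS_SAS_MASK);

	return mmio_write(ipa, vcpu_reg(vcpu, srt), bytes);
}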