Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64

Jason Gunthorpe <jgg@xxxxxxxxxx> · Wed, 24 Jan 2024 15:26:34 -0400

On Wed, Jan 24, 2024 at 05:22:05PM +0000, Catalin Marinas wrote:
> On Wed, Jan 24, 2024 at 09:27:19AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 24, 2024 at 12:40:29PM +0000, Catalin Marinas wrote:
> > 
> > > > Just to be clear, that means we should drop this patch ("arm64/io: add
> > > > memcpy_toio_64") for now, right?
> > > 
> > > In its current form yes, but that doesn't mean that memcpy_toio_64()
> > > cannot be reworked differently.
> > 
> > I gave up on touching memcpy_toio_64(), it doesn't work very well
> > because of the weak alignment
> > 
> > Instead I followed your suggestion to fix __iowrite64_copy()
> 
> I forgot the details. Was it to introduce an __iowrite512_copy()
> function or to simply use __iowrite64_copy() with a count of 8?

Count of 8

> Just invoking __iowrite64_copy() with a count of 8 wouldn't work well
> even if we have the writeq generating STR with an offset (well, it also
> increments the pointers, so I don't think Mark's optimisation would
> help). The copy implies loads and these would be interleaved with stores
> and potentially get in the way of write combining. An
> __iowrite512_copy() or maybe even an optimised __iowrite64_copy() for
> count 8 could do the loads first followed by the stores. You can use a
> special path in __iowrite64_copy() with __builtin_contant_p().

I did exactly the latter like this:

static inline void __const_memcpy_toio_aligned64(volatile u64 __iomem *to,
						 const u64 *from, size_t count)
{
	switch (count) {
	case 8:
		asm volatile("stp %x0, %x1, [%8, #16 * 0]\n"
			     "stp %x2, %x3, [%8, #16 * 1]\n"
			     "stp %x4, %x5, [%8, #16 * 2]\n"
			     "stp %x6, %x7, [%8, #16 * 3]\n"
			     :
			     : "rZ"(from[0]), "rZ"(from[1]), "rZ"(from[2]),
			       "rZ"(from[3]), "rZ"(from[4]), "rZ"(from[5]),
			       "rZ"(from[6]), "rZ"(from[7]), "r"(to));
		break;
	case 4:
		asm volatile("stp %x0, %x1, [%4, #16 * 0]\n"
			     "stp %x2, %x3, [%4, #16 * 1]\n"
			     :
			     : "rZ"(from[0]), "rZ"(from[1]), "rZ"(from[2]),
			       "rZ"(from[3]), "r"(to));
		break;
	case 2:
		asm volatile("stp %x0, %x1, [%2, #16 * 0]\n"
			     :
			     : "rZ"(from[0]), "rZ"(from[1]), "r"(to));
		break;
	case 1:
		__raw_writeq(*from, to);
		break;
	default:
		BUILD_BUG();
	}
}

void __iowrite64_copy_full(void __iomem *to, const void *from, size_t count);

static inline void __const_iowrite64_copy(void __iomem *to, const void *from,
					  size_t count)
{
	if (count == 8 || count == 4 || count == 2 || count == 1) {
		__const_memcpy_toio_aligned64(to, from, count);
		dgh();
	} else {
		__iowrite64_copy_full(to, from, count);
	}
}

#define __iowrite64_copy(to, from, count)                  \
	(__builtin_constant_p(count) ?                     \
		 __const_iowrite64_copy(to, from, count) : \
		 __iowrite64_copy_full(to, from, count))

And the out of line __iowrite64_copy_full() generates good
assembly that loops 8/4/2/1 sized blocks.

I was going to send it out yesterday but am waiting for some
conclusion on the STP.

https://github.com/jgunthorpe/linux/commits/mlx5_wc/

> void __iowrite64_copy(void __iomem *to, const void *from,
> 		      size_t count)
> {
> 	u64 __iomem *dst = to;
> 	const u64 *src = from;
> 	const u64 *end = src + count;
> 
> 	/*
> 	 * Try a 64-byte write, the CPUs tend to write-combine them.
> 	 */
> 	if (__builtin_contant_p(count) && count == 8) {
> 		__raw_writeq(*src, dst);
> 		__raw_writeq(*(src + 1), dst + 1);
> 		__raw_writeq(*(src + 2), dst + 2);
> 		__raw_writeq(*(src + 3), dst + 3);
> 		__raw_writeq(*(src + 4), dst + 4);
> 		__raw_writeq(*(src + 5), dst + 5);
> 		__raw_writeq(*(src + 6), dst + 6);
> 		__raw_writeq(*(src + 7), dst + 7);
> 		return;
> 	}

I already looked at this, clang with the "Qo" constraint does:

ffffffc08086e6ec:       f9400029        ldr     x9, [x1]
ffffffc08086e6f0:       91002008        add     x8, x0, #0x8
ffffffc08086e6f4:       f9000009        str     x9, [x0]
ffffffc08086e6f8:       f9400429        ldr     x9, [x1, #8]
ffffffc08086e6fc:       f9000109        str     x9, [x8]
ffffffc08086e700:       91004008        add     x8, x0, #0x10
ffffffc08086e704:       f9400829        ldr     x9, [x1, #16]
ffffffc08086e708:       f9000109        str     x9, [x8]
ffffffc08086e70c:       91006008        add     x8, x0, #0x18
ffffffc08086e710:       f9400c29        ldr     x9, [x1, #24]
ffffffc08086e714:       f9000109        str     x9, [x8]
ffffffc08086e718:       91008008        add     x8, x0, #0x20
ffffffc08086e71c:       f9401029        ldr     x9, [x1, #32]
ffffffc08086e720:       f9000109        str     x9, [x8]
ffffffc08086e724:       9100a008        add     x8, x0, #0x28
ffffffc08086e728:       f9401429        ldr     x9, [x1, #40]
ffffffc08086e72c:       f9000109        str     x9, [x8]
ffffffc08086e730:       9100c008        add     x8, x0, #0x30
ffffffc08086e734:       f9401829        ldr     x9, [x1, #48]
ffffffc08086e738:       f9000109        str     x9, [x8]
ffffffc08086e73c:       f9401c28        ldr     x8, [x1, #56]
ffffffc08086e740:       9100e009        add     x9, x0, #0x38
ffffffc08086e744:       f9000128        str     x8, [x9]

Which is not good. Gcc is a better, but not perfect.

> What we don't have is inlining of __iowrite64_copy() but if we need that
> we can move away from a weak symbol to a static inline.

Yes I did this as well. It helps s390 and x86 nicely too.

> Give this a go and see if it you get write-combining in your hardware.
> If the loads interleaves with stores get in the way, maybe we can resort
> to inline asm.

For reference the actual assembly (see post_send_nop()) that fails is:

   13534:       d503201f        nop
   13538:       93407ea1        sxtw    x1, w21
   1353c:       f100403f        cmp     x1, #0x10
   13540:       54000488        b.hi    135d0 <post_send_nop.isra.0+0x260>  // b.pmore
   13544:       a9408a63        ldp     x3, x2, [x19, #8]
   13548:       f84086c4        ldr     x4, [x22], #8
   1354c:       f9400042        ldr     x2, [x2]
   13550:       8b030283        add     x3, x20, x3
   13554:       8b030042        add     x2, x2, x3
   13558:       f9000044        str     x4, [x2]
   1355c:       91002294        add     x20, x20, #0x8
   13560:       11000ab5        add     w21, w21, #0x2
   13564:       f101029f        cmp     x20, #0x40
   13568:       54fffe81        b.ne    13538 <post_send_nop.isra.0+0x1c8>  // b.any
   1356c:       d50320df        hint    #0x6

Not very good code the compiler wrote (the main issue is that it
reloads the dest pointer every iteration), but still, all those loads
are coming from memory that was recently touched so should be in-cache
most of the time. So it isn't like we are sitting around waiting for a
lengthy dcache fill and timing out the WC buffer.

However, it is 136 instructions, so it feels like the issue may be the
write combining buffer auto-flushes in less. Maybe it auto-flushes
after 128/64/32/16/8 cycles now. I know there has been a tension to
reduce WC latency vs maximum aggregation.

The suggestion that it should not have any interleaving instructions
and use STP came from our CPU architecture team.

The assembly I have been able to get tested from this series that did
works is this:

ffffffc08086ec84:       d5033e9f        dsb     st
ffffffc08086ec88:       f941de6b        ldr     x11, [x19, #952]
ffffffc08086ec8c:       f941da6c        ldr     x12, [x19, #944]
ffffffc08086ec90:       f940016b        ldr     x11, [x11]
ffffffc08086ec94:       8b0c016b        add     x11, x11, x12
ffffffc08086ec98:       a9002969        stp     x9, x10, [x11]
ffffffc08086ec9c:       a9012168        stp     x8, x8, [x11, #16]
ffffffc08086eca0:       a9022168        stp     x8, x8, [x11, #32]
ffffffc08086eca4:       a9032168        stp     x8, x8, [x11, #48]
ffffffc08086eca8:       d50320df        hint    #0x6

The revised __iowrite64_copy() version also creates this assembly.

The ST4 based thing in userspace also works.

Remember there are two related topics here.. mlx5 would like high
frequency of large TLP generation, but doesn't care about raw
performance. If the 24 instructions clang generates does that then
great.

hns/broadcom/others need the large TLP and care about performance. In
that case the stp block is the best we can do in the kernel as st4 is
off the table.

I would like the architecture code to do a good job for performance
since it is a generic API for all drivers.

Regarding the 8x STR option, I have to get that tested.

Jason