On Wed, Jan 24, 2024 at 05:22:05PM +0000, Catalin Marinas wrote: > On Wed, Jan 24, 2024 at 09:27:19AM -0400, Jason Gunthorpe wrote: > > On Wed, Jan 24, 2024 at 12:40:29PM +0000, Catalin Marinas wrote: > > > > > > Just to be clear, that means we should drop this patch ("arm64/io: add > > > > memcpy_toio_64") for now, right? > > > > > > In its current form yes, but that doesn't mean that memcpy_toio_64() > > > cannot be reworked differently. > > > > I gave up on touching memcpy_toio_64(), it doesn't work very well > > because of the weak alignment > > > > Instead I followed your suggestion to fix __iowrite64_copy() > > I forgot the details. Was it to introduce an __iowrite512_copy() > function or to simply use __iowrite64_copy() with a count of 8? Count of 8 > Just invoking __iowrite64_copy() with a count of 8 wouldn't work well > even if we have the writeq generating STR with an offset (well, it also > increments the pointers, so I don't think Mark's optimisation would > help). The copy implies loads and these would be interleaved with stores > and potentially get in the way of write combining. An > __iowrite512_copy() or maybe even an optimised __iowrite64_copy() for > count 8 could do the loads first followed by the stores. You can use a > special path in __iowrite64_copy() with __builtin_contant_p(). I did exactly the latter like this: static inline void __const_memcpy_toio_aligned64(volatile u64 __iomem *to, const u64 *from, size_t count) { switch (count) { case 8: asm volatile("stp %x0, %x1, [%8, #16 * 0]\n" "stp %x2, %x3, [%8, #16 * 1]\n" "stp %x4, %x5, [%8, #16 * 2]\n" "stp %x6, %x7, [%8, #16 * 3]\n" : : "rZ"(from[0]), "rZ"(from[1]), "rZ"(from[2]), "rZ"(from[3]), "rZ"(from[4]), "rZ"(from[5]), "rZ"(from[6]), "rZ"(from[7]), "r"(to)); break; case 4: asm volatile("stp %x0, %x1, [%4, #16 * 0]\n" "stp %x2, %x3, [%4, #16 * 1]\n" : : "rZ"(from[0]), "rZ"(from[1]), "rZ"(from[2]), "rZ"(from[3]), "r"(to)); break; case 2: asm volatile("stp %x0, %x1, [%2, #16 * 0]\n" : : "rZ"(from[0]), "rZ"(from[1]), "r"(to)); break; case 1: __raw_writeq(*from, to); break; default: BUILD_BUG(); } } void __iowrite64_copy_full(void __iomem *to, const void *from, size_t count); static inline void __const_iowrite64_copy(void __iomem *to, const void *from, size_t count) { if (count == 8 || count == 4 || count == 2 || count == 1) { __const_memcpy_toio_aligned64(to, from, count); dgh(); } else { __iowrite64_copy_full(to, from, count); } } #define __iowrite64_copy(to, from, count) \ (__builtin_constant_p(count) ? \ __const_iowrite64_copy(to, from, count) : \ __iowrite64_copy_full(to, from, count)) And the out of line __iowrite64_copy_full() generates good assembly that loops 8/4/2/1 sized blocks. I was going to send it out yesterday but am waiting for some conclusion on the STP. https://github.com/jgunthorpe/linux/commits/mlx5_wc/ > void __iowrite64_copy(void __iomem *to, const void *from, > size_t count) > { > u64 __iomem *dst = to; > const u64 *src = from; > const u64 *end = src + count; > > /* > * Try a 64-byte write, the CPUs tend to write-combine them. > */ > if (__builtin_contant_p(count) && count == 8) { > __raw_writeq(*src, dst); > __raw_writeq(*(src + 1), dst + 1); > __raw_writeq(*(src + 2), dst + 2); > __raw_writeq(*(src + 3), dst + 3); > __raw_writeq(*(src + 4), dst + 4); > __raw_writeq(*(src + 5), dst + 5); > __raw_writeq(*(src + 6), dst + 6); > __raw_writeq(*(src + 7), dst + 7); > return; > } I already looked at this, clang with the "Qo" constraint does: ffffffc08086e6ec: f9400029 ldr x9, [x1] ffffffc08086e6f0: 91002008 add x8, x0, #0x8 ffffffc08086e6f4: f9000009 str x9, [x0] ffffffc08086e6f8: f9400429 ldr x9, [x1, #8] ffffffc08086e6fc: f9000109 str x9, [x8] ffffffc08086e700: 91004008 add x8, x0, #0x10 ffffffc08086e704: f9400829 ldr x9, [x1, #16] ffffffc08086e708: f9000109 str x9, [x8] ffffffc08086e70c: 91006008 add x8, x0, #0x18 ffffffc08086e710: f9400c29 ldr x9, [x1, #24] ffffffc08086e714: f9000109 str x9, [x8] ffffffc08086e718: 91008008 add x8, x0, #0x20 ffffffc08086e71c: f9401029 ldr x9, [x1, #32] ffffffc08086e720: f9000109 str x9, [x8] ffffffc08086e724: 9100a008 add x8, x0, #0x28 ffffffc08086e728: f9401429 ldr x9, [x1, #40] ffffffc08086e72c: f9000109 str x9, [x8] ffffffc08086e730: 9100c008 add x8, x0, #0x30 ffffffc08086e734: f9401829 ldr x9, [x1, #48] ffffffc08086e738: f9000109 str x9, [x8] ffffffc08086e73c: f9401c28 ldr x8, [x1, #56] ffffffc08086e740: 9100e009 add x9, x0, #0x38 ffffffc08086e744: f9000128 str x8, [x9] Which is not good. Gcc is a better, but not perfect. > What we don't have is inlining of __iowrite64_copy() but if we need that > we can move away from a weak symbol to a static inline. Yes I did this as well. It helps s390 and x86 nicely too. > Give this a go and see if it you get write-combining in your hardware. > If the loads interleaves with stores get in the way, maybe we can resort > to inline asm. For reference the actual assembly (see post_send_nop()) that fails is: 13534: d503201f nop 13538: 93407ea1 sxtw x1, w21 1353c: f100403f cmp x1, #0x10 13540: 54000488 b.hi 135d0 <post_send_nop.isra.0+0x260> // b.pmore 13544: a9408a63 ldp x3, x2, [x19, #8] 13548: f84086c4 ldr x4, [x22], #8 1354c: f9400042 ldr x2, [x2] 13550: 8b030283 add x3, x20, x3 13554: 8b030042 add x2, x2, x3 13558: f9000044 str x4, [x2] 1355c: 91002294 add x20, x20, #0x8 13560: 11000ab5 add w21, w21, #0x2 13564: f101029f cmp x20, #0x40 13568: 54fffe81 b.ne 13538 <post_send_nop.isra.0+0x1c8> // b.any 1356c: d50320df hint #0x6 Not very good code the compiler wrote (the main issue is that it reloads the dest pointer every iteration), but still, all those loads are coming from memory that was recently touched so should be in-cache most of the time. So it isn't like we are sitting around waiting for a lengthy dcache fill and timing out the WC buffer. However, it is 136 instructions, so it feels like the issue may be the write combining buffer auto-flushes in less. Maybe it auto-flushes after 128/64/32/16/8 cycles now. I know there has been a tension to reduce WC latency vs maximum aggregation. The suggestion that it should not have any interleaving instructions and use STP came from our CPU architecture team. The assembly I have been able to get tested from this series that did works is this: ffffffc08086ec84: d5033e9f dsb st ffffffc08086ec88: f941de6b ldr x11, [x19, #952] ffffffc08086ec8c: f941da6c ldr x12, [x19, #944] ffffffc08086ec90: f940016b ldr x11, [x11] ffffffc08086ec94: 8b0c016b add x11, x11, x12 ffffffc08086ec98: a9002969 stp x9, x10, [x11] ffffffc08086ec9c: a9012168 stp x8, x8, [x11, #16] ffffffc08086eca0: a9022168 stp x8, x8, [x11, #32] ffffffc08086eca4: a9032168 stp x8, x8, [x11, #48] ffffffc08086eca8: d50320df hint #0x6 The revised __iowrite64_copy() version also creates this assembly. The ST4 based thing in userspace also works. Remember there are two related topics here.. mlx5 would like high frequency of large TLP generation, but doesn't care about raw performance. If the 24 instructions clang generates does that then great. hns/broadcom/others need the large TLP and care about performance. In that case the stp block is the best we can do in the kernel as st4 is off the table. I would like the architecture code to do a good job for performance since it is a generic API for all drivers. Regarding the 8x STR option, I have to get that tested. Jason