On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote:
> * Guenter Roeck <linux@xxxxxxxxxxxx>:
> > hppa 64-bit systems calculate the IPv6 checksum using 64-bit add
> > operations. The last add folds protocol and length fields into the 64-bit
> > result. While unlikely, this operation can overflow. The overflow can be
> > triggered with a code sequence such as the following.
> >
> > 	/* try to trigger massive overflows */
> > 	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> > 	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> > 				      (struct in6_addr *)tmp_buf,
> > 				      0xffff, 0xff, 0xffffffff);
> >
> > Fix the problem by adding any overflows from the final add operation into
> > the calculated checksum. Fortunately, we can do this without additional
> > cost by replacing the add operation used to fold the checksum into 32 bit
> > with "add,dc" to add in the missing carry.
>
> Guenter,
>
> Thanks for your patch for 32- and 64-bit systems.
> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
> checksumming.
> The patch below converts the code to use standard C-coding (taken from the
> s390 header file) and it survives the v8 lib/checksum patch.
>
> Opinions?
>
> Helge
>
> Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines
>
> Guenter noticed that the 32- and 64-bit checksum functions may calculate
> wrong values under certain circumstances. He fixed it by using corrected
> carry-flag additions, but overall I think it's better to convert away
> from inline assembly to generic C-coding. That way the code is much
> cleaner and the compiler can do much better optimizations based on the
> target architecture (32- vs. 64-bit). Furthermore, the compiler nowadays
> generates much better code, so inline assembly usually won't give any
> benefit any longer.
>
> Signed-off-by: Helge Deller <deller@xxxxxx>
> Noticed-by: Guenter Roeck <linux@xxxxxxxxxxxx>
>
> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
> index 3c43baca7b39..c72f14536353 100644
> --- a/arch/parisc/include/asm/checksum.h
> +++ b/arch/parisc/include/asm/checksum.h
> @@ -18,160 +18,93 @@
>   */
>  extern __wsum csum_partial(const void *, int, __wsum);
>  
> +
>  /*
> - * Optimized for IP headers, which always checksum on 4 octet boundaries.
> - *
> - * Written by Randolph Chung <tausq@xxxxxxxxxx>, and then mucked with by
> - * LaMont Jones <lamont@xxxxxxxxxx>
> + * Fold a partial checksum without adding pseudo headers.
>   */
> -static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> +static inline __sum16 csum_fold(__wsum sum)
>  {
> -	unsigned int sum;
> -	unsigned long t0, t1, t2;
> +	u32 csum = (__force u32) sum;
>  
> -	__asm__ __volatile__ (
> -"	ldws,ma		4(%1), %0\n"
> -"	addib,<=	-4, %2, 2f\n"
> -"\n"
> -"	ldws		4(%1), %4\n"
> -"	ldws		8(%1), %5\n"
> -"	add		%0, %4, %0\n"
> -"	ldws,ma		12(%1), %3\n"
> -"	addc		%0, %5, %0\n"
> -"	addc		%0, %3, %0\n"
> -"1:	ldws,ma		4(%1), %3\n"
> -"	addib,<		0, %2, 1b\n"
> -"	addc		%0, %3, %0\n"
> -"\n"
> -"	extru		%0, 31, 16, %4\n"
> -"	extru		%0, 15, 16, %5\n"
> -"	addc		%4, %5, %0\n"
> -"	extru		%0, 15, 16, %5\n"
> -"	add		%0, %5, %0\n"
> -"	subi		-1, %0, %0\n"
> -"2:\n"
> -	: "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2)
> -	: "1" (iph), "2" (ihl)
> -	: "memory");
> -
> -	return (__force __sum16)sum;
> +	csum += (csum >> 16) | (csum << 16);
> +	csum >>= 16;
> +	return (__force __sum16) ~csum;
>  }

For all of my analysis I am using gcc 12.3.0. We can do better than this!
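
For reference, the asm-generic csum_fold I compare against below is, if I
remember it correctly, essentially the following (ror32() comes from
<linux/bitops.h>; please double-check include/asm-generic/checksum.h rather
than trusting my memory):

/* Approximate quote of the asm-generic csum_fold, from memory. */
static inline __sum16 csum_fold(__wsum csum)
{
	u32 sum = (__force u32)csum;

	return (__force __sum16)((~sum - ror32(sum, 16)) >> 16);
}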
By inspection, your csum_fold looks like a performance regression. The generic
version in include/asm-generic/checksum.h is better, so it should be used
instead. I am not familiar with hppa assembly, but here is the assembly for
the original csum_fold:

0000000000000000 <csum_fold>:
   0:	08 03 02 41 	copy r3,r1
   4:	08 1e 02 43 	copy sp,r3
   8:	73 c1 00 88 	std,ma r1,40(sp)
   c:	d3 5a 09 fc 	shrpw r26,r26,16,ret0
  10:	0b 9a 0a 1c 	add,l r26,ret0,ret0
  14:	0b 80 09 9c 	uaddcm r0,ret0,ret0
  18:	db 9c 09 f0 	extrd,u,* ret0,47,16,ret0
  1c:	34 7e 00 80 	ldo 40(r3),sp
  20:	e8 40 d0 00 	bve (rp)
  24:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the assembly for the generic version:

0000000000000000 <csum_fold_generic>:
   0:	08 03 02 41 	copy r3,r1
   4:	08 1e 02 43 	copy sp,r3
   8:	73 c1 00 88 	std,ma r1,40(sp)
   c:	0b 40 09 9c 	uaddcm r0,r26,ret0
  10:	d3 5a 09 fa 	shrpw r26,r26,16,r26
  14:	0b 5c 04 1c 	sub ret0,r26,ret0
  18:	db 9c 09 f0 	extrd,u,* ret0,47,16,ret0
  1c:	34 7e 00 80 	ldo 40(r3),sp
  20:	e8 40 d0 00 	bve (rp)
  24:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the assembly for yours:

0000000000000028 <csum_fold_new>:
  28:	08 03 02 41 	copy r3,r1
  2c:	08 1e 02 43 	copy sp,r3
  30:	73 c1 00 88 	std,ma r1,40(sp)
  34:	d3 5a 09 fc 	shrpw r26,r26,16,ret0
  38:	0b 9a 0a 1c 	add,l r26,ret0,ret0
  3c:	d3 9c 19 f0 	extrw,u ret0,15,16,ret0
  40:	0b 80 09 9c 	uaddcm r0,ret0,ret0
  44:	db 9c 0b f0 	extrd,u,* ret0,63,16,ret0
  48:	34 7e 00 80 	ldo 40(r3),sp
  4c:	e8 40 d0 00 	bve (rp)
  50:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  54:	00 00 00 00 	break 0,0

The assembly is almost the same for the generic and the original code, except
that rearranging the shift and the add allows the addition to become a
subtraction, which shortens the dependency chain on some architectures (it
doesn't look like it changes much here). Your new code, however, introduces an
additional extrw instruction.

> 
>  /*
> - * Fold a partial checksum
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksums on 4 octet boundaries.
>   */
> -static inline __sum16 csum_fold(__wsum csum)
> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  {
> -	u32 sum = (__force u32)csum;
> -	/* add the swapped two 16-bit halves of sum,
> -	   a possible carry from adding the two 16-bit halves,
> -	   will carry from the lower half into the upper half,
> -	   giving us the correct sum in the upper half. */
> -	sum += (sum << 16) + (sum >> 16);
> -	return (__force __sum16)(~sum >> 16);
> +	__u64 csum = 0;
> +	__u32 *ptr = (u32 *)iph;
> +
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	ihl -= 4;
> +	while (ihl--)
> +		csum += *ptr++;
> +	csum += (csum >> 32) | (csum << 32);
> +	return csum_fold((__force __wsum)(csum >> 32));
>  }

This doesn't leverage add with carry well, and it makes the code dramatically
larger than the original assembly, which I assume correlates with increased
execution time.
This is the original assembly in 64-bit (almost the same in 32-bit):

0000000000000028 <ip_fast_csum>:
  28:	08 03 02 41 	copy r3,r1
  2c:	08 1e 02 43 	copy sp,r3
  30:	73 c1 00 88 	std,ma r1,40(sp)
  34:	0f 48 10 bc 	ldw,ma 4(r26),ret0
  38:	a7 39 60 70 	addib,<= -4,r25,78 <ip_fast_csum+0x50>
  3c:	0f 48 10 93 	ldw 4(r26),r19
  40:	0f 50 10 94 	ldw 8(r26),r20
  44:	0a 7c 06 1c 	add ret0,r19,ret0
  48:	0f 58 10 bf 	ldw,ma c(r26),r31
  4c:	0a 9c 07 1c 	add,c ret0,r20,ret0
  50:	0b fc 07 1c 	add,c ret0,r31,ret0
  54:	0f 48 10 bf 	ldw,ma 4(r26),r31
  58:	a7 20 5f ed 	addib,< 0,r25,54 <ip_fast_csum+0x2c>
  5c:	0b fc 07 1c 	add,c ret0,r31,ret0
  60:	d3 93 1b f0 	extrw,u ret0,31,16,r19
  64:	d3 94 19 f0 	extrw,u ret0,15,16,r20
  68:	0a 93 07 1c 	add,c r19,r20,ret0
  6c:	d3 94 19 f0 	extrw,u ret0,15,16,r20
  70:	0a 9c 06 1c 	add ret0,r20,ret0
  74:	97 9c 07 ff 	subi -1,ret0,ret0
  78:	db 9c 0b f0 	extrd,u,* ret0,63,16,ret0
  7c:	34 7e 00 80 	ldo 40(r3),sp
  80:	e8 40 d0 00 	bve (rp)
  84:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the 64-bit assembly of your proposal:

0000000000000058 <ip_fast_csum_new>:
  58:	08 03 02 41 	copy r3,r1
  5c:	0f c2 12 c1 	std rp,-10(sp)
  60:	08 1e 02 43 	copy sp,r3
  64:	73 c1 01 08 	std,ma r1,80(sp)
  68:	37 39 3f f9 	ldo -4(r25),r25
  6c:	0c 64 12 d0 	std r4,8(r3)
  70:	0f 48 10 9f 	ldw 4(r26),r31
  74:	db 39 0b e0 	extrd,u,* r25,63,32,r25
  78:	37 5a 00 20 	ldo 10(r26),r26
  7c:	0f 41 10 9c 	ldw -10(r26),ret0
  80:	0b fc 0a 1c 	add,l ret0,r31,ret0
  84:	0f 51 10 9f 	ldw -8(r26),r31
  88:	0b 9f 0a 1f 	add,l r31,ret0,r31
  8c:	0f 59 10 9c 	ldw -4(r26),ret0
  90:	0b fc 0a 1c 	add,l ret0,r31,ret0
  94:	37 3f 3f ff 	ldo -1(r25),r31
  98:	8f ff 20 68 	cmpib,<> -1,r31,d4 <ip_fast_csum_new+0x7c>
  9c:	db f9 0b e0 	extrd,u,* r31,63,32,r25
  a0:	d3 9c 07 fa 	shrpd,* ret0,ret0,32,r26
  a4:	37 dd 3f a1 	ldo -30(sp),ret1
  a8:	0b 9a 0a 1a 	add,l r26,ret0,r26
  ac:	00 00 14 a1 	mfia r1
  b0:	28 20 00 00 	addil L%0,r1,r1
  b4:	34 21 00 00 	ldo 0(r1),r1
  b8:	e8 20 f0 00 	bve,l (r1),rp
  bc:	db 5a 03 e0 	extrd,u,* r26,31,32,r26
  c0:	0c 70 10 c4 	ldd 8(r3),r4
  c4:	0c 61 10 c2 	ldd -10(r3),rp
  c8:	34 7e 00 80 	ldo 40(r3),sp
  cc:	e8 40 d0 00 	bve (rp)
  d0:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  d4:	0f 40 10 9f 	ldw 0(r26),r31
  d8:	37 5a 00 08 	ldo 4(r26),r26
  dc:	e8 1f 1f 65 	b,l 94 <ip_fast_csum_new+0x3c>,r0
  e0:	0b fc 0a 1c 	add,l ret0,r31,ret0
  e4:	00 00 00 00 	break 0,0

The 32-bit assembly is even longer.

Maybe you can get some inspiration from my riscv implementation in
arch/riscv/include/asm/checksum.h. You can swap out the line:

	csum += csum < ((const unsigned int *)iph)[pos];

with the addc macro defined in arch/parisc/lib/checksum.c. My guess is that
this will improve the code, but I haven't checked.
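
To make that concrete, here is a rough and untested sketch of the shape I have
in mind. The csum_addc() helper and the function name are made up for
illustration (the real addc macro lives in arch/parisc/lib/checksum.c), and I
have not checked what gcc 12.3 actually produces for it:

/*
 * Untested sketch: accumulate the header with 32-bit end-around-carry adds
 * instead of a __u64 accumulator, then fold once at the end.
 */
static inline u32 csum_addc(u32 sum, u32 addend)
{
	__asm__("	add	%1, %0, %0\n"
		"	addc	%%r0, %0, %0\n"	/* fold the carry back in */
		: "+r" (sum)
		: "r" (addend));
	return sum;
}

static inline __sum16 ip_fast_csum_sketch(const void *iph, unsigned int ihl)
{
	const u32 *ptr = iph;
	u32 csum = *ptr++;

	while (--ihl)
		csum = csum_addc(csum, *ptr++);

	return csum_fold((__force __wsum)csum);
}

If the loop gets in the way of scheduling, the first four words could still be
added explicitly the way the current assembly does, but that is a detail.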

> -
> -static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> -					__u32 len, __u8 proto,
> -					__wsum sum)
> +
> +/*
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 32-bit checksum.
> + */
> +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
> +					__u8 proto, __wsum sum)
>  {
> -	__asm__(
> -	"	add  %1, %0, %0\n"
> -	"	addc %2, %0, %0\n"
> -	"	addc %3, %0, %0\n"
> -	"	addc %%r0, %0, %0\n"
> -	: "=r" (sum)
> -	: "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum));
> -	return sum;
> +	__u64 csum = (__force __u64)sum;
> +
> +	csum += (__force __u32)saddr;
> +	csum += (__force __u32)daddr;
> +	csum += len;
> +	csum += proto;
> +	csum += (csum >> 32) | (csum << 32);
> +	return (__force __wsum)(csum >> 32);
>  }

The assembly from the original:

0000000000000088 <csum_tcpudp_nofold>:
  88:	08 03 02 41 	copy r3,r1
  8c:	08 1e 02 43 	copy sp,r3
  90:	73 c1 00 88 	std,ma r1,40(sp)
  94:	da f7 0b f8 	extrd,u,* r23,63,8,r23
  98:	0a f8 0a 17 	add,l r24,r23,r23
  9c:	0a d9 06 16 	add r25,r22,r22
  a0:	0a da 07 16 	add,c r26,r22,r22
  a4:	0a d7 07 16 	add,c r23,r22,r22
  a8:	0a c0 07 16 	add,c r0,r22,r22
  ac:	da dc 0b e0 	extrd,u,* r22,63,32,ret0
  b0:	34 7e 00 80 	ldo 40(r3),sp
  b4:	e8 40 d0 00 	bve (rp)
  b8:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  bc:	00 00 00 00 	break 0,0

The 64-bit assembly from your proposal:

0000000000000140 <csum_tcpudp_nofold_new>:
 140:	08 03 02 41 	copy r3,r1
 144:	08 1e 02 43 	copy sp,r3
 148:	73 c1 00 88 	std,ma r1,40(sp)
 14c:	db 5a 0b e0 	extrd,u,* r26,63,32,r26
 150:	db 39 0b e0 	extrd,u,* r25,63,32,r25
 154:	da f7 0b f8 	extrd,u,* r23,63,8,r23
 158:	db 18 0b e0 	extrd,u,* r24,63,32,r24
 15c:	da d6 0b e0 	extrd,u,* r22,63,32,r22
 160:	0a f8 0a 18 	add,l r24,r23,r24
 164:	0b 38 0a 18 	add,l r24,r25,r24
 168:	0b 58 0a 18 	add,l r24,r26,r24
 16c:	0a d8 0a 16 	add,l r24,r22,r22
 170:	d2 d6 07 fc 	shrpd,* r22,r22,32,ret0
 174:	0a dc 0a 1c 	add,l ret0,r22,ret0
 178:	db 9c 03 e0 	extrd,u,* ret0,31,32,ret0
 17c:	34 7e 00 80 	ldo 40(r3),sp
 180:	e8 40 d0 00 	bve (rp)
 184:	53 c3 3f 8d 	ldd,mb -40(sp),r3

There are a lot of extra extrd instructions, and again it is really bad on
32-bit parisc. Maybe there is a good way to represent this in C, but the
existing assembly for this function is already simple and clean. (An untested
C sketch of one possibility is further down.)

> 
>  /*
> - * computes the checksum of the TCP/UDP pseudo-header
> - * returns a 16-bit checksum, already complemented
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 16-bit checksum, already complemented.
>   */
> -static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> -					__u32 len, __u8 proto,
> -					__wsum sum)
> +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
> +					__u8 proto, __wsum sum)
>  {
> -	return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum));
> +	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
>  }
> 

The implementation of csum_tcpudp_magic is the same as the generic one, so the
generic version should be used instead.

>  /*
> - * this routine is used for miscellaneous IP-like checksums, mainly
> - * in icmp.c
> + * Used for miscellaneous IP-like checksums, mainly icmp.
>   */
> -static inline __sum16 ip_compute_csum(const void *buf, int len)
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
>  {
> -	return csum_fold (csum_partial(buf, len, 0));
> +	return csum_fold(csum_partial(buff, len, 0));
>  }

The generic version is better than this, so it should be used instead.
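
Coming back to csum_tcpudp_nofold for a moment, this is the kind of untested C
sketch I alluded to above: stay in 32-bit arithmetic and fold the carry back
in after each add (the riscv-style "csum += csum < x" trick) instead of
widening to __u64. The name is made up so it doesn't clash with the real
function, and I have not looked at what gcc generates for it on parisc:

/*
 * Untested sketch only: 32-bit accumulation with end-around carry instead of
 * a __u64 accumulator. Whether this beats the existing four-instruction asm
 * is an open question.
 */
static inline __wsum csum_tcpudp_nofold_sketch(__be32 saddr, __be32 daddr,
					       __u32 len, __u8 proto,
					       __wsum sum)
{
	u32 csum = (__force u32)sum;

	csum += (__force u32)saddr;
	csum += csum < (__force u32)saddr;	/* fold carry back in */
	csum += (__force u32)daddr;
	csum += csum < (__force u32)daddr;
	csum += len;
	csum += csum < len;
	csum += proto;
	csum += csum < proto;

	return (__force __wsum)csum;
}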
> 
> -
>  #define _HAVE_ARCH_IPV6_CSUM
> -static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> -					   const struct in6_addr *daddr,
> -					   __u32 len, __u8 proto,
> -					   __wsum sum)
> +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> +				       const struct in6_addr *daddr,
> +				       __u32 len, __u8 proto, __wsum csum)
>  {
> -	unsigned long t0, t1, t2, t3;
> -
> -	len += proto;	/* add 16-bit proto + len */
> -
> -	__asm__ __volatile__ (
> -
> -#if BITS_PER_LONG > 32
> -
> -	/*
> -	** We can execute two loads and two adds per cycle on PA 8000.
> -	** But add insn's get serialized waiting for the carry bit.
> -	** Try to keep 4 registers with "live" values ahead of the ALU.
> -	*/
> -
> -"	ldd,ma		8(%1), %4\n"	/* get 1st saddr word */
> -"	ldd,ma		8(%2), %5\n"	/* get 1st daddr word */
> -"	add		%4, %0, %0\n"
> -"	ldd,ma		8(%1), %6\n"	/* 2nd saddr */
> -"	ldd,ma		8(%2), %7\n"	/* 2nd daddr */
> -"	add,dc		%5, %0, %0\n"
> -"	add,dc		%6, %0, %0\n"
> -"	add,dc		%7, %0, %0\n"
> -"	add,dc		%3, %0, %0\n"	/* fold in proto+len | carry bit */
> -"	extrd,u		%0, 31, 32, %4\n"	/* copy upper half down */
> -"	depdi		0, 31, 32, %0\n"	/* clear upper half */
> -"	add		%4, %0, %0\n"	/* fold into 32-bits */
> -"	addc		0, %0, %0\n"	/* add carry */
> -
> -#else
> -
> -	/*
> -	** For PA 1.x, the insn order doesn't matter as much.
> -	** Insn stream is serialized on the carry bit here too.
> -	** result from the previous operation (eg r0 + x)
> -	*/
> -"	ldw,ma		4(%1), %4\n"	/* get 1st saddr word */
> -"	ldw,ma		4(%2), %5\n"	/* get 1st daddr word */
> -"	add		%4, %0, %0\n"
> -"	ldw,ma		4(%1), %6\n"	/* 2nd saddr */
> -"	addc		%5, %0, %0\n"
> -"	ldw,ma		4(%2), %7\n"	/* 2nd daddr */
> -"	addc		%6, %0, %0\n"
> -"	ldw,ma		4(%1), %4\n"	/* 3rd saddr */
> -"	addc		%7, %0, %0\n"
> -"	ldw,ma		4(%2), %5\n"	/* 3rd daddr */
> -"	addc		%4, %0, %0\n"
> -"	ldw,ma		4(%1), %6\n"	/* 4th saddr */
> -"	addc		%5, %0, %0\n"
> -"	ldw,ma		4(%2), %7\n"	/* 4th daddr */
> -"	addc		%6, %0, %0\n"
> -"	addc		%7, %0, %0\n"
> -"	addc		%3, %0, %0\n"	/* fold in proto+len, catch carry */
> -
> -#endif
> -	: "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len),
> -	  "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3)
> -	: "0" (sum), "1" (saddr), "2" (daddr), "3" (len)
> -	: "memory");
> -	return csum_fold(sum);
> +	__u64 sum = (__force __u64)csum;
> +
> +	sum += (__force __u32)saddr->s6_addr32[0];
> +	sum += (__force __u32)saddr->s6_addr32[1];
> +	sum += (__force __u32)saddr->s6_addr32[2];
> +	sum += (__force __u32)saddr->s6_addr32[3];
> +	sum += (__force __u32)daddr->s6_addr32[0];
> +	sum += (__force __u32)daddr->s6_addr32[1];
> +	sum += (__force __u32)daddr->s6_addr32[2];
> +	sum += (__force __u32)daddr->s6_addr32[3];
> +	sum += len;
> +	sum += proto;
> +	sum += (sum >> 32) | (sum << 32);
> +	return csum_fold((__force __wsum)(sum >> 32));
>  }
>  
>  #endif
> -

Similar story again here: the add with carry is not well translated into C,
resulting in significantly worse assembly. Using __u64 seems to be a big
contributing factor to why the 32-bit assembly is worse. I am not sure of the
best way to represent this cleanly in C; add with carry does not seem to be
well understood by GCC 12.3. These functions are generally heavily optimized
on every architecture, so I think it is worth defaulting to assembly if you
aren't able to achieve comparable performance in C.

- Charlie