CC linux-m68k On Sun, Jul 2, 2023 at 5:19 PM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote: > On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein > <tip-bot2@xxxxxxxxxxxxx> wrote: > > The following commit has been merged into the x86/misc branch of tip: > > > > Commit-ID: 688eb8191b475db5acfd48634600b04fd3dda9ad > > Gitweb: https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad > > Author: Noah Goldstein <goldstein.w.n@xxxxxxxxx> > > AuthorDate: Wed, 10 May 2023 20:10:02 -05:00 > > Committer: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> > > CommitterDate: Thu, 25 May 2023 10:55:18 -07:00 > > > > x86/csum: Improve performance of `csum_partial` > > > > 1) Add special case for len == 40 as that is the hottest value. The > > nets a ~8-9% latency improvement and a ~30% throughput improvement > > in the len == 40 case. > > > > 2) Use multiple accumulators in the 64-byte loop. This dramatically > > improves ILP and results in up to a 40% latency/throughput > > improvement (better for more iterations). > > > > Results from benchmarking on Icelake. Times measured with rdtsc() > > len lat_new lat_old r tput_new tput_old r > > 8 3.58 3.47 1.032 3.58 3.51 1.021 > > 16 4.14 4.02 1.028 3.96 3.78 1.046 > > 24 4.99 5.03 0.992 4.23 4.03 1.050 > > 32 5.09 5.08 1.001 4.68 4.47 1.048 > > 40 5.57 6.08 0.916 3.05 4.43 0.690 > > 48 6.65 6.63 1.003 4.97 4.69 1.059 > > 56 7.74 7.72 1.003 5.22 4.95 1.055 > > 64 6.65 7.22 0.921 6.38 6.42 0.994 > > 96 9.43 9.96 0.946 7.46 7.54 0.990 > > 128 9.39 12.15 0.773 8.90 8.79 1.012 > > 200 12.65 18.08 0.699 11.63 11.60 1.002 > > 272 15.82 23.37 0.677 14.43 14.35 1.005 > > 440 24.12 36.43 0.662 21.57 22.69 0.951 > > 952 46.20 74.01 0.624 42.98 53.12 0.809 > > 1024 47.12 78.24 0.602 46.36 58.83 0.788 > > 1552 72.01 117.30 0.614 71.92 96.78 0.743 > > 2048 93.07 153.25 0.607 93.28 137.20 0.680 > > 2600 114.73 194.30 0.590 114.28 179.32 0.637 > > 3608 156.34 268.41 0.582 154.97 254.02 0.610 > > 4096 175.01 304.03 0.576 175.89 292.08 0.602 > > > > There is no such thing as a free lunch, however, and the special case > > for len == 40 does add overhead to the len != 40 cases. This seems to > > amount to be ~5% throughput and slightly less in terms of latency. > > > > Testing: > > Part of this change is a new kunit test. The tests check all > > alignment X length pairs in [0, 64) X [0, 512). > > There are three cases. > > 1) Precomputed random inputs/seed. The expected results where > > generated use the generic implementation (which is assumed to be > > non-buggy). > > 2) An input of all 1s. The goal of this test is to catch any case > > a carry is missing. > > 3) An input that never carries. The goal of this test si to catch > > any case of incorrectly carrying. > > > > More exhaustive tests that test all alignment X length pairs in > > [0, 8192) X [0, 8192] on random data are also available here: > > https://github.com/goldsteinn/csum-reproduction > > > > The reposity also has the code for reproducing the above benchmark > > numbers. > > > > Signed-off-by: Noah Goldstein <goldstein.w.n@xxxxxxxxx> > > Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> > > Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum: > Improve performance of `csum_partial`") in linus/master stable/master > > > Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com > > This does not seem to be a message sent to a public mailing list > archived at lore (yet). > > On m68k (ARAnyM): > > KTAP version 1 > # Subtest: checksum > 1..3 > # test_csum_fixed_random_inputs: ASSERTION FAILED at > lib/checksum_kunit.c:243 > Expected result == expec, but > result == 54991 (0xd6cf) > expec == 33316 (0x8224) > not ok 1 test_csum_fixed_random_inputs > # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267 > Expected result == expec, but > result == 255 (0xff) > expec == 65280 (0xff00) > > Endianness issue in the test? > > not ok 2 test_csum_all_carry_inputs > # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306 > Expected result == expec, but > result == 64515 (0xfc03) > expec == 0 (0x0) > not ok 3 test_csum_no_carry_inputs > # checksum: pass:0 fail:3 skip:0 total:3 > # Totals: pass:0 fail:3 skip:0 total:3 > not ok 1 checksum As this test already exposed bugs on x86 and UML: https://lore.kernel.org/all/20230704083206.693155-2-davidgow@xxxxxxxxxx https://lore.kernel.org/linux-um/20230704083022.692368-2-davidgow@xxxxxxxxxx there's probably a real bug in the m68k csumming code. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds