On Sun, 27 Jun 2021 02:19:59 +0200 Matteo Croce <mcroce@xxxxxxxxxxxxxxxxxxx> wrote: > On Fri, Jun 25, 2021 at 7:45 PM Nick Desaulniers > <ndesaulniers@xxxxxxxxxx> wrote: > > > > On Thu, Jun 24, 2021 at 6:02 PM Matteo Croce > > <mcroce@xxxxxxxxxxxxxxxxxxx> wrote: > > > > > > From: Matteo Croce <mcroce@xxxxxxxxxxxxx> > > > > > > Rewrite the generic mem{cpy,move,set} so that memory is accessed > > > with the widest size possible, but without doing unaligned > > > accesses. > > > > > > This was originally posted as C string functions for RISC-V[1], > > > but as there was no specific RISC-V code, it was proposed for the > > > generic lib/string.c implementation. > > > > > > Tested on RISC-V and on x86_64 by undefining > > > __HAVE_ARCH_MEM{CPY,SET,MOVE} and HAVE_EFFICIENT_UNALIGNED_ACCESS. > > > > > > Further testing on big endian machines will be appreciated, as I > > > don't have such hardware at the moment. > > > > Hi Matteo, > > Neat patches. Do you have you any benchmark data showing the > > claimed improvements? Is it worthwhile to define these only when > > CC_OPTIMIZE_FOR_PERFORMANCE/CC_OPTIMIZE_FOR_PERFORMANCE_O3 are > > defined, not CC_OPTIMIZE_FOR_SIZE? I'd be curious to know the delta > > in ST_SIZE of these functions otherwise. > > > > I compared the current versions with the new one with bloat-o-meter, > the kernel grows by ~400 bytes on x86_64 and RISC-V > > x86_64 > > $ scripts/bloat-o-meter vmlinux.orig vmlinux > add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421) > Function old new delta > memcpy 29 351 +322 > memset 29 117 +88 > strlcat 68 78 +10 > strlcpy 50 57 +7 > memmove 56 50 -6 > Total: Before=8556964, After=8557385, chg +0.00% > > RISC-V > > $ scripts/bloat-o-meter vmlinux.orig vmlinux > add/remove: 0/0 grow/shrink: 4/2 up/down: 432/-36 (396) > Function old new delta > memcpy 36 324 +288 > memset 32 148 +116 > strlcpy 116 132 +16 > strscpy_pad 84 96 +12 > strlcat 176 164 -12 > memmove 76 52 -24 > Total: Before=1225371, After=1225767, chg +0.03% > > I will post benchmarks made on a RISC-V machine which can't handle > unaligned accesses, and it will be the first user of the new > functions. > > > For big endian, you ought to be able to boot test in QEMU. I think > > you'd find out pretty quickly if any of the above had issues. > > (Enabling KASAN is probably also a good idea for a test, too). Check > > out > > https://github.com/ClangBuiltLinux/boot-utils > > for ready made images and scripts for launching various > > architectures and endiannesses. > > > > Will do! > I finally made the benchmarks on RISC-V. The current byte-at-time memcpy() always copy 74 Mbyte/sec, no matter the alignment of the buffers, while my implementation do 114 Mb/s on two aligned buffers and 107 Mb/s when need to shift and merge words. For memset(), the current byte-at-time implementation always writes 140 Mb/s, my word-wise implementation always copies 241 Mb/s. Both implementations have the same performance with aligned and unaligned buffers. The memcpy() test is a simple loop which calls memcpy with different offsets on a 32MB buffer: #define NUM_PAGES (1 << (MAX_ORDER + 2)) #define PG_SIZE (PAGE_SIZE * NUM_PAGES) page1 = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0); page2 = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0); src = page_to_virt(page1); dst = page_to_virt(page2); for (i = 0; i < sizeof(void*); i++) { for (j = 0; j < sizeof(void*); j++) { t0 = ktime_get(); memcpy(dst + i, src + j, PG_SIZE - max(i, j)); t1 = ktime_get(); printk("Strings selftest: memcpy(dst+%d, src+%d), distance %lu: %llu Mb/s\n", i, j, (j - i) % sizeof(long), PG_SIZE * (1000000000l / 1048576l) / (t1-t0)); } printk("\n"); } Similarly, the memset() one: page = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0); dst = page_to_virt(page); for (i = 0; i < sizeof(void*); i++) { t0 = ktime_get(); memset(dst + i, 0, PG_SIZE - i); t1 = ktime_get(); printk("Strings selftest: memset(dst+%d): %llu Mb/s\n", i, PG_SIZE * (1000000000l / 1048576l) / (t1-t0)); } The results for the current one are: [ 27.893931] Strings selftest: memcpy testing with size: 32 Mb [ 28.315272] Strings selftest: memcpy(dst+0, src+0), distance 0: 75 Mb/s [ 28.736485] Strings selftest: memcpy(dst+0, src+1), distance 1: 75 Mb/s [ 29.156826] Strings selftest: memcpy(dst+0, src+2), distance 2: 76 Mb/s [ 29.576199] Strings selftest: memcpy(dst+0, src+3), distance 3: 76 Mb/s [ 29.994360] Strings selftest: memcpy(dst+0, src+4), distance 4: 76 Mb/s [ 30.411766] Strings selftest: memcpy(dst+0, src+5), distance 5: 76 Mb/s [ 30.828124] Strings selftest: memcpy(dst+0, src+6), distance 6: 76 Mb/s [ 31.243514] Strings selftest: memcpy(dst+0, src+7), distance 7: 76 Mb/s [...] [ 52.077251] Strings selftest: memcpy(dst+7, src+0), distance 1: 74 Mb/s [ 52.508115] Strings selftest: memcpy(dst+7, src+1), distance 2: 74 Mb/s [ 52.939309] Strings selftest: memcpy(dst+7, src+2), distance 3: 74 Mb/s [ 53.370493] Strings selftest: memcpy(dst+7, src+3), distance 4: 74 Mb/s [ 53.801865] Strings selftest: memcpy(dst+7, src+4), distance 5: 74 Mb/s [ 54.233287] Strings selftest: memcpy(dst+7, src+5), distance 6: 74 Mb/s [ 54.664990] Strings selftest: memcpy(dst+7, src+6), distance 7: 74 Mb/s [ 55.086996] Strings selftest: memcpy(dst+7, src+7), distance 0: 75 Mb/s [ 55.109680] Strings selftest: memset testing with size: 32 Mb [ 55.337873] Strings selftest: memset(dst+0): 140 Mb/s [ 55.565905] Strings selftest: memset(dst+1): 140 Mb/s [ 55.793987] Strings selftest: memset(dst+2): 140 Mb/s [ 56.022140] Strings selftest: memset(dst+3): 140 Mb/s [ 56.250259] Strings selftest: memset(dst+4): 140 Mb/s [ 56.478283] Strings selftest: memset(dst+5): 140 Mb/s [ 56.706296] Strings selftest: memset(dst+6): 140 Mb/s [ 56.934335] Strings selftest: memset(dst+7): 140 Mb/s While for the proposed one: [ 38.843970] Strings selftest: memcpy testing with size: 32 Mb [ 39.124047] Strings selftest: memcpy(dst+0, src+0), distance 0: 114 Mb/s [ 39.421848] Strings selftest: memcpy(dst+0, src+1), distance 1: 107 Mb/s [ 39.719613] Strings selftest: memcpy(dst+0, src+2), distance 2: 107 Mb/s [ 40.017310] Strings selftest: memcpy(dst+0, src+3), distance 3: 107 Mb/s [ 40.314939] Strings selftest: memcpy(dst+0, src+4), distance 4: 107 Mb/s [ 40.612485] Strings selftest: memcpy(dst+0, src+5), distance 5: 107 Mb/s [ 40.910054] Strings selftest: memcpy(dst+0, src+6), distance 6: 107 Mb/s [ 41.207577] Strings selftest: memcpy(dst+0, src+7), distance 7: 107 Mb/s [...] [ 55.682517] Strings selftest: memcpy(dst+7, src+0), distance 1: 107 Mb/s [ 55.980262] Strings selftest: memcpy(dst+7, src+1), distance 2: 107 Mb/s [ 56.277862] Strings selftest: memcpy(dst+7, src+2), distance 3: 107 Mb/s [ 56.575514] Strings selftest: memcpy(dst+7, src+3), distance 4: 107 Mb/s [ 56.873142] Strings selftest: memcpy(dst+7, src+4), distance 5: 107 Mb/s [ 57.170840] Strings selftest: memcpy(dst+7, src+5), distance 6: 107 Mb/s [ 57.468553] Strings selftest: memcpy(dst+7, src+6), distance 7: 107 Mb/s [ 57.748231] Strings selftest: memcpy(dst+7, src+7), distance 0: 114 Mb/s [ 57.772721] Strings selftest: memset testing with size: 32 Mb [ 57.905358] Strings selftest: memset(dst+0): 241 Mb/s [ 58.037974] Strings selftest: memset(dst+1): 241 Mb/s [ 58.170619] Strings selftest: memset(dst+2): 241 Mb/s [ 58.303228] Strings selftest: memset(dst+3): 241 Mb/s [ 58.435808] Strings selftest: memset(dst+4): 241 Mb/s [ 58.568373] Strings selftest: memset(dst+5): 241 Mb/s [ 58.700968] Strings selftest: memset(dst+6): 241 Mb/s [ 58.833499] Strings selftest: memset(dst+7): 241 Mb/s Anyway, I have to submit a v2 because the current one fails to build on compilers which can't understand that byte_long is constant: lib/string.c:52:39: error: initializer element is not constant static const unsigned int word_mask = bytes_long - 1; ^~~~~~~~~~ Regards, -- per aspera ad upstream