Hi!

On Mon, May 23, 2022 at 04:09:20PM +0000, David Laight wrote:
> On x86 (which I know a lot more about) memcpy() has a nasty
> habit of getting implemented as 'rep movsb' relying on the
> cpu to speed it up.
> But that doesn't happen for uncached addresses - so you get
> very slow byte copies.

I have measured the performance with my change (patched) and without
it (orig). The change improves performance on x86_64 and ARM; on MIPS64
it stays the same:

Tests
=====
All runtimes are in milliseconds, the average real time of 3 runs,
measured with the bash time built-in. The measured process ran under
SCHED_FIFO with priority 99. The page cache was flushed before every
run, but all involved program images were in tmpfs (no swap).

 - dd r512      dd if=/dev/TESTDEV of=/dev/null bs=512
 - dd r1MB      dd if=/dev/TESTDEV of=/dev/null bs=1M
 - dd w512      dd of=/dev/TESTDEV if=/tmpfs/img bs=512
 - dd w1MB      dd of=/dev/TESTDEV if=/tmpfs/img bs=1M
 - flashcp      flashcp /tmpfs/img /dev/TESTDEV
 - flasherase   flash_eraseall -q /dev/TESTDEV

Results
=======
All times are in ms

ARCH       |     MIPS64      |       ARM       |     x86_64
CPU        |   CN6335p2.2    |    v7 TI K2     |   Xeon D-1548
Dev. size  |      32MB       |      128MB      |      256MB
-----------+-------+---------+-------+---------+-------+---------
in ms      | Orig  | Patched | Orig  | Patched | Orig  | Patched
dd r512    |   131 |     130 |  1101 |     543 | 22906 |     281
dd r1MB    |    65 |      65 |   655 |     122 | 22715 |      70
dd w512    |  1150 |    1150 |  1136 |    1042 | 28067 |     412
dd w1MB    |   104 |     104 |   396 |     244 | 27761 |     122
flashcp    |   100 |      99 |  1438 |     568 | 78455 |     270
flasherase |    21 |      21 |   208 |      77 | 27707 |      57

BR,
  Petr
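For anyone who wants the ratios rather than the raw times, the table above can be turned into speedup factors (orig / patched) with a few lines of Python. This is only a convenience sketch: the names RESULTS and speedup() are made up for illustration, and the numbers are copied by hand from the results table, so please check them against the table before relying on them.

```python
# Runtimes in ms, copied from the results table as (orig, patched).
# RESULTS and speedup() are illustrative names, not part of the patch.
RESULTS = {
    "dd r512":    {"MIPS64": (131, 130),   "ARM": (1101, 543),  "x86_64": (22906, 281)},
    "dd r1MB":    {"MIPS64": (65, 65),     "ARM": (655, 122),   "x86_64": (22715, 70)},
    "dd w512":    {"MIPS64": (1150, 1150), "ARM": (1136, 1042), "x86_64": (28067, 412)},
    "dd w1MB":    {"MIPS64": (104, 104),   "ARM": (396, 244),   "x86_64": (27761, 122)},
    "flashcp":    {"MIPS64": (100, 99),    "ARM": (1438, 568),  "x86_64": (78455, 270)},
    "flasherase": {"MIPS64": (21, 21),     "ARM": (208, 77),    "x86_64": (27707, 57)},
}


def speedup(test, arch):
    """Return orig / patched runtime for one test on one architecture."""
    orig, patched = RESULTS[test][arch]
    return orig / patched


if __name__ == "__main__":
    # One line per test, one speedup factor per architecture.
    for test, per_arch in RESULTS.items():
        cells = "  ".join(
            f"{arch}: {speedup(test, arch):7.1f}x" for arch in per_arch
        )
        print(f"{test:<10}  {cells}")
```

A factor of 1.0 means the patch made no measurable difference for that test; anything above 1.0 is a speedup.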