The patch titled
     Subject: lib/lzo: clean-up by introducing COPY16
has been added to the -mm tree.  Its filename is
     lib-lzo-clean-up-by-introducing-copy16.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-lzo-clean-up-by-introducing-copy16.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-lzo-clean-up-by-introducing-copy16.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Matt Sealey <matt.sealey@xxxxxxx>
Subject: lib/lzo: clean-up by introducing COPY16

Most compilers should be able to merge adjacent loads/stores of sizes
which are less than but effect a multiple of a machine word size (in
effect a memcpy() of a constant amount).  However the semantics of the
macro are that it just does the copy, the pointer increment is in the
code, hence we see

	*a = *b
	a += 8
	b += 8
	*a = *b
	a += 8
	b += 8

This introduces a dependency between the two groups of statements which
seems to defeat said compiler optimizers and generate some very strange
sequences of addition and subtraction of address offsets (i.e. it is
overcomplicated).

Since COPY8 is only ever used to copy amounts of 16 bytes (in pairs), just
define COPY16 as COPY8,COPY8.  We leave the definition to preserve the
need to do unaligned accesses to machine-sized words per the original code
intent, we just don't use it in the code proper.

COPY16 then gives us code like:

	*a = *b
	*(a+8) = *(b+8)
	a += 16
	b += 16

This seems to allow compilers to generate much better code by using base
register writeback or simply positively incrementing offsets which seems
to positively affect performance.  It is, at least, fewer instructions to
do the same job.

Link: http://lkml.kernel.org/r/20181127161913.23863-3-dave.rodgman@xxxxxxx
Signed-off-by: Matt Sealey <matt.sealey@xxxxxxx>
Signed-off-by: Dave Rodgman <dave.rodgman@xxxxxxx>
Cc: David S. Miller <davem@xxxxxxxxxxxxx>
Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Cc: Markus F.X.J. Oberhumer <markus@xxxxxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Nitin Gupta <nitingupta910@xxxxxxxxx>
Cc: Richard Purdie <rpurdie@xxxxxxxxxxxxxx>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@xxxxxxxxx>
Cc: Sonny Rao <sonnyrao@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---


--- a/lib/lzo/lzo1x_compress.c~lib-lzo-clean-up-by-introducing-copy16
+++ a/lib/lzo/lzo1x_compress.c
@@ -60,8 +60,7 @@ next:
 				op += t;
 			} else if (t <= 16) {
 				*op++ = (t - 3);
-				COPY8(op, ii);
-				COPY8(op + 8, ii + 8);
+				COPY16(op, ii);
 				op += t;
 			} else {
 				if (t <= 18) {
@@ -76,8 +75,7 @@ next:
 					*op++ = tt;
 				}
 				do {
-					COPY8(op, ii);
-					COPY8(op + 8, ii + 8);
+					COPY16(op, ii);
 					op += 16;
 					ii += 16;
 					t -= 16;
@@ -255,8 +253,7 @@ int lzo1x_1_compress(const unsigned char
 			*op++ = tt;
 		}
 		if (t >= 16) do {
-			COPY8(op, ii);
-			COPY8(op + 8, ii + 8);
+			COPY16(op, ii);
 			op += 16;
 			ii += 16;
 			t -= 16;
--- a/lib/lzo/lzo1x_decompress_safe.c~lib-lzo-clean-up-by-introducing-copy16
+++ a/lib/lzo/lzo1x_decompress_safe.c
@@ -86,12 +86,9 @@ copy_literal_run:
 					const unsigned char *ie = ip + t;
 					unsigned char *oe = op + t;
 					do {
-						COPY8(op, ip);
-						op += 8;
-						ip += 8;
-						COPY8(op, ip);
-						op += 8;
-						ip += 8;
+						COPY16(op, ip);
+						op += 16;
+						ip += 16;
 					} while (ip < ie);
 					ip = ie;
 					op = oe;
@@ -187,12 +184,9 @@ copy_literal_run:
 				unsigned char *oe = op + t;
 				if (likely(HAVE_OP(t + 15))) {
 					do {
-						COPY8(op, m_pos);
-						op += 8;
-						m_pos += 8;
-						COPY8(op, m_pos);
-						op += 8;
-						m_pos += 8;
+						COPY16(op, m_pos);
+						op += 16;
+						m_pos += 16;
 					} while (op < oe);
 					op = oe;
 					if (HAVE_IP(6)) {
--- a/lib/lzo/lzodefs.h~lib-lzo-clean-up-by-introducing-copy16
+++ a/lib/lzo/lzodefs.h
@@ -23,6 +23,9 @@
 		COPY4(dst, src); COPY4((dst) + 4, (src) + 4)
 #endif
 
+#define COPY16(dst, src) \
+	do { COPY8(dst, src); COPY8((dst) + 8, (src) + 8); } while (0)
+
 #if defined(__BIG_ENDIAN) && defined(__LITTLE_ENDIAN)
 #error "conflicting endian definitions"
 #elif defined(CONFIG_X86_64)
_

Patches currently in -mm which might be from matt.sealey@xxxxxxx are

lib-lzo-clean-up-by-introducing-copy16.patch
lib-lzo-enable-64-bit-ctz-on-arm.patch
lib-lzo-64-bit-ctz-on-arm64.patch
lib-lzo-fast-8-byte-copy-on-arm64.patch
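
For readers who want to try the two copy shapes from the changelog outside
the kernel, here is a minimal user-space sketch.  It is illustrative only
and not part of the patch.  SKETCH_COPY8, SKETCH_COPY16, copy_interleaved()
and copy_grouped() are hypothetical names, and a fixed-size memcpy() stands
in for the unaligned-access helpers that the real COPY8 in
lib/lzo/lzodefs.h relies on.  Both loops assume n is a non-zero multiple
of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-ins, NOT the kernel macros: a constant-size memcpy()
 * is used so that the sketch builds in user space.
 */
#define SKETCH_COPY8(dst, src)	memcpy((dst), (src), 8)
#define SKETCH_COPY16(dst, src) \
	do { SKETCH_COPY8(dst, src); SKETCH_COPY8((dst) + 8, (src) + 8); } while (0)

/* Old shape: the pointer increments sit between the two 8-byte copies. */
static void copy_interleaved(unsigned char *op, const unsigned char *ip,
			     size_t n)
{
	const unsigned char *ie = ip + n;

	assert(n > 0 && n % 16 == 0);
	do {
		SKETCH_COPY8(op, ip);
		op += 8;
		ip += 8;
		SKETCH_COPY8(op, ip);
		op += 8;
		ip += 8;
	} while (ip < ie);
}

/* New shape: both 8-byte copies first, then one increment of 16. */
static void copy_grouped(unsigned char *op, const unsigned char *ip,
			 size_t n)
{
	const unsigned char *ie = ip + n;

	assert(n > 0 && n % 16 == 0);
	do {
		SKETCH_COPY16(op, ip);
		op += 16;
		ip += 16;
	} while (ip < ie);
}
```

The two functions copy the same bytes.  Only the placement of the pointer
arithmetic differs, which is what the changelog suggests affects code
generation.  Comparing the output of something like "cc -O2 -S" for both
functions on the target architecture is one way to check that for a given
compiler.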