[PATCH 0/5] camellia: cleanup, de-unrolling, and 64bit-ization

Hi Herbert,

Please review the following patches and, if they look OK, propagate them upstream.

camellia1.diff:
    Move code blocks around so that related pieces are closer together:
    e.g. the CAMELLIA_ROUNDSM macro does not need to be separated
    from the rest of the code by a huge array of constants.

    Remove unused macros (COPY4WORD, SWAP4WORD, XOR4WORD[2])

    Drop SUBL(), SUBR() macros which only obscure things.
    Same for CAMELLIA_SP1110() macro and KEY_TABLE_TYPE typedef.

    Remove useless comments:
    /* encryption */ -- well it's obvious enough already!
    void camellia_encrypt128(...)

    Combine swap with copying at the beginning/end of encrypt/decrypt.


camellia2.diff:
    Rename some macros to shorter names: CAMELLIA_RR8 -> ROR8,
    which makes it obvious that it is just a right rotation,
    with nothing camellia-specific in it.
    CAMELLIA_SUBKEY_L() -> SUBKEY_L() - just shorter.
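
    For illustration only, a minimal sketch of what an 8-bit right
    rotation of a 32-bit word does. The function name ror8() is made up
    for this example; it is not the macro definition from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for a ROR8-style macro:
 * rotate a 32-bit word right by 8 bits. The low byte
 * wraps around and becomes the new high byte. */
static inline uint32_t ror8(uint32_t x)
{
    return (x >> 8) | (x << 24);
}
```

    For example, ror8(0x12345678) yields 0x78123456.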

    Move be32 <-> cpu conversions out of en/decrypt128/256 and into
    camellia_en/decrypt - no reason to have that code duplicated.


camellia3.diff:
    Optimize GETU32 to use a 4-byte memcpy (modern gcc will convert
    such a memcpy to a single move instruction on i386).
    The original GETU32 did four byte fetches, and shifted/XORed those.
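
    A rough userspace sketch of the two approaches, for comparison.
    All function names here are invented for this example and are not
    the ones in the patch. The sketch assumes the loaded word must be
    big-endian, so the memcpy variant swaps bytes on little-endian
    hosts (in the kernel that job belongs to the be32 <-> cpu helpers).

```c
#include <stdint.h>
#include <string.h>

/* Byte-wise big-endian load: four byte fetches, shifted and XORed. */
static uint32_t getu32_bytes(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) ^ ((uint32_t)p[1] << 16) ^
           ((uint32_t)p[2] << 8) ^ (uint32_t)p[3];
}

/* Portable 32-bit byte swap (hypothetical helper for this sketch). */
static uint32_t swab32_sketch(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Runtime endianness probe: true if the low byte is stored first. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* memcpy-based load: one 4-byte copy (safe for unaligned p),
 * then a conversion from big-endian to host order. */
static uint32_t getu32_memcpy(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return host_is_little_endian() ? swab32_sketch(v) : v;
}
```

    Both functions return 0x01234567 for the bytes 01 23 45 67. The
    memcpy form also stays well-defined for unaligned pointers, which a
    plain *(u32 *)p dereference would not be.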


camellia4.diff:
    Move huge unrolled pieces of code (3 screenfuls) at the end of
    128/256 key setup routines into a common camellia_setup_tail(),
    and convert them to a loop there.
    The loop is still unrolled six times, so the performance hit is
    very small, while the code size win is big.


camellia5.diff:
    Use alternative key setup implementation with mostly 64-bit ops
    if BITS_PER_LONG >= 64. Both much smaller and much faster.

    Unify camellia_en/decrypt128/256 into camellia_do_en/decrypt.
    Code was similar; with just one additional if() we can use the same code.

    If CONFIG_CC_OPTIMIZE_FOR_SIZE is defined,
    use a loop in camellia_do_en/decrypt instead of unrolled code.
    This costs a ~5% encrypt/decrypt slowdown.

    Replace (x & 0xff) with (u8)x. gcc is not smart enough to realize
    that it can implement (x & 0xff) as a byte cast (which gives
    smaller code, at least on i386).

    Don't do (x & 0xff) in a few places where x cannot be > 255 anyway:
        t0 = il >> 16; v = camellia_sp0222[(t0 >> 8) & 0xff];
    il is u32, thus t0 fits in 16 bits and (t0 >> 8) is one byte!
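
    Both byte-extraction points can be checked in isolation. This is a
    small userspace sketch with made-up function names, using uint8_t
    and uint32_t in place of the kernel's u8 and u32; it is not code
    from the patch.

```c
#include <stdint.h>

/* Low byte via masking. */
static inline uint32_t low_byte_mask(uint32_t x)
{
    return x & 0xff;
}

/* Low byte via a narrowing cast: same value as the mask,
 * but expressed in a form the compiler can map to a byte move. */
static inline uint32_t low_byte_cast(uint32_t x)
{
    return (uint8_t)x;
}

/* Top byte of a 32-bit word. After ">> 16" only 16 bits remain,
 * so a further ">> 8" leaves at most 8 bits and needs no mask. */
static inline uint32_t top_byte(uint32_t il)
{
    uint32_t t0 = il >> 16;

    return t0 >> 8;
}
```

    For any 32-bit x, low_byte_mask(x) == low_byte_cast(x), and
    top_byte(x) never exceeds 255.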



Benchmarking was done in userspace (see attached tarball for code).
All times are in microseconds. Two runs give some idea of test variability.
"Setup NN: NNNNNN NNNNNN" - time taken by 100000 key setups (two runs).
"Encrypt: NNNNNN NNNNNN" - time taken by 1000 encryptions of 8K buffer.
"Decrypt: NNNNNN NNNNNN" - time taken by 1000 decryptions of 8K buffer.
"(matches)" - encrypt/decrypt cycle reproduced the original plaintext uncorrupted.
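
For readers without the tarball, a minimal sketch of how such a
"N runs, total microseconds" figure can be collected in userspace.
This is not the attached harness: the names are hypothetical, the
workload is a placeholder, and clock() measures CPU time, which may
differ from whatever timer the real test program uses.

```c
#include <time.h>

/* Placeholder workload standing in for one key setup
 * or one pass over a buffer. */
static volatile unsigned long sample_counter;

static void sample_workload(void)
{
    sample_counter++;
}

/* Call fn() `runs` times and return the CPU time consumed,
 * converted to microseconds. */
static double time_runs_us(void (*fn)(void), unsigned long runs)
{
    clock_t start = clock();

    for (unsigned long i = 0; i < runs; i++)
        fn();

    clock_t end = clock();

    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}
```

Calling time_runs_us(sample_workload, 100000) twice in a row gives
the kind of paired numbers shown in the tables below.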

CONFIG_CC_OPTIMIZE_FOR_SIZE is not set:

$ ./camellia
Setup 16:32779 33169 Encrypt:153582 153740 Decrypt:150985 149811 (matches)
Setup 24:49333 48987 Encrypt:197973 198853 Decrypt:201240 197585 (matches)
Setup 32:46700 47680 Encrypt:195650 195800 Decrypt:195450 195469 (matches)
$ ./camellia5
Setup 16:33417 32968 Encrypt:149195 149095 Decrypt:148593 148661 (matches)
Setup 24:50082 50064 Encrypt:201214 199204 Decrypt:197078 197579 (matches)
Setup 32:48938 48824 Encrypt:200231 199545 Decrypt:198954 198996 (matches)
$ ./camellia_64
Setup 16:22247 22473 Encrypt:152321 149860 Decrypt:149058 148451 (matches)
Setup 24:33832 34017 Encrypt:200428 202969 Decrypt:196789 195524 (matches)
Setup 32:32884 32821 Encrypt:200414 200640 Decrypt:197857 195987 (matches)

$ size camellia.o camellia5.o camellia_64.o
   text    data     bss     dec     hex filename
  24586       0       0   24586    600a camellia.o
  21714       0       0   21714    54d2 camellia5.o
  18666       0       0   18666    48ea camellia_64.o

Very small speed loss in camellia -> camellia5, noticeably smaller size.
Big key setup speedup in 64-bit camellia_64, and it is even smaller.


CONFIG_CC_OPTIMIZE_FOR_SIZE is set:

$ ./camellia_Os
Setup 16:32573 34985 Encrypt:151825 152011 Decrypt:147581 147630 (matches)
Setup 24:48528 49250 Encrypt:196223 199056 Decrypt:198811 196394 (matches)
Setup 32:46650 47538 Encrypt:197466 196412 Decrypt:196290 196550 (matches)
$ ./camellia5_Os
Setup 16:33360 34487 Encrypt:154718 154499 Decrypt:157432 157135 (matches)
Setup 24:53969 54304 Encrypt:205184 205818 Decrypt:210675 208552 (matches)
Setup 32:53064 52904 Encrypt:205350 205439 Decrypt:211654 208468 (matches)
$ ./camellia_64_Os
Setup 16:24696 25894 Encrypt:155903 155747 Decrypt:157385 155696 (matches)
Setup 24:33873 33230 Encrypt:206111 206385 Decrypt:208111 207650 (matches)
Setup 32:32799 32325 Encrypt:209715 205973 Decrypt:207578 207644 (matches)

$ size camellia_Os.o camellia5_Os.o camellia_64_Os.o
   text    data     bss     dec     hex filename
  24586       0       0   24586    600a camellia_Os.o
  15906       0       0   15906    3e22 camellia5_Os.o
  13098       0       0   13098    332a camellia_64_Os.o

Some speed loss in camellia -> camellia5, much smaller size.
Big key setup speedup in 64-bit camellia_64, and it is even smaller still.


The sizes above are for userspace test programs. Kernel sizes are similar.
For example, kernel module sizes with CONFIG_CC_OPTIMIZE_FOR_SIZE set, AMD64:

$ size */camellia.o
   text    data     bss     dec     hex filename
  23208     272       0   23480    5bb8 crypto.org/camellia.o
  11328     272       0   11600    2d50 crypto/camellia.o

Signed-off-by: Denys Vlasenko <vda.linux@xxxxxxxxxxxxxx>
--
vda

Attachment: test_camellia.tar.bz2
Description: application/tbz

