I wonder if it would be useful to have something like this in tree.
It states trivial things for anyone who has looked at disassembly a few
times, but still...

Signed-off-by: Alexey Dobriyan <adobriyan@xxxxxxxxx>
---
 Documentation/process/code-generation.rst | 196 ++++++++++++++++++++++++++++++
 1 file changed, 196 insertions(+)

new file mode 100644
--- /dev/null
+++ b/Documentation/process/code-generation.rst
@@ -0,0 +1,196 @@
+Code generation
+===============
+
+1) Generic techniques
+---------------------
+
+### a) Inlining/uninlining function calls ###
+
+An external function call is serious business from the code generation point
+of view. ABIs require that specific arguments be placed into specific
+registers before making the call, forcing spilling and register shuffling to
+accommodate ABI rules. Registers the ABI treats as clobbered but which the
+callee doesn't actually use are wasted. Declaring a function as
+``static inline`` in a header gives the compiler more information to work
+with.
+
+However, excessive inlining often leads to code bloat with no measurable
+performance gain. In such cases it is probably better to save on generated
+code for the sake of icache footprint, disk I/O and network bandwidth.
+
+Use the ``noinline`` attribute to prevent inlining inside a translation unit
+and see what happens:
+
+.. code-block:: c
+
+        noinline
+        int f()
+        {
+                ...
+        }
+
+It is hard to advise much more than that, as modern compilers generate code
+in mysterious ways.
+
+
+### b) Appending arguments ###
+
+Some functions are thin wrappers appending an argument or two to another
+function which actually does the job:
+
+.. code-block:: c
+
+        int g(int, int, flag_t);
+        int f(int a, int b)
+        {
+                return g(a, b, FLAG_C);
+        }
+
+Appending an argument at the end adds the minimum amount of code:
+
+.. code-block:: none
+
+        f:
+                mov edx, FLAG_C
+                jmp g
+
+Adding an argument in the middle or at the beginning (e.g. ``g(FLAG_C, a, b)``)
+generates a register shuffling sequence instead:
+
+.. code-block:: none
+
+        f:
+                mov edx, esi
+                mov esi, edi
+                mov edi, FLAG_C
+                jmp g
+
+Do not enforce this rule religiously, as there may be other reasons for a
+specific argument order, most notably keeping related arguments together at
+the source level.
+
+
+2) Architecture-specific issues (i386/x86_64)
+---------------------------------------------
+
+### a) Member placement ###
+
+The first member of any structure is very special on i386/x86_64: the compiler
+will use the ``[r32]`` or ``[r64]`` addressing mode, which has the shortest
+encoding. After laying out the members of a structure into cachelines for
+performance, move the most often used member of the first cacheline to the
+very beginning.
+
+Having done that, pay attention to bytes 1..127. Members placed there will be
+accessed with the ``[r64+disp8]`` encoding (or ``[r32+disp8]`` on i386). This
+is only 1 byte longer than the encoding used for the first member but 3 bytes
+*shorter* than the ``[r64+disp32]`` encoding used for all other members. Try
+to shift more often used members into the first two cachelines.
+
+"Refugee" members living in bytes 128 and beyond can be placed in any order.
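+
+As a sketch (the structure and its member names below are made up for
+illustration, not taken from real kernel code), a layout following this
+advice could look like:
+
+.. code-block:: c
+
+        struct foo {
+                /* hottest member first: reachable with a bare [r64] */
+                unsigned int state;
+
+                /* often used members kept within bytes 1..127: [r64+disp8] */
+                unsigned int flags;
+                void *owner;
+                unsigned char buf[112];
+
+                /*
+                 * Members from byte 128 onwards all need [r64+disp32],
+                 * so their relative order no longer matters for code size.
+                 */
+                unsigned long error_count;
+                unsigned long last_seen;
+        };
+
+The exact member types and sizes are beside the point; what matters is which
+members end up reachable with a ``disp8``.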
+
+
+### b) Implicit 32/64-bit casts ###
+
+Avoid casts which change the signedness and/or bitness of a value.
+
+If some piece of data appears in the code, it should generally be kept in its
+original type unless there are specific reasons to do otherwise (packing,
+etc.). With C's seemingly arcane implicit and explicit casting rules, this is
+good advice from a programming language point of view as well.
+
+Given the code:
+
+.. code-block:: c
+
+        void f(size_t);
+
+        int len = strlen(s);
+        f(len);
+
+if the compiler doesn't or can't maintain value ranges through the casts, it
+has no choice but to assume that all ``size_t`` values are possible and emit a
+MOVSX instruction:
+
+.. code-block:: none
+
+        mov rdi, ...
+        call strlen
+        movsx rdi, eax
+        call f
+
+MOVSX by itself is not a problem, but a) it may be 1 byte longer than a MOV
+instruction with the same arguments and b) unlike a plain register-to-register
+MOV it won't be eliminated by register renaming, lengthening the dependency
+chain by 1 instruction.
+
+
+### c) 64-bitness ###
+
+64-bit instructions are 1 byte longer than the corresponding 32-bit
+equivalents on x86_64 because of the REX.W prefix.
+
+There is one big 64-bit enabler, which is dynamic memory allocation: all
+kmalloc variants accept ``size_t`` and the ``sizeof`` operator returns
+``size_t``.
+
+Do not use 64-bit types/``size_t`` unless strictly necessary
+(pointer-to-integer conversions, syscall ABI interfaces, integers which can be
+genuinely big on big machines, statistics).
+
+Use 32-bit ``unsigned int``. The kernel simply doesn't do individual 4+ GB
+allocations, and when it does, they probably go via the page allocator. Such
+huge amounts of memory simply aren't needed: the network doesn't do gigabyte
+packets, the VFS caps I/O at 2 GB minus a little, and interacting with
+userspace via ``copy_from_user``/``copy_to_user`` is capped at ``INT_MAX``
+as well.
+
+.. code-block:: c
+
+        #define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
+
+The only exceptional case is a ``size_t`` value being passed directly into a
+standard function accepting ``size_t`` (``memset``, ``memcpy``, ...).
+Truncating the value to 32 bits won't achieve anything useful in this case.
+
+
+### d) 16-bitness ###
+
+16-bit instructions require a 1-byte operand size override prefix (66), which
+again bloats an instruction by 1 byte. Unlike REX prefixes, this is
+unavoidable.
+
+It is better to use 16-bit types at the ABI/protocol/memory level, convert to
+plain ``int``/``unsigned int`` as soon as possible and work with that.
+
+The preferred order of bitness on x86_64 is:
+
+        32/8-bit > 64-bit > 16-bit.
+
+
+3) Architecture-specific issues (arm/arm64)
+-------------------------------------------
+
+### Constant flag value selection ###
+
+"Tight" constants can be loaded into a register in 1 instruction on arm and
+other RISC architectures:
+
+.. code-block:: c
+
+        int f()
+        {
+                return 1;
+        }
+
+.. code-block:: none
+
+        00000000 <f>:
+           0:   e3a00001        mov     r0, #1
+           4:   e12fff1e        bx      lr
+
+Constants which don't fit into arm's 12-bit immediate encoding (an 8-bit value
+rotated right by an even number of bits) will be loaded from memory (a literal
+pool) or constructed with 2 instructions:
+
+.. code-block:: none
+
+        00000000 <f>:
+           0:   e59f0000        ldr     r0, [pc]        ; 8 <f+0x8>
+           4:   e12fff1e        bx      lr
+           8:   00000801        .word   0x00000801      ; <=== 2049
+
+After settling on a set of flags/constants, pack the most often used values
+close together bitwise so that their common combinations remain "tight".
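+
+As a rough sketch (the flag names and values below are hypothetical, purely
+for illustration): if the flags tested on hot paths share the low bits, the
+masks built from them remain valid arm immediates, while a mask combining
+far-apart bits cannot be encoded, much like the 0x801 example above:
+
+.. code-block:: c
+
+        /* frequently tested flags packed into the low bits */
+        #define FLAG_COMMON_A   0x001
+        #define FLAG_COMMON_B   0x002
+        #define FLAG_COMMON_C   0x004
+        /* rarely tested flag pushed further up */
+        #define FLAG_RARE       0x800
+
+        /* 0x007 fits in 8 bits: the mask is a single "tight" immediate */
+        if (flags & (FLAG_COMMON_A | FLAG_COMMON_B | FLAG_COMMON_C))
+                ...
+
+        /*
+         * 0x801 spans bits 0 and 11 and is not a valid immediate: the mask
+         * comes from a literal pool or costs an extra instruction.
+         */
+        if (flags & (FLAG_COMMON_A | FLAG_RARE))
+                ...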