On Thu, Feb 6, 2020 at 1:26 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> I know x86_64 stack alignment is 16 bytes.

That's true for the standard sysv ABI that is used in userspace; but
the kernel uses a custom ABI with 8-byte stack alignment. See
arch/x86/Makefile:

    # For gcc stack alignment is specified with -mpreferred-stack-boundary,
    # clang has the option -mstack-alignment for that purpose.
    ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
          cc_stack_align4 := -mpreferred-stack-boundary=2
          cc_stack_align8 := -mpreferred-stack-boundary=3
    else ifneq ($(call cc-option, -mstack-alignment=16),)
          cc_stack_align4 := -mstack-alignment=4
          cc_stack_align8 := -mstack-alignment=8
    endif
    [...]
            # By default gcc and clang use a stack alignment of 16 bytes for x86.
            # However the standard kernel entry on x86-64 leaves the stack on an
            # 8-byte boundary. If the compiler isn't informed about the actual
            # alignment it will generate extra alignment instructions for the
            # default alignment which keep the stack *mis*aligned.
            # Furthermore an alignment to the register width reduces stack usage
            # and the number of alignment instructions.
            KBUILD_CFLAGS += $(call cc-option,$(cc_stack_align8))

> I cannot find evidence for
> what function start alignment should be.

There is no architecturally required alignment for functions, but
Intel's Optimization Manual
(<https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf>)
recommends in section 3.4.1.5, "Code Alignment":

| Assembly/Compiler Coding Rule 12. (M impact, H generality)
| All branch targets should be 16-byte aligned.

AFAIK this is recommended because, as documented in section 2.3.2.1,
"Legacy Decode Pipeline" (describing the frontend of Sandy Bridge, and
used as the base for newer microarchitectures):

| An instruction fetch is a 16-byte aligned lookup through the ITLB and into the instruction cache.
| The instruction cache can deliver every cycle 16 bytes to the instruction pre-decoder.

AFAIK this means that if a branch target lies close to the end of a
16-byte block, the frontend is less efficient because it may have to
run two instruction fetches before the first instruction can even be
decoded.