...
> BTW off topic (but relevant to this patchset), I strongly feel that
> routines like memset/memcpy are better coded in assembly for really
> water tight instruction scheduling and ease of further optimizing (e.g.
> use of CMO.zero etc as experimented by Philipp). What is blocking you
> from optimizing the asm version ? You are leaving the fate of these
> critical routines in the hand of compiler - this can lead to performance
> shenanigans on a big gcc upgrade.

You also need to worry about the cost of short transfers.
A few cycles there could make a much bigger difference than
something that speeds up long transfers.
Short ones are likely to be fairly common.

I doubt the loop unrolling optimisation in gcc is actually any good
for loops that might only be executed a few times.

Fortunately the kernel doesn't get 'hit by' gcc unrolling loops
into AVX instructions.
The setup costs for that (and the I-cache footprint) are horrid.
Although I suspect it is that optimisation that 'broke' code
that used misaligned pointers on overlapping data.

It is a general problem with the 'one size fits all' memcpy().

	David
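
To illustrate the short-transfer point, here is a rough C sketch. It is not code from this patchset: sketch_memcpy() and the 16-byte SHORT_LIMIT are made-up names and values for illustration only. The idea is that short copies go straight to a byte loop and pay no setup cost, while longer ones are copied a word at a time. Whether this split helps on a given CPU would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cut-off for "short" copies; an arbitrary value, not a tuned one. */
#define SHORT_LIMIT 16

/* Hypothetical helper, not the kernel's memcpy(). Buffers must not overlap. */
void *sketch_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Debug-only sanity check; compiled out with NDEBUG. */
	assert(len == 0 || (dst && src));

	/* Short transfers: plain byte loop, no alignment tests or other setup. */
	if (len < SHORT_LIMIT) {
		while (len--)
			*d++ = *s++;
		return dst;
	}

	/*
	 * Longer transfers: one 64-bit word per iteration. The fixed-size
	 * memcpy() calls are a portable way to express a possibly unaligned
	 * load and store; compilers usually lower each to a single access.
	 */
	while (len >= sizeof(uint64_t)) {
		uint64_t w;

		memcpy(&w, s, sizeof(w));
		memcpy(d, &w, sizeof(w));
		s += sizeof(w);
		d += sizeof(w);
		len -= sizeof(w);
	}

	/* Remaining tail bytes (fewer than 8). */
	while (len--)
		*d++ = *s++;

	return dst;
}
```

With this shape a 5-byte copy costs one compare plus the byte loop, and only copies of 16 bytes or more reach the word loop.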