Hiroshi Shimamoto <h-shimamoto@xxxxxxxxxxxxx> writes:

> I noticed that the stack usage of the code gcc-4.x generates looks
> inefficient on x86 and x86_64.  I found this while looking at the
> assembly code of the Linux kernel.
>
> Is this inefficient stack usage a regression?

It does seem to be a regression in this case.  It appears to be the
result of the tree reassociation pass.  That pass reassociates trees
in order to expose redundancies which can then be eliminated.  Your
code ties all the expressions together via the | operation, and they
all get sorted together.  This increases the live ranges of the
operands, and nothing ever fixes that up.

> I made a simple test case.

Note that your test case is wrong.

> #define copy_from_asm(x, addr, err) \
> asm volatile( \
> "1:\tmovl %2, %1\n" \
> "2:\n" \
> ".section .fixup,\"ax\"\n" \
> "\txor %1,%1\n" \
> "\tmov $1,%0\n" \
> "\tjmp 2b\n" \
> ".previous\n" \
> : "=r" (err), "=r" (x) \
> : "m" (*(int*)(addr)))

This says that it sets "err", but it doesn't always do so.  I
modified the last line to this:

  : "m" (*(int*)(addr)), "0" (err))

which ensures that the register holding 'err' is initialized.

Please feel free to report a bug; see http://gcc.gnu.org/bugs.html .

Note that your code relies on the fact that the asm does not change
err in the normal case.  You will get much better code if you take
advantage of that fact:

#define copy_from(x, addr, err) do { \
  copy_from_asm((x), (addr), (err)); \
  } while (0)

#define copy(x, addr, err) ({ \
  copy_from((x), (addr), err); \
  })

#define my_copy(x) do { copy(dst[x], &src[x], err); } while (0)

Ian