*Back to Synthetic Registers*
This is a request for guidance and assistance in implementing an unusual
performance approach using Synthetic Registers.
I am modifying the x86-32 backend of GCC to add Synthetic Registers. If
I am mistaken, this is a big waste of time. If I am correct, (meaning
"really, really lucky") then this may provide a significant increase in
speed for executables. Long ago, I found a description on the internet
of someone else who tried this on a different compiler, and achieved
speed-ups. Some increases were 148%. Unfortunately, I have not been able
to rediscover that description.
Synthetic Registers are memory locations that are treated, and described
to the compiler, as registers.
Each Synthetic Register is a Word of memory, defined to the compiler as
a register. If my readings in machine architecture are correct, then for
modern machines, L1 cache access is as fast as register access. Because
the Synthetic Registers will be frequently accessed, they will remain in
L1 cache, and give register-like performance. The compiler is "told"
that the Synthetic Registers must be accessed with their own
instructions. In reality, those special instructions are
mostly-specified "modrm" instructions. I am using "modrm" instructions
to minimize instruction length and to decrease complexity.
The concept was first proposed by Lal George, one of the creators of the
IRC (Iterated Register Coalescing) algorithm for color-graphing. Quoting
Mr. George, from the synopsis [bolding by me]
“Thus, for the x86, the conceptual model of the architecture has
been extended with a set of memory locations that are treated as
registers. The net effect is one where temporaries are computed into
memory locations and passed as arguments to functions. *The use of
these memory locations managed in this way can be as fast as using
registers, where the register allocation algorithm is indirectly
taking the hardware register renaming mechanism into account*.”
SMLNJ: Intel x86 back end Compiler Controlled Memory (1999)
Lal George
Conceptually, Synthetic Registers are similar to actual registers, with
many of the same characteristics. Clearly, this cannot include basing or
indexing, because of the addressing modes on the x86 architecture.
Synthetic Registers are not "virtual", because they are not a simulation
of a register, using a lot of instructions to perform a simple operation.
Synthetic Registers are not "pseudo" because they are not temporary or
symbolic representations of real registers. Once implemented, they are
every bit as real as the “real” registers.
Because of the x86 architecture, they may be used with many of the same
instructions for "real" registers, with the same results. Because of the
narrow way they are defined, they will operate as fast as "real" registers.
These days, with register-renaming, even the real registers themselves
are somewhat ephemeral.
Synthetic Registers are an imperfect answer. They are limited to
binary-type instructions. They cannot be used as base or index
registers. They cannot be directly copied to memory.
The right answer was finally implemented in the 64-bit architectures –
just implement more registers.
If the Synthetic Registers are properly defined to the compiler, the
compiler will believe that it has more useable registers, and will use
pre-existing optimization mechanisms for registers, without further
changes.
The philosophy is straightforward. I believe that GCC could handle
intermediate results more efficiently. Register based values are now
handled very efficiently. Memory based values are handled less
efficiently. As far as I know, there is no cost distinction between the
first slot on the stack and the last. I believe that for long functions,
more than a hundred lines of code, GCC compiled code spends much of its
time recreating intermediate results again and again. This is because
there is no place to store intermediate results, other than in the
registers, and so the results are discarded and recreated.
Register allocation schemes, including Color-graphing, work poorly, the
fewer registers there are available for use. Six registers are sub-optimal.
Again, Lal George: “The standard Chaitin graph coloring register
allocation cannot be used directly for machines with few registers, as
all temporaries wind up being spilled, making for a poor Allocation.”
I first proposed this on the GCC mailing list on Friday, December 27, 2002.
We discussed it on the GCC mailing list for two weeks, and I received a
number of helpful emails. At the end of the two weeks, my workload
significantly increased and I had to drop it at the time. Those who
kindly responded were:
Daniel Berlin Kevin Handy Diego Novillo
Paolo Bonzini Dave Hudson Alexandre Oliva
Denis Chertykov Janis Johnson Toshi
Marcel Cox Chris Lattner Zack Weinberg
Robert Dewar Tom Lord
Daniel Egger James Mansion
I now have time available and want to pick up where I left off.
A search using the keywords “synthetic registers andy walker” retrieved
all the relevant messages in the GCC archive.
My concept was original, but not novel. Chris Lattner directed me to the
paper by Lal George.
From Mr. George’s paper, it appears that his implementation of
Synthetic Registers was successful.
Other posters also validated the concept.
On the GCC mailing list, Dave Hudson responded with this:
[Paolo Bonzini]
That's not unusual. If one had, say, to write a back-end for the
6502, one
might think of representing the zero-page as caller-save
registers, and hide the weeny three registers A,X,Y from the
middle-end (using them only
internally, or very near to this). …
This strategy is exactly what the IP2k port of gcc does. We use 32
directly addressable memory locations as our common "registers" and
expose the two pointer and stack pointer regs. <snip>
It's not pretty, but it does work very well.
Michael S. Zick, responding to another of my comments, said this:
The argument I presented above is not 100% academic...
I have actually done (in a non-compiler context) such x86
programming, with the added decoration of making the primary
registers 10 bytes long so that they could hold any FPU/ALU data type.
Daniel Egger said: “if you have a look at register rich architectures
like PPC you'll see that the generated code rarely exceeds a number of
16 used registers, most of them being used for parameters and return
values.”
The instructions that can operate on a Synthetic Register are: MOV, ADD,
SUB, NEG, AND, OR, XOR, ROR, ROL, ASHL, ASHR, and LSHR. In addition, a
synthetic register may be a source for a MUL or DIVMOD. For this short
list of instructions, a Synthetic register can be a source or
destination, in conjunction with a register. It may also be a
destination in conjunction with an immediate value.
My own assembler experience is from application development on IBM
Mainframes - the 360/370/390 series. On the s/390 equipment, we had 16
General Purpose registers, and that was half what we really needed. I
find the thought of only 8, or really, 6 general-purpose registers being
available for the compiler to be just ridiculous.
I intend to change the stack layout, in spite of a sage comment by
Marcel Cox about the ABI – “any change to the stack layout that would
lead to ABI incompatibilities would certainly not have a chance of ever
being accepted.”
My first thought is to use the frame pointer as the base pointer for the
Synthetic Registers.
My first step is simply to add storage on the stack for a number of
general-purpose Synthetic Registers, but not use them.
The contiguous block of Synthetic Registers is aligned on whatever block
boundary is used by the L1 cache. There must be real memory somewhere
from which the L1 cache will be drawn. I am implementing it as an extra
block of bytes in the frame. This means that every function will have
its own block of Synthetic Registers and there will be no need to save
or restore Synthetic Registers. Synthetic Registers will not be
clobbered by function calls.
If the call stack is 10 calls deep, we get 10 different sets of
Synthetic Registers. Also, a function compiled with Synthetic Registers
can be linked to, and called from, a function compiled without Synthetic
Registers, with no additional complexity.
Once this part is working, I want to look into varying the number of
Synthetic Registers, depending upon the needs of the compiler for a
particular function. I am also quite curious about the value of
Synthetic floating-point registers.
I will set up the allocation order so that the Synthetic Registers are
allocated before the general purpose registers. This way,
general-purpose registers are reserved for situations where Synthetic
Registers are not appropriate.
Initially, all of my changes will be to files: ix86.c, ix86.h, ix86.md,
ix86-protos.h, and config.gcc. No changes to any register allocating or
scheduling routines or files.
In ix86.h, I am declaring the first pseudo-register to be number 84.
The draft code changes that I made years ago have been lost, so I am
starting from scratch again.
Initial question: any obvious landmines in changing the prologue and
epilogue to stick in an additional 40 bytes on the stack?
Andy