Resuming development of Synthetic Registers (really long)

Andy Walker <ja_walker@xxxxxxxxxxxxx> · Fri, 17 Sep 2010 19:02:54 -0500

*Back to Synthetic Registers*

This is a request for guidance and assistance in implementing an unusual 
performance approach using Synthetic Registers.

I am modifying the x86-32 backend of GCC to add Synthetic Registers. If 
I am mistaken, this is a big waste of time. If I am correct, (meaning 
"really, really lucky") then this may provide a significant increase in 
speed for executables. Long ago, I found a description on the internet 
of someone else who tried this on a different compiler, and achieved 
speed-ups. Some increases were 148%. Unfortunately, I have not been able 
to rediscover that description.

Synthetic Registers are memory locations that are treated, and described 
to the compiler, as registers.

Each Synthetic Register is a Word of memory, defined to the compiler as 
a register. If my readings in machine architecture are correct, then for 
modern machines, L1 cache access is as fast as register access. Because 
the Synthetic Registers will be frequently accessed, they will remain in 
L1 cache, and give register-like performance. The compiler is "told" 
that the Synthetic Registers must be accessed with their own 
instructions. In reality, those special instructions are 
mostly-specified "modrm" instructions. I am using "modrm" instructions 
to minimize instruction length and to decrease complexity.

The concept was first proposed by Lal George, one of the creators of the 
IRC (Iterated Register Coalescing) algorithm for color-graphing. Quoting 
Mr. George, from the synopsis [bolding by me]

   “Thus, for the x86, the conceptual model of the architecture has
   been extended with a set of memory locations that are treated as
   registers. The net effect is one where temporaries are computed into
   memory locations and passed as arguments to functions. *The use of
   these memory locations managed in this way can be as fast as using
   registers, where the register allocation algorithm is indirectly
   taking the hardware register renaming mechanism into account*.”

SMLNJ: Intel x86 back end Compiler Controlled Memory (1999)
Lal George

Conceptually, Synthetic Registers are similar to actual registers, with 
many of the same characteristics. Clearly, this cannot include basing or 
indexing, because of the addressing modes on the x86 architecture.

Synthetic Registers are not "virtual", because they are not a simulation 
of a register, using a lot of instructions to perform a simple operation.

Synthetic Registers are not "pseudo" because they are not temporary or 
symbolic representations of real registers. Once implemented, they are 
every bit as real as the “real” registers.

Because of the x86 architecture, they may be used with many of the same 
instructions for "real" registers, with the same results. Because of the 
narrow way they are defined, they will operate as fast as "real" registers.

These days, with register-renaming, even the real registers themselves 
are somewhat ephemeral.

Synthetic Registers are an imperfect answer. They are limited to 
binary-type instructions. They cannot be used as base or index 
registers. They cannot be directly copied to memory.

The right answer was finally implemented in the 64-bit architectures – 
just implement more registers.

If the Synthetic Registers are properly defined to the compiler, the 
compiler will believe that it has more useable registers, and will use 
pre-existing optimization mechanisms for registers, without further 
changes.

The philosophy is straightforward. I believe that GCC could handle 
intermediate results more efficiently. Register based values are now 
handled very efficiently. Memory based values are handled less 
efficiently. As far as I know, there is no cost distinction between the 
first slot on the stack and the last. I believe that for long functions, 
more than a hundred lines of code, GCC compiled code spends much of its 
time recreating intermediate results again and again. This is because 
there is no place to store intermediate results, other than in the 
registers, and so the results are discarded and recreated.

Register allocation schemes, including Color-graphing, work poorly, the 
fewer registers there are available for use. Six registers are sub-optimal.

Again, Lal George: “The standard Chaitin graph coloring register 
allocation cannot be used directly for machines with few registers, as 
all temporaries wind up being spilled, making for a poor Allocation.”

I first proposed this on the GCC mailing list on Friday, December 27, 2002.

We discussed it on the GCC mailing list for two weeks, and I received a 
number of helpful emails. At the end of the two weeks, my workload 
significantly increased and I had to drop it at the time. Those who 
kindly responded were:

Daniel Berlin Kevin Handy Diego Novillo
Paolo Bonzini Dave Hudson Alexandre Oliva
Denis Chertykov Janis Johnson Toshi
Marcel Cox Chris Lattner Zack Weinberg
Robert Dewar Tom Lord
Daniel Egger James Mansion

I now have time available and want to pick up where I left off.

A search using the keywords “synthetic registers andy walker” retrieved 
all the relevant messages in the GCC archive.

My concept was original, but not novel. Chris Lattner directed me to the 
paper by Lal George.

From Mr. George’s paper, it appears that his implementation of 
Synthetic Registers was successful.

Other posters also validated the concept.

On the GCC mailing list, Dave Hudson responded with this:

       [Paolo Bonzini]
       That's not unusual. If one had, say, to write a back-end for the
       6502, one
       might think of representing the zero-page as caller-save
       registers, and hide the weeny three registers A,X,Y from the
       middle-end (using them only
       internally, or very near to this). …

   This strategy is exactly what the IP2k port of gcc does. We use 32
   directly addressable memory locations as our common "registers" and
   expose the two pointer and stack pointer regs. <snip>

   It's not pretty, but it does work very well.

Michael S. Zick, responding to another of my comments, said this:

   The argument I presented above is not 100% academic...

   I have actually done (in a non-compiler context) such x86
   programming, with the added decoration of making the primary
   registers 10 bytes long so that they could hold any FPU/ALU data type.

Daniel Egger said: “if you have a look at register rich architectures 
like PPC you'll see that the generated code rarely exceeds a number of 
16 used registers, most of them being used for parameters and return 
values.”

The instructions that can operate on a Synthetic Register are: MOV, ADD, 
SUB, NEG, AND, OR, XOR, ROR, ROL, ASHL, ASHR, and LSHR. In addition, a 
synthetic register may be a source for a MUL or DIVMOD. For this short 
list of instructions, a Synthetic register can be a source or 
destination, in conjunction with a register. It may also be a 
destination in conjunction with an immediate value.

My own assembler experience is from application development on IBM 
Mainframes - the 360/370/390 series. On the s/390 equipment, we had 16 
General Purpose registers, and that was half what we really needed. I 
find the thought of only 8, or really, 6 general-purpose registers being 
available for the compiler to be just ridiculous.

I intend to change the stack layout, in spite of a sage comment by 
Marcel Cox about the ABI – “any change to the stack layout that would 
lead to ABI incompatibilities would certainly not have a chance of ever 
being accepted.”

My first thought is to use the frame pointer as the base pointer for the 
Synthetic Registers.

My first step is simply to add storage on the stack for a number of 
general-purpose Synthetic Registers, but not use them.

The contiguous block of Synthetic Registers is aligned on whatever block 
boundary is used by the L1 cache. There must be real memory somewhere 
from which the L1 cache will be drawn. I am implementing it as an extra 
block of bytes in the frame. This means that every function will have 
its own block of Synthetic Registers and there will be no need to save 
or restore Synthetic Registers. Synthetic Registers will not be 
clobbered by function calls.

If the call stack is 10 calls deep, we get 10 different sets of 
Synthetic Registers. Also, a function compiled with Synthetic Registers 
can be linked to, and called from, a function compiled without Synthetic 
Registers, with no additional complexity.

Once this part is working, I want to look into varying the number of 
Synthetic Registers, depending upon the needs of the compiler for a 
particular function. I am also quite curious about the value of 
Synthetic floating-point registers.

I will set up the allocation order so that the Synthetic Registers are 
allocated before the general purpose registers. This way, 
general-purpose registers are reserved for situations where Synthetic 
Registers are not appropriate.

Initially, all of my changes will be to files: ix86.c, ix86.h, ix86.md, 
ix86-protos.h, and config.gcc. No changes to any register allocating or 
scheduling routines or files.

In ix86.h, I am declaring the first pseudo-register to be number 84.

The draft code changes that I made years ago have been lost, so I am 
starting from scratch again.

Initial question: any obvious landmines in changing the prologue and 
epilogue to stick in an additional 40 bytes on the stack?

Andy