Resuming development of Synthetic Registers (really long)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



*Back to Synthetic Registers*

This is a request for guidance and assistance in implementing an unusual performance approach using Synthetic Registers.

I am modifying the x86-32 backend of GCC to add Synthetic Registers. If I am mistaken, this is a big waste of time. If I am correct, (meaning "really, really lucky") then this may provide a significant increase in speed for executables. Long ago, I found a description on the internet of someone else who tried this on a different compiler, and achieved speed-ups. Some increases were 148%. Unfortunately, I have not been able to rediscover that description.

Synthetic Registers are memory locations that are treated, and described to the compiler, as registers.

Each Synthetic Register is a Word of memory, defined to the compiler as a register. If my readings in machine architecture are correct, then for modern machines, L1 cache access is as fast as register access. Because the Synthetic Registers will be frequently accessed, they will remain in L1 cache, and give register-like performance. The compiler is "told" that the Synthetic Registers must be accessed with their own instructions. In reality, those special instructions are mostly-specified "modrm" instructions. I am using "modrm" instructions to minimize instruction length and to decrease complexity.

The concept was first proposed by Lal George, one of the creators of the IRC (Iterated Register Coalescing) algorithm for color-graphing. Quoting Mr. George, from the synopsis [bolding by me]

   “Thus, for the x86, the conceptual model of the architecture has
   been extended with a set of memory locations that are treated as
   registers. The net effect is one where temporaries are computed into
   memory locations and passed as arguments to functions. *The use of
   these memory locations managed in this way can be as fast as using
   registers, where the register allocation algorithm is indirectly
   taking the hardware register renaming mechanism into account*.”


SMLNJ: Intel x86 back end Compiler Controlled Memory (1999)
Lal George

Conceptually, Synthetic Registers are similar to actual registers, with many of the same characteristics. Clearly, this cannot include basing or indexing, because of the addressing modes on the x86 architecture.

Synthetic Registers are not "virtual", because they are not a simulation of a register, using a lot of instructions to perform a simple operation.

Synthetic Registers are not "pseudo" because they are not temporary or symbolic representations of real registers. Once implemented, they are every bit as real as the “real” registers.

Because of the x86 architecture, they may be used with many of the same instructions for "real" registers, with the same results. Because of the narrow way they are defined, they will operate as fast as "real" registers.

These days, with register-renaming, even the real registers themselves are somewhat ephemeral.

Synthetic Registers are an imperfect answer. They are limited to binary-type instructions. They cannot be used as base or index registers. They cannot be directly copied to memory.

The right answer was finally implemented in the 64-bit architectures – just implement more registers.

If the Synthetic Registers are properly defined to the compiler, the compiler will believe that it has more useable registers, and will use pre-existing optimization mechanisms for registers, without further changes.

The philosophy is straightforward. I believe that GCC could handle intermediate results more efficiently. Register based values are now handled very efficiently. Memory based values are handled less efficiently. As far as I know, there is no cost distinction between the first slot on the stack and the last. I believe that for long functions, more than a hundred lines of code, GCC compiled code spends much of its time recreating intermediate results again and again. This is because there is no place to store intermediate results, other than in the registers, and so the results are discarded and recreated.

Register allocation schemes, including Color-graphing, work poorly, the fewer registers there are available for use. Six registers are sub-optimal.

Again, Lal George: “The standard Chaitin graph coloring register allocation cannot be used directly for machines with few registers, as all temporaries wind up being spilled, making for a poor Allocation.”

I first proposed this on the GCC mailing list on Friday, December 27, 2002.

We discussed it on the GCC mailing list for two weeks, and I received a number of helpful emails. At the end of the two weeks, my workload significantly increased and I had to drop it at the time. Those who kindly responded were:

Daniel Berlin Kevin Handy Diego Novillo
Paolo Bonzini Dave Hudson Alexandre Oliva
Denis Chertykov Janis Johnson Toshi
Marcel Cox Chris Lattner Zack Weinberg
Robert Dewar Tom Lord
Daniel Egger James Mansion


I now have time available and want to pick up where I left off.

A search using the keywords “synthetic registers andy walker” retrieved all the relevant messages in the GCC archive.

My concept was original, but not novel. Chris Lattner directed me to the paper by Lal George.

From Mr. George’s paper, it appears that his implementation of Synthetic Registers was successful.

Other posters also validated the concept.

On the GCC mailing list, Dave Hudson responded with this:

       [Paolo Bonzini]
       That's not unusual. If one had, say, to write a back-end for the
       6502, one
       might think of representing the zero-page as caller-save
       registers, and hide the weeny three registers A,X,Y from the
       middle-end (using them only
       internally, or very near to this). …


   This strategy is exactly what the IP2k port of gcc does. We use 32
   directly addressable memory locations as our common "registers" and
   expose the two pointer and stack pointer regs. <snip>

   It's not pretty, but it does work very well.


Michael S. Zick, responding to another of my comments, said this:

   The argument I presented above is not 100% academic...

   I have actually done (in a non-compiler context) such x86
   programming, with the added decoration of making the primary
   registers 10 bytes long so that they could hold any FPU/ALU data type.


Daniel Egger said: “if you have a look at register rich architectures like PPC you'll see that the generated code rarely exceeds a number of 16 used registers, most of them being used for parameters and return values.”

The instructions that can operate on a Synthetic Register are: MOV, ADD, SUB, NEG, AND, OR, XOR, ROR, ROL, ASHL, ASHR, and LSHR. In addition, a synthetic register may be a source for a MUL or DIVMOD. For this short list of instructions, a Synthetic register can be a source or destination, in conjunction with a register. It may also be a destination in conjunction with an immediate value.

My own assembler experience is from application development on IBM Mainframes - the 360/370/390 series. On the s/390 equipment, we had 16 General Purpose registers, and that was half what we really needed. I find the thought of only 8, or really, 6 general-purpose registers being available for the compiler to be just ridiculous.

I intend to change the stack layout, in spite of a sage comment by Marcel Cox about the ABI – “any change to the stack layout that would lead to ABI incompatibilities would certainly not have a chance of ever being accepted.”

My first thought is to use the frame pointer as the base pointer for the Synthetic Registers.

My first step is simply to add storage on the stack for a number of general-purpose Synthetic Registers, but not use them.

The contiguous block of Synthetic Registers is aligned on whatever block boundary is used by the L1 cache. There must be real memory somewhere from which the L1 cache will be drawn. I am implementing it as an extra block of bytes in the frame. This means that every function will have its own block of Synthetic Registers and there will be no need to save or restore Synthetic Registers. Synthetic Registers will not be clobbered by function calls.

If the call stack is 10 calls deep, we get 10 different sets of Synthetic Registers. Also, a function compiled with Synthetic Registers can be linked to, and called from, a function compiled without Synthetic Registers, with no additional complexity.

Once this part is working, I want to look into varying the number of Synthetic Registers, depending upon the needs of the compiler for a particular function. I am also quite curious about the value of Synthetic floating-point registers.

I will set up the allocation order so that the Synthetic Registers are allocated before the general purpose registers. This way, general-purpose registers are reserved for situations where Synthetic Registers are not appropriate.

Initially, all of my changes will be to files: ix86.c, ix86.h, ix86.md, ix86-protos.h, and config.gcc. No changes to any register allocating or scheduling routines or files.

In ix86.h, I am declaring the first pseudo-register to be number 84.

The draft code changes that I made years ago have been lost, so I am starting from scratch again.

Initial question: any obvious landmines in changing the prologue and epilogue to stick in an additional 40 bytes on the stack?

Andy



[Index of Archives]     [Linux C Programming]     [Linux Kernel]     [eCos]     [Fedora Development]     [Fedora Announce]     [Autoconf]     [The DWARVES Debugging Tools]     [Yosemite Campsites]     [Yosemite News]     [Linux GCC]

  Powered by Linux