From: Maciej Żenczykowski <maze@xxxxxxx>

> That's a good point - does anyone know what the new Intel
> Virtualization thingamajig in the new dual core Pentium D's is
> about?

It's all speculation at this point, but there are _several_ factors.
I'm sure the first time Intel saw AMD's x86-64/PAE52 presentation,
the same thing popped into Intel's mind that popped into mine ...
virtualization.

- The 48-bit/256TiB limitation of x86-64 "Long Mode"

There is a "programmer's limit" of 48-bit/256TiB in x86-64 "Long
Mode."  This limitation comes from how the i386/i486 TLB works --
16-bit segment, 32-bit offset.  Had AMD chosen to ignore such
compatibility, it would have been near-impossible for 32-bit/PAE36
programs to run under a kernel of a different model.  But "Long
Mode" was designed so its PAE52 model could run both 32-bit (and
PAE36) programs as well as new 48-bit programs.  We'll revisit that
in a bit (a small numeric sketch at the end of this section makes
the limits concrete).

Now, let's talk about Intel/AMD design lineage.

- Intel IA-32 Complete Design Lineage

IA-32 Gen 1 (1986): i386, including i486
- Non-superscalar: ALU + optional FPU (std. in 486DX), TLB added
  in i486

IA-32 Gen 2 (1992): i586, Pentium/MMX (defunct, redesigned in i686)
- Superscalar: 2+1 ALU+FPU (pipelined)

IA-32 Gen 3 (1994): i686, Pentium Pro, II, III, 4 (partial refit)
- Superscalar: 2+2 ALU+FPU (pipelined), FPU 1 complex or 2 ADD
- P3 = +1 SSE pipe, P4 = +2 SSE pipes

Intel hasn't revamped its aging i686 architecture in almost 12
years.  The Pentium Pro through Pentium III are the _exact_same_
7-issue (2+2+3 ALU+FPU+control) design (the P3 slaps on one SSE
unit), and the Pentium 4 was a quick, 18-month refit that extended
the pipelines for clock speed (with an associated reduction in
ALU/FPU performance, MHz for MHz) and added a 2nd SSE unit.

I'm sure Intel's reasoning for not bothering with a complete
generational redesign beyond i686 is that it thought
EPIC/Predication would have taken over by now.  The reality has
been quite the opposite (which I won't get back into).

Since then, Intel has made a number of "hacks" to the i686
architecture.  One is HyperThreading, which tries to keep its pipes
full by using its control units to virtualize two instruction
schedulers, register sets, etc...  In a nutshell, it's a nice way
to get "out-of-order execution and register renaming almost for
free."  Other than the basic coherency checking necessary in
silicon, it "passes the buck" to the OS, leveraging the OS's
context switching (and associated overhead) to manage some details.
That's why HyperThreading can actually be slower for some
applications: they do not thread, and the added overhead in
_software_ results in reduced processing time for those
applications.

"Yamhill," IA-32e aka "EM64T," was just a P4 ALU refit for
x86-64/PAE52, but it lacks many design considerations that the
Athlon has -- especially outside the programmer/software
considerations, and definitely more at the core
interconnect/platform level.  I.e., because Intel continues to use
a single-point-of-contention "memory controller hub" (MCH), memory
interconnect and I/O management, among other details, are still
left to the MCH.  This is going to become more and more of a
headache.  The reality is that the Intel IA-32e platform _must_ get
past the "northbridge outside the CPU" attitude to compete with
AMD.  As such, I have _always_ theorized that "Yamhill" is a 2-part
project.
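Before getting to Part 2, that promised numeric aside on the
limits.  This is a minimal sketch of mine (not from any vendor
document, just arithmetic): it computes the PAE36/48-bit/PAE52
address-space sizes quoted above, and checks the x86-64 "canonical
address" rule that enforces the 48-bit programmer's limit -- bits
63:48 of a long-mode virtual address must all equal bit 47.

    /* Address-space arithmetic for the limits discussed above,
     * plus a canonical-address check.  Relies on arithmetic right
     * shift of signed values (true on GCC/clang). */
    #include <stdio.h>
    #include <stdint.h>

    /* 1 if 'va' is a canonical 48-bit long-mode virtual address. */
    static int is_canonical48(uint64_t va)
    {
        /* Sign-extend bit 47 through bit 63, then compare. */
        int64_t ext = ((int64_t)(va << 16)) >> 16;
        return (uint64_t)ext == va;
    }

    int main(void)
    {
        printf("PAE36:  %llu GiB\n", (1ULL << 36) >> 30); /*  64 */
        printf("48-bit: %llu TiB\n", (1ULL << 48) >> 40); /* 256 */
        printf("PAE52:  %llu PiB\n", (1ULL << 52) >> 50); /*   4 */

        /* Last canonical address, then the first non-canonical. */
        printf("%d %d\n", is_canonical48(0x00007fffffffffffULL),
                          is_canonical48(0x0000800000000000ULL));
        return 0;
    }

Any C99 compiler will do; the two test addresses straddle the
canonical boundary, so it prints "1 0".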
Part 2 of that theorized project is the first redesign of an x86
core in almost (now) 12 years -- one that goes beyond merely adding
true register renaming and out-of-order execution (which are
largely hacks in the P4/HT) and goes directly to the concept of
virtualizing cores.  More on that in a bit; now, AMD ...

- AMD x86 Complete Design Lineage

AMD Gen 1 (1992*): i386/486 ISA -- 386, 486, 5x86, K5*
- Non-superscalar: ALU + optional FPU (std. in K5)

AMD Gen 2 (1994*): i486/686 ISA -- Nx586+FPU/K5*, Nx686/K6
- Superscalar: 3+1 ALU+FPU (ALUs pipelined, FPU _not_ pipelined)

AMD Gen 3 (1999): i686/x86-64 ISA -- Athlon, Athlon64/Opteron
- Superscalar: 3+3 ALU+FPU (pipelined), FPU 2 complex _and_ 1
  ADD/MULT
- Extensions are microcoded and leverage ALU/FPU as appropriate

*NOTE: The NexGen Nx586, released in 1994, forms the basis for the
later K5 (i486) and the K6 (i686).  AMD had scalability issues with
its original non-superscalar K5 design and purchased NexGen.

SIDE NOTE: SSE Comparison
- P4 can do 3 MULT SSE (1 FPU complex + 2 SSE pipes)
- Athlon can do 3 MULT SSE (2 FPU complex + 1 FPU MULT)

Contrary to popular opinion, the Athlon64/Opteron is the
_same_core_ design as the 32-bit Athlon platform.  It is still the
same, ultra-powerful 3+3 ALU+FPU core, with its 2 complex + 1
ADD/MULT FPU able to equal Intel's 1 complex _or_ 2 ADD FPU plus 2
SSE pipes at the majority of matrix transforms (which are
MULT-heavy -- hence why Intel's FPU can't do 2 simultaneously, and
relies heavily on its precision-lacking SSE pipes).

Also contrary to popular opinion, the 40-bit/1TiB Digital Alpha EV6
interconnect forms the basis for _all_ addressing in _all_ Athlon
releases, including the 32-bit ones.  A few mainboards allow even
32-bit Athlons to safely address above 4GB with _no_ paging or
issues (given an OS with a supporting kernel, like Linux).  The
EV6's 3-16 point crossbar (not "hub") architecture forced the
Athlon MP to put any I/O coherency logic in the chip itself, so the
AGPgart control is actually on the Athlon MP, not in the
northbridge.  This has evolved into a full I/O MMU in the
Athlon64/Opteron.

Because the Athlon is 5 years newer than Intel's i686, and there
was a wealthy influx of talent from Digital (Intel got some as
well, but hasn't completely redesigned the i686), the Athlon has
some of the latest run-time register renaming and out-of-order
execution control in the core itself.  This is why doing something
like HyperThreading would benefit AMD _very_little_ and would
largely introduce self-defeating (and even performance-reducing)
overhead.

In addition to the design of PAE52, the #1 reason you can safely
assume AMD is moving towards virtualization is the design limits it
put on the Athlon64/Opteron.  E.g., although the original 32-bit
Athlon platform used logic that allowed up to the full EV6 8MB of
SRAM addressing (cache), the Athlon64/Opteron has been artificially
limited to 1MB of SRAM (saving many design considerations, among
other benefits).  This clearly indicates AMD did not consider the
Athlon64/Opteron the end of the line for this core design.

- The Evolution to Virtual Cores

AMD's adoption of '90s concepts of register renaming and
out-of-order execution is great for a single core.  And Intel's
HyperThreading, with the minor P4 run-time additions, passes the
buck decently in lieu of a complete core redesign (which Intel
hasn't done since 1994).  But the concept of extending the pipes
any further for performance has been largely broken in the P4, and
Intel is actually falling back to its last rev of the i686
original, the P3.  Multiple, _physical_ cores have been the first
step.
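Incidentally, both HyperThreading's virtualized contexts and these
physical cores surface to software the same way, via CPUID leaf 1.
A hedged sketch of mine, assuming GCC or clang on x86 for the
<cpuid.h> helper (and note the HTT flag alone doesn't distinguish
SMT from plain multi-core):

    /* Probe CPUID leaf 1 for the HTT flag and the logical
     * processor count per physical package. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 1 unsupported\n");
            return 1;
        }

        int htt     = (edx >> 28) & 1;    /* HTT feature flag    */
        int logical = (ebx >> 16) & 0xff; /* logical CPUs/package */

        printf("HTT: %d, logical processors/package: %d\n",
               htt, logical);
        return 0;
    }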
Physical multi-core is little more than slapping in a second set of
all the non-SRAM transistors, plus any additional bridging logic,
if necessary.  AMD HyperTransport requires none -- HyperTransport
can "tunnel" anything (EV6 memory/addressing, I/O tunnels/bridges,
inter-CPU, etc...) all "gluelessly."  Intel's MCH GTL+ cannot, and
requires bridges between the "chipset MCH" and the "multi-core
MCH," adding latency.  And there are nagging 32-bit limitations
with GTL+ as well (long story).

The next logical evolution in microprocessor design is to blur the
physical separation between cores.  It's the best way forward
without tearing down the entire '70s-induced concept of machine
code (operator + operand, possibly control, at least microcoded
internally) and the resulting instruction sets.  Instead of
discrete, superscalar cores of a half-dozen to a dozen pipelined
units, there will be numerous, independent pipes, possibly with
their own registers or a pool of generic registers, as a single
unit.  Other than the controlling firmware and/or OS, this is _not_
what software will use.  What software will use are the virtual
instantiations that partition this set of pipes and registers, and
the partitioning may very well be dynamic in nature.

Say I boot Windows: I might instantiate a virtual i686/PAE36 core
guaranteeing 100% full Win32 compatibility.  Depending on what
resources the chip physically has, I would likely even instantiate
multiple i686 processors.  The concept of multi-CPU and
multi-threading thus evolves into virtual cores with virtual
threading.  Virtualizing more CPUs -- with a larger total number of
pipes/registers than physically exist -- will keep more registers
and pipelines executing, instead of the common 40-50% utilization
for superscalar CISC or 60-70% for superscalar RISC.

As an "added bonus," this means the 48-bit/256TiB constraint for
PAE36 compatibility is _removed_.  I.e., you can have a much
larger, true memory pool, and any required windowing/segmentation
is done with_out_ paging by the "host" memory model, even though
the OS is virtually running in a PAE36 or PAE52 model.

This also gives rise to an entirely new platform for virtualization
of simultaneous OSes -- be it the same OS, or different OSes.
Because cores are virtual, you can have multiple, independent
processors with their own registers, memory windows into physical
RAM, etc...  On the more "consumer" front, this will work with
existing OSes as-is.  On the more "load-balancing server" front,
this will often be paired with software (think EMC/VMWare *SX
products) so numerous instances can be dynamically load-balanced
across virtual cores -- far more of the overhead moves onto the
chip, with increased efficiency, but it is still managed by
software (just with reduced context-switching overhead in that
software).

Again, it's really just a consolidation of all the run-time
optimizations we have now, along with both the multi-core and
multi-threading approaches, into a general pool of pipes, registers
and organization (see the toy model below).  Additionally, it
breaks the physical constraints of the memory model for the
physical hardware, which is a very big issue for our future.  To
ensure x86/PAE36 and x86-64/PAE52 compatibility in the future, such
machines will need to be virtualized, or we'll be stuck at
48-bit/256TiB.

> As in is it worth anything?

Yes -- and almost everything to the future of Microsoft being able
to sustain much of its existing Win32 codebase, which does _not_
port to PAE52 very easily and definitely _not_ with full
compatibility.
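To make that "general pool of pipes and registers" idea concrete,
here is a purely conceptual toy model of mine.  Every name, number
and field in it is hypothetical -- invented for illustration only,
corresponding to no real chip interface:

    /* Toy model: a fixed pool of execution pipes and physical
     * registers, carved into "virtual cores" on demand. */
    #include <stdio.h>

    #define POOL_PIPES 24
    #define POOL_REGS  256

    struct vcore {
        int pipes;     /* execution pipes assigned to this vcore */
        int regs;      /* physical registers assigned            */
        int addr_bits; /* 36 = PAE36 guest, 52 = PAE52 guest     */
    };

    static int free_pipes = POOL_PIPES;
    static int free_regs  = POOL_REGS;

    /* Carve a virtual core out of the shared pool, if it fits. */
    static int vcore_alloc(struct vcore *vc, int pipes, int regs,
                           int addr_bits)
    {
        if (pipes > free_pipes || regs > free_regs)
            return -1;
        free_pipes -= pipes;
        free_regs  -= regs;
        vc->pipes = pipes;
        vc->regs = regs;
        vc->addr_bits = addr_bits;
        return 0;
    }

    int main(void)
    {
        struct vcore win, lnx;

        /* "Boot Windows": a PAE36 vcore for full Win32 compat. */
        if (vcore_alloc(&win, 6, 64, 36) == 0)
            printf("PAE36 vcore: %d pipes, %d regs\n",
                   win.pipes, win.regs);

        /* A wider PAE52 vcore alongside it, from the same pool. */
        if (vcore_alloc(&lnx, 9, 96, 52) == 0)
            printf("PAE52 vcore: %d pipes, %d regs\n",
                   lnx.pipes, lnx.regs);

        printf("left in pool: %d pipes, %d regs\n",
               free_pipes, free_regs);
        return 0;
    }

The point of the toy: the pool, not any one core, is the real
resource, and the PAE36/PAE52 "processors" the OS sees are just
partitions of it.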
We also have to break the 48-bit/256TiB limitation of PAE52 while
still ensuring PAE52 OSes/applications, as well as some legacy
PAE36 OSes/applications, still run.  The only way is to virtualize
the whole freak'n chip so we can instantiate a processor, its
registers and its memory model -- even if dynamically
assigned/shared.

And that's just for end-users, and possibly workstations and entry
servers.  For load-balancing servers, you'll still need a software
solution for management; the hardware just offers far greater
efficiency and reduced context switching.  In fact, the next
consolidation is these virtual-core chips in blades, where you
manage not only the virtual cores in the individual chips/blades,
but an entire rack of blades as a single unit, with multiple OSes
spread across it.  This already exists, but virtualized processors
take it one step further -- with greatly reduced overhead on the
part of the software.

> Will it allow a dual simultaneous boot of Linux+WinXP+MacOS
> under Xen or something along those lines?

Yes.  It will both give more virtualized processors to a single
executing OS, and create segmented, virtualized processors for
independently and simultaneously operating OSes.

> Even on an SMP machine?

First off, remove the Intel-centric notion of "Symmetric" MP (SMP).
Secondly, multi-processing and multi-threading are going to merge
with traditional register renaming and out-of-order execution.  So
the traditional concept of "MP" is _dying_; in fact, it really died
in general back in the '90s.  I know it's hard to think outside the
box and traditional thought, but most users don't understand
superscalar design in the first place.  Those who do understand why
AMD has _not_ bothered to adopt Intel's SMT (HyperThreading) in the
Athlon: it won't benefit, because AMD's cores are 5 years newer in
design and put far more optimizations in the chip to keep pipes
full and registers used than to virtualize two sets for the OS to
use.

> Anyone have any experience/knowledge about this?

I can only speculate based on the history of the players involved,
as well as AMD's PAE52 design and the limitations of the current
Athlon core (which is largely the _same_ between the 32-bit and
newer 64-bit versions).  But the concept of adding more pipes with
lots of stages for timing only leaves more and more pipe stages
empty, or doing little.  There has to be a consolidation of the
many run-time optimizations inside the chip, and the best way to do
that is to create a bank of pipes, registers, etc... and virtually
assemble them into virtual cores that are partitioned, with memory,
as a traditional PAE36 or PAE52 processor (or multi-processor).
It's going to solve a _lot_ of issues -- both semiconductor and
software.

> What level of CPU/hardware(?) does the virt-core support?
> And is the virt-core 32bit?

You can be certain that the "host" OS (possibly firmware-based?)
will be able to instantiate multiple PAE36 and/or PAE52 virtual
systems, each with its own -- I'll use legacy terminology here
(even if it's not technically correct) -- "Ring 0" access.  So,
technically, it should be possible to run any PAE36 or PAE52 OS
simultaneously on the same hardware as any other PAE36 or PAE52 OS.
The larger issues of firmware-OS interoperability, as well as
partitioning resources (memory, disk, etc...), are really more
political/market issues.  I.e., AMD and Intel can provide the
platform, but people have to work together to use it.
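For what it's worth, whatever silicon ships, expect software to
discover this capability via CPUID.  A minimal sketch of mine,
assuming the feature bits the vendors use (Intel VMX: leaf 1, ECX
bit 5; AMD SVM: leaf 0x80000001, ECX bit 2) and GCC/clang's
<cpuid.h> helper:

    /* Probe for hardware-assisted virtualization extensions. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            printf("Intel VMX: %s\n",
                   (ecx >> 5) & 1 ? "yes" : "no");

        if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            printf("AMD SVM:   %s\n",
                   (ecx >> 2) & 1 ? "yes" : "no");

        return 0;
    }

A "yes" only means the extension exists; firmware can still disable
it (e.g., a lock bit in an MSR on the Intel side) -- which leads
right back to the firmware politics.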
Furthermore, the political/market angle also means Intel can
continue to best AMD in funding OEMs and firmware/software vendors,
so it still has an advantage in that capacity.  I'm sure Apple will
be protective of its firmware, and Intel's new, supposedly "open"
firmware is rather proprietary.  As I've repeatedly commented
elsewhere, the 2 "most open" hardware vendors right now are AMD and
Sun -- x86-64 and SPARC, respectively.  Intel has not only
protected the non-programmer aspects of IA-64 heavily, but most of
its new platform developments for even IA-32e (EM64T) are
_very_proprietary_.  IBM is partially doing the same with Power in
its microelectronics offerings, but _not_ in its branded Power
solutions (among others).

So it's not going to solve the problem of vendors who require
firmware and data organization that is not open and standardized.
We're fine on legacy Win32 platforms, but it's not going to address
Mactel, nor solve the problem of existing OSes that don't run under
current virtualization solutions because of such proprietary
requirements.

--
Bryan J. Smith  mailto:b.j.smith@xxxxxxxx