IBM’s Hot Chips presentation on its forthcoming 45nm POWER7 server processor had a wealth of information on the chip, which, at 1.2 billion transistors and 567mm², is actually quite svelte considering what it offers. The secret is the first use of a special cache technology that IBM has been touting since 2007, but more on that in a moment.
POWER7 will come in 4-, 6-, and 8-core varieties, with the default presumably being the 8-core and the lower-core variants being offered to improve yields. Each core features 4-way simultaneous multithreading, which means that the 8-core will support a total of 32 simultaneous threads per socket. POWER7 is designed for multisocket systems that scale up to 32 sockets, which means that a full 32-socket system of 8-core parts would support 1024 threads.
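The thread-count math above is simple enough to check directly. A quick sketch (the numbers are all from IBM's stated specs):

```python
# Thread-count arithmetic for a fully populated POWER7 system,
# using the core/socket figures from IBM's presentation.
cores_per_socket = 8     # top-end POWER7 variant
threads_per_core = 4     # 4-way SMT
max_sockets = 32         # maximum system scale

threads_per_socket = cores_per_socket * threads_per_core
total_threads = max_sockets * threads_per_socket

print(threads_per_socket)  # 32
print(total_threads)       # 1024
```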
Feeding eight cores in a single socket is quite a challenge, which is why each POWER7 has a pair of four-channel DDR3 controllers that can support up to 100GB/s of sustained memory bandwidth. Also helping the situation is a whopping 32MB of on-die L3 cache—IBM was able to cram this much cache on there by using a special embedded DRAM (eDRAM) design that cuts the transistor cost of its large cache pool roughly in half.
To see how dramatic the transistor savings from this eDRAM cache scheme are, compare the 8-core, 32MB-cache POWER7’s 1.2 billion transistors with the 2 billion transistors of Intel’s 4-core, 30MB-cache “Tukwila” Itanium. Sure, POWER7’s eDRAM is almost certainly a bit slower than Tukwila’s SRAM, but in today’s power-sensitive age that level of transistor savings is impressive. Also consider how POWER7 stacks up against the eight-core Nehalem-EX, which has 24MB of cache and weighs in at over 2.2 billion transistors; again, IBM did more with less.
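One rough way to frame the comparison is transistors per megabyte of L3. These are whole-chip transistor counts, so cores and uncore are lumped in with the cache, which makes this a crude but still illustrative ratio:

```python
# Crude transistors-per-MB-of-L3 comparison using the whole-chip
# counts quoted above (cores and uncore are included in the totals,
# so treat these as rough ratios, not cache-array densities).
chips = {
    "POWER7 (eDRAM L3)":  {"transistors_m": 1200, "l3_mb": 32},
    "Tukwila (SRAM)":     {"transistors_m": 2000, "l3_mb": 30},
    "Nehalem-EX (SRAM)":  {"transistors_m": 2200, "l3_mb": 24},
}

for name, c in chips.items():
    ratio = c["transistors_m"] / c["l3_mb"]
    print(f"{name}: {ratio:.1f}M transistors per MB of L3")

# POWER7 (eDRAM L3): 37.5M transistors per MB of L3
# Tukwila (SRAM): 66.7M transistors per MB of L3
# Nehalem-EX (SRAM): 91.7M transistors per MB of L3
```

Even with the uncore noise, POWER7 comes out well ahead, which is what you'd expect given that an eDRAM cell needs one transistor where a standard SRAM cell needs six.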
Note that the four-way SMT design is another trick that helps with the problem of feeding all that hardware, by acting as a latency-hiding mechanism for each core’s back end. If one thread stalls waiting on memory, the core can (ideally) find instructions from another running thread to feed to the execution units in order to keep them busy. This bandwidth issue is probably one reason behind IBM’s decision to go with such a high level of SMT.
Speaking of a POWER7 core’s back end, each core contains a very robust suite of execution resources. There are 12 execution units in total, broken down as follows:
- 2 integer units
- 2 load-store units
- 4 double-precision floating-point units
- 1 branch unit
- 1 condition register unit
- 1 vector unit
- 1 decimal floating-point unit
Those of you who’ve read my past microprocessor articles or my book will know what most of the above units are for, with the possible exceptions of the PPC-specific condition register unit (which was also present on the 970) and the decimal floating-point unit, which accelerates math functions commonly found in mainframe workloads.
My only real comment about the above is that four DP floating-point units is a lot of floating-point power. This makes sustained streaming bandwidth from memory critically important for POWER7’s FP performance, so it’s a good thing that it has plenty of it.
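To put a rough number on that floating-point power: assuming each DP unit can retire one fused multiply-add (two flops) per cycle, which is my assumption rather than a figure from the presentation, the per-socket peak works out as follows:

```python
# Back-of-envelope peak DP throughput per socket. The 2-flops-per-
# unit-per-cycle figure assumes a fused multiply-add each cycle --
# an assumption for illustration, not a number IBM quoted.
cores = 8
fp_units_per_core = 4
flops_per_unit_per_cycle = 2  # assumed FMA

peak_flops_per_cycle = cores * fp_units_per_core * flops_per_unit_per_cycle
print(peak_flops_per_cycle)  # 64 DP flops per cycle per socket
```

At multi-GHz clocks that's a lot of results to feed, which is exactly why the 100GB/s of sustained memory bandwidth matters so much here.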
I’m told that POWER7 continues the “group dispatch” scheme that has been a part of the POWER line since the POWER4 days. I described in detail how this works in my first article on the PowerPC 970 (a.k.a. the G5). In a nutshell, it cuts down on the amount of bookkeeping logic needed to track in-flight instructions by dispatching and tracking them in bundles. On the POWER4 and 970, instructions were dispatched from the instruction queue to the back end in bundles of five; on POWER7, the dispatch groups have been widened to six slots.
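The bundling idea can be sketched in a few lines. This toy version only shows the basic mechanism, carving the instruction stream into fixed-size groups so the tracking tables hold one entry per group rather than per instruction; the real hardware also enforces formation rules (e.g. around branches) that I'm glossing over here:

```python
# Toy sketch of group dispatch: the back end tracks one entry per
# bundle instead of one per instruction, shrinking the bookkeeping
# tables. Group-formation rules here are simplified, not POWER7's.
def dispatch_groups(instructions, group_size=6):
    """Yield bundles of up to group_size instructions, in order."""
    for i in range(0, len(instructions), group_size):
        yield instructions[i:i + group_size]

insns = [f"op{i}" for i in range(14)]
groups = list(dispatch_groups(insns))
print([len(g) for g in groups])  # [6, 6, 2] -- 14 insns, 3 tracking entries
```

With five-slot groups (POWER4/970) those same 14 instructions would need the same three entries here, but across a long instruction stream the wider six-slot groups mean fewer entries in flight for the same window of work.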
In all, IBM has produced a very impressive 32-thread monster of a chip with a ton of cache and plenty of memory bandwidth, and it has done so with roughly half the transistors of the competition. This is quite an achievement, and it reiterates just how strong IBM remains in the very lucrative high-end server market.