Taking Arm Neoverse into 3D with Digital Full Flow

Paul McLellan

Arm's Shawn Hung (based in Austin) and Cadence's Rod Metcalfe presented on 3D design at Arm DevSummit, in a presentation titled Implementing 3D Neoverse N1: 3D Design Merits Meet In-Depth Analysis. What they described was an Arm Neoverse N1 implemented on two die that were then attached face to face with a process known as hybrid wafer bonding. This was Arm's first face-to-face wafer-bonded design.

There is a huge increase in interest in, and use of, various forms of advanced packaging that often go under the catchy name "More than Moore". The first chips that used 3D technology were actually the image sensors for cameras, which flipped the image sensor itself over (so the light entered through the back of the thinned die) and then attached it to the image processor, which could then pull data vertically from the sensor rather than having to get the data to the edge of the image sensor die. The next 3D chip that got attention was Xilinx's large FPGA, where they split the array into four identical die and mounted them on an interposer. AMD's CPUs are all built out of a range of die assembled on an interposer. The driver for AMD was that a die that big would not yield, or perhaps not even fit in the reticle, plus the high-end chips used HBM2 memories, which pretty much require an interposer. For a look at the range of designs that are using advanced SiP (system-in-package), see my post HOT CHIPS: Chipletifying Designs.

What Shawn described was something more ambitious still: to take a monolithic design, split it into two identically sized die, and then flip the top die over and attach it to the lower die to form a sandwich (as in the picture above). He described it as a test chip, but it is really more of a proof of concept: there are no plans to actually tape it out and manufacture it.

There are several motivations for why you might want to manufacture a processor like this:

Energy-efficient bandwidth and lower latency memory access
Lower cost since two smaller die yield better than one large one
Better scalability from higher compute density
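The yield argument in the second bullet can be made concrete with a back-of-the-envelope sketch. It assumes a simple Poisson yield model, Y = exp(-area x defect density), which is a textbook approximation and not something from the presentation. The die area and defect density are made-up illustrative values, and the sketch assumes each die can be tested before assembly, which may not hold for every bonding flow.

```python
import math

def poisson_yield(area_mm2, defects_per_mm2):
    """Fraction of good die under a simple Poisson defect model (assumption)."""
    return math.exp(-area_mm2 * defects_per_mm2)

def silicon_per_good_unit(total_area_mm2, n_dies, defects_per_mm2):
    """mm^2 of silicon consumed per good unit when the design is split into
    n_dies equal die and only known-good die are assembled (assumption)."""
    die_area = total_area_mm2 / n_dies
    good_fraction = poisson_yield(die_area, defects_per_mm2)
    return n_dies * die_area / good_fraction

# Hypothetical numbers, chosen only for illustration.
TOTAL_AREA_MM2 = 400.0
DEFECT_DENSITY = 0.002  # defects per mm^2

monolithic = silicon_per_good_unit(TOTAL_AREA_MM2, 1, DEFECT_DENSITY)
two_die = silicon_per_good_unit(TOTAL_AREA_MM2, 2, DEFECT_DENSITY)

print(f"monolithic: {monolithic:.0f} mm^2 per good unit")
print(f"two die:    {two_die:.0f} mm^2 per good unit")
```

In this model the saving depends only on the product of area and defect density, so the benefit grows with die size. Bonding yield and the extra processing steps are ignored here.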

They did a previous proof-of-concept design called Trishul last year (which they reported on at Arm TechCon although I didn't see it) to prove the readiness of 3D stacking:

Process GF 12LP FinFET
Operating frequency 2.7GHz (TT, 1V, 85°C)
Demonstrated 3D bandwidth of 2.355 Tbps and 3.68 TBps/mm2 on CMN-600 (Porter).
Measured gate delay between 3D layers in the range of 10-12ps, equivalent to FO2 gate delay in 2D
Cross-3D gate delay of 6-8ps feasible
3D designs do not need special consideration for 3D interface parasitics (global wire equivalent)
Process skew between wafers manageable
3D connection pitch was 3.76um
Tested 2000+ dies across 34 wafers: cumulative of 13.485 million 3D connections

For this project, the plan was to stack the main memory on the upper tier of a 3D microprocessor, since main memory is a significant bottleneck. In principle, for a processor with considerable memory demand, increasing the size of the on-chip L2 cache is an efficient approach to improve performance...except increasing the size of the L2 cache increases the time to access this memory. Folding the cache over the top of the logic stages of the pipeline reduces this access time. The design was done in 7nm.

In fact, for thermal reasons, it makes more sense to put the memory (L1 and L2 caches) on the bottom tier and logic on the top tier. This also enabled them to double the size of the L2 cache. It is only possible to build a 1MB L2 cache with a 9-cycle read in 3D. In 2D, it requires two extra cycles.
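As a quick piece of arithmetic on those cycle counts, here is a minimal sketch. The 9-cycle and two-extra-cycle figures come from the presentation. The clock frequency passed to access_time_ns is a hypothetical placeholder, since the frequency of this design is not given here.

```python
def access_time_ns(cycles, freq_ghz):
    """Convert a cycle count to nanoseconds at a given clock frequency."""
    return cycles / freq_ghz

CYCLES_3D = 9              # 1MB L2 read in 3D (from the presentation)
CYCLES_2D = CYCLES_3D + 2  # 2D needs two extra cycles (from the presentation)

# Relative reduction in L2 read latency, measured in cycles.
cycle_saving = (CYCLES_2D - CYCLES_3D) / CYCLES_2D
print(f"L2 read latency reduction: {cycle_saving:.1%}")

# Hypothetical clock frequency, used only to show the conversion.
ASSUMED_FREQ_GHZ = 3.0
print(f"3D: {access_time_ns(CYCLES_3D, ASSUMED_FREQ_GHZ):.2f} ns, "
      f"2D: {access_time_ns(CYCLES_2D, ASSUMED_FREQ_GHZ):.2f} ns")
```

The cycle-count comparison only translates directly into time if both versions run at a similar frequency.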

Rod explained some details about the Cadence 3D-IC solution. I won't repeat that since I've covered that extensively, for example in my post John Park's Webinar on Chiplets from a few months ago.

This is the flow that was used for this design. There are some considerations about what goes where. Blocks that communicate frequently should be assigned to adjacent tiers, since that decreases the length of the inter-block connections. This both increases communication bandwidth and reduces power. But to keep the temperature profile within specified limits, blocks with high switching activity should not be placed directly on top of each other. The vertical connections were handled as virtual anchor cells, which are pairs, one on each die, conceptually aligned in 3D (see the example diagram).
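To make the two competing placement rules concrete, here is a toy sketch. It is not the Cadence flow, just a brute-force search over a tiny hypothetical example: four made-up blocks, two tiers, two slots per tier. The cost function, the traffic and activity numbers, and the thermal limit are all assumptions chosen for illustration.

```python
from itertools import permutations

def placement_cost(placement, traffic, activity, hot_limit, thermal_weight=100.0):
    """placement maps block -> (tier, slot). Lower cost is better."""
    cost = 0.0
    # Wire cost: traffic times lateral distance. Blocks stacked in the same
    # slot are treated as having zero-length connections (simplification).
    for (a, b), amount in traffic.items():
        cost += amount * abs(placement[a][1] - placement[b][1])
    # Thermal cost: penalize slots whose stacked activity exceeds the limit.
    stacked_activity = {}
    for block, (_tier, slot) in placement.items():
        stacked_activity[slot] = stacked_activity.get(slot, 0) + activity[block]
    for total in stacked_activity.values():
        if total > hot_limit:
            cost += thermal_weight * (total - hot_limit)
    return cost

def best_placement(blocks, traffic, activity, hot_limit, n_slots):
    """Exhaustive search; only sensible for a handful of blocks."""
    sites = [(tier, slot) for tier in (0, 1) for slot in range(n_slots)]
    best_cost, best = None, None
    for perm in permutations(sites, len(blocks)):
        candidate = dict(zip(blocks, perm))
        cost = placement_cost(candidate, traffic, activity, hot_limit)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, candidate
    return best_cost, best

# Hypothetical blocks: A and C are "hot", B and D are "cool".
blocks = ["A", "B", "C", "D"]
traffic = {("A", "B"): 10, ("C", "D"): 10, ("A", "C"): 1}
activity = {"A": 8, "B": 2, "C": 8, "D": 2}

cost, placement = best_placement(blocks, traffic, activity, hot_limit=12, n_slots=2)
print(cost, placement)
```

In this example the search stacks each hot block over the cool block it talks to most, and keeps the two hot blocks in different slots. A real flow works on a full netlist with proper timing and thermal models rather than a toy cost like this.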

The design is actually done with the virtual anchor cells connected by a dummy wire that doesn't really exist. Eventually, that wire is removed and the two die flipped. But in the meantime, all the 2D design algorithms work normally.
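One way to picture the anchor pairs and the flip is the small sketch below. It is a geometric illustration only, not the data model used by the tools. The AnchorPair class, the function names, and the coordinates are all hypothetical. The one thing it relies on is plain geometry: flipping a die face-down about its vertical axis maps x to (width - x) in that die's own frame.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnchorPair:
    """A pair of anchor cells sharing one position in the assembled stack."""
    name: str
    x_um: float
    y_um: float

def flip_about_y(xy, die_width_um):
    """Mirror a point about the die's vertical centerline."""
    x, y = xy
    return (die_width_um - x, y)

def bottom_die_xy(pair):
    """The bottom die is not flipped, so it uses the stack frame directly."""
    return (pair.x_um, pair.y_um)

def top_die_xy(pair, die_width_um):
    """Where the anchor sits in the top die's own layout, before the flip."""
    return flip_about_y((pair.x_um, pair.y_um), die_width_um)

# Hypothetical die width and anchor position.
DIE_WIDTH_UM = 1000.0
pair = AnchorPair("anchor_0", x_um=250.0, y_um=400.0)

top_layout = top_die_xy(pair, DIE_WIDTH_UM)
after_flip = flip_about_y(top_layout, DIE_WIDTH_UM)
print(top_layout, after_flip, bottom_die_xy(pair))
```

After the flip, the top-die anchor lands back on the bottom-die anchor, which is the sense in which the pair is aligned in 3D.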

Since both die are bonded face-to-face (and are the same size), traditional flip-chip packaging will not work since both "top" and "bottom" of the stack are actually the backsides of die. Power was handled with through-silicon-vias (TSVs) going through the bottom die. It was then spread out through the bottom die, and across the wafer bond to power the top die.

Shawn also went into a lot of detail about constructing the clock tree across the two die using Innovus Implementation and the CCOpt tool. I'm going to skip that as being too much of a deep dive for a post like this. But the clock tree was better than in the 2D implementation, with 18% lower clock latency, about half the number of clock buffers, and 27% lower clock tree power. What's not to like?

Even a 2D microprocessor requires some level of thermal analysis. It is even more essential for a design like this since the bottom die is sandwiched between the package substrate and the top die, so there are more limited paths for heat to "escape". The top die is in thermal contact with the heatsink, so it is less of a challenge. The Celsius Thermal Solver was used to create heatmaps.

The heatmap above shows the N1 in 2D on the left (that was used to develop models of the package and heatsink), and the two folded N1 die on the right. Celsius shows that the steady-state temperature when running at 'maxpower' is 6°C higher than for the 2D N1. In reality, it might be lower since 'maxpower' is a power-virus vector that goes beyond any realistic workload.

Voltus was used to do IR analysis. This is more critical than ever since all the power for the top die passes through the bottom die. Indeed, Voltus showed that most IR drop is on via pillars stacked on top of the TSV (see the diagrams earlier).
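The reason the top die is the worry can be seen with a very simple series-resistance sketch of the power path. This is nothing like a real Voltus analysis, which works on the extracted power grid. The segment names follow the description above, but every resistance, the current, and the supply voltage are made-up values, chosen so that the via pillar dominates as reported.

```python
def ir_drop_report(segments_mohm, current_a, vdd_v):
    """Walk a series power path and report the drop in each segment.

    segments_mohm: ordered list of (name, resistance in milliohms)
    Returns (per-segment drops in mV, total drop in mV, percent of vdd,
    name of the worst segment).
    """
    # amps x milliohms gives millivolts directly
    drops_mv = {name: current_a * r_mohm for name, r_mohm in segments_mohm}
    total_mv = sum(drops_mv.values())
    percent = 100.0 * (total_mv / 1000.0) / vdd_v
    worst = max(drops_mv, key=drops_mv.get)
    return drops_mv, total_mv, percent, worst

# Hypothetical path from package bump to a cell on the top die.
path = [
    ("tsv", 20.0),
    ("via_pillar", 200.0),
    ("bottom_die_grid", 50.0),
    ("hybrid_bond", 10.0),
    ("top_die_grid", 50.0),
]

drops_mv, total_mv, percent, worst = ir_drop_report(path, current_a=0.1, vdd_v=1.0)
print(f"total drop {total_mv:.1f} mV ({percent:.1f}% of supply), worst segment: {worst}")
```

Because every segment is in series, anything drawn by the top die sees the sum of all the drops, which is why the via pillars above the TSVs are the natural place to optimize.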

The final conclusions:

A comprehensive study on a signoff quality physical design of a 3D high-performance microprocessor, Neoverse N1 CPU, using face-to-face (F2F) wafer-bonding technology
Logic over memory partitioning achieving 2-cycle lower L2 access than 2D
RTL to signoff using a complete Cadence digital flow for 3D CPU implementation achieving comparable frequency (<5%) and substantial area/power benefits vs. 2D design as the pioneer in the industry
Clock tree synthesis demonstrates 18% lower clock latency, ~50% fewer clock buffers, and 27% lower clock tree power
Detailed 3D PDN and thermal analysis complete:
- PDN shows a worst-case drop of 6.2% located at the bottom logic-die, further optimization on the via pillars stacked from TSV can reduce the impact with minimal effect on PPA
- Thermal analysis shows worst-case peak temperature rise to be 6 degrees higher than 2D, the real-world impact is likely lower and can be mitigated with advanced cooling techniques
The measured data from Trishul provided a solid silicon proof point for the 3D stacking technology applied here