Arm has not one but two new high-performance CPUs destined for 2021 mobile SoCs. First is the anticipated Cortex-A78, building on the standard Cortex-A roadmap. The surprise announcement is the Cortex-X1, a powerhouse CPU designed with partners in Arm’s new CXC program, which replaces “Built on Arm Cortex.”
Arm’s Cortex-A78 and Cortex-X1 are both based on the previous generation Cortex-A77. However, the two Arm processors have different design goals. The Cortex-A78 focuses on delivering more performance per watt within a slightly smaller area than before. The Cortex-X1 discards these usual concerns in the pursuit of maximum performance.
Both CPUs are destined for premium-tier SoCs and smartphones in 2021, perhaps even in conjunction with one another. However, not every 2021 chipset will necessarily offer the extreme performance of the Cortex-X1. It’s only available to participants of Arm’s CXC program. But more on that later. Let’s see what’s new for 2021 smartphone CPUs.
Arm Cortex-A78: Efficiency is the game
Let’s start with metrics for you numbers junkies. The Arm Cortex-A78 promises a 20% boost to sustained performance over the Cortex-A77 for a 1W power budget, thanks to the architecture changes, available clock speed boosts, and the move from 7nm to 5nm manufacturing. More impressively, a 2.1GHz 5nm Cortex-A78 consumes up to 50% less power than a 2.3GHz 7nm Cortex-A77, according to Arm. That’s a boon for battery life.
On a like-for-like process, the Cortex-A78’s performance gains are a little less impressive. There’s just a 7% typical performance improvement from the revised micro-architecture. However, that comes with a 4% reduction in power consumption, so expect the Cortex-A78 to sustain its peak performance a little longer than the A77 and A76. The A78 is also 5% smaller, resulting in a 15% area saving for a quad-core cluster. That frees up more room for extra GPU, NPU, or other components on silicon, or just helps keep prices down.
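To see what those two like-for-like figures mean together, here is a rough back-of-the-envelope sketch in Python. It assumes the 7% performance gain and the 4% power reduction apply at the same operating point, which Arm’s figures don’t strictly guarantee, and the function name is purely illustrative.

```python
# Arm's like-for-like figures quoted above (Cortex-A78 vs Cortex-A77, same process).
perf_gain = 0.07      # 7% typical performance improvement
power_saving = 0.04   # 4% reduction in power consumption

def perf_per_watt_gain(perf_gain, power_saving):
    """Relative change in performance per watt.

    Assumption: both figures hold at the same operating point.
    """
    return (1 + perf_gain) / (1 - power_saving) - 1

iso_process_gain = perf_per_watt_gain(perf_gain, power_saving)
print(f"Performance per watt: +{iso_process_gain:.1%}")
```

Under that assumption, the micro-architecture alone is worth roughly 11% to 12% more performance per watt, before the move to 5nm is counted.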
Turning to the micro-architecture, Arm has made a number of significant changes. For starters, the Cortex-A78 comes with an optional smaller 32kB L1 cache configuration, which is where the majority of the space savings come in, although Arm’s partners can still opt for a more familiar 64kB L1 cache to boost the core’s performance further. Qualcomm did something similar with larger L2 caches for its Snapdragon Prime core, and the L2 cache remains configurable up to 512kB to balance performance, area, and power this generation.
To offset this smaller L1 memory, the branch predictor is better at covering irregular search patterns and is now capable of following two taken branches per cycle. This results in fewer L1 cache misses and helps hide pipeline bubbles to keep the core well fed. The pipeline is one cycle longer than the A77’s, ensuring the A78 hits a clock frequency target around 3GHz, but it’s still a 6-instruction-per-cycle design.
Cortex-A78 optimizes power and area, with more conservative performance improvements.
Arm also introduces a second integer multiply unit in the execution unit and an additional load Address Generation Unit (AGU) to increase the data load bandwidth by 50%. Other optimizations include more fused instructions and efficiency improvements to the instruction schedulers, register renaming structures, and the reorder buffer. The bottom line is that the Cortex-A78 is a leaner, more optimized CPU than the A77.
The Cortex-A78 targets peak efficiency over performance. That’s great for battery life but not so great for enthusiasts hoping that Android would close the gap with Apple next year. For that, you’ll want a phone powered by the Arm Cortex-X1.
Arm Cortex-X1: Ultimate performance
The Cortex-X1 is the first graduate of Arm’s new CXC program. With CXC, Arm’s partners take a performance point off the usual roadmap, and Arm designs a CPU for them. However, a partner must be in the program from the start to have access to the final product. This year’s collective approach is to seriously ramp up the performance of Arm’s Cortex lineup.
For Cortex-X1, Arm anticipates a 30% jump in performance compared to the Cortex-A77. This works out to an impressive 23% boost over the Cortex-A78 at integer crunching, making it a clear winner in demanding workloads. The Cortex-X1 also boasts double the machine learning prowess of these two CPUs.
Cortex-X1 answers calls for an Arm CPU with extreme performance.
It’s a significant change in approach, but that speed comes at the cost of a larger surface area and increased power. For Arm’s partners, this means less multi-threaded performance and efficiency per square millimeter of silicon. As such, it seems unlikely that smartphone SoCs will use quad Cortex-X1 clusters. We’re more likely to see a single Cortex-X1 paired up with three Cortex-A78s. Such a configuration only takes up 15% more area than a quad-core Cortex-A76 cluster while delivering that much sought after single-thread boost.
Achieving the Cortex-X1’s target performance required a number of major micro-architecture changes. For starters, the core has a lot more memory than the A77 and A78. The L2 cache is variable up to 1MB and has double the bandwidth to maximize the performance benefit, while the shared L3 cache can hit 8MB, double previous generations. Interestingly, there’s a specific Dynamic Shared Unit (DSU) included with the Cortex-X1 to allow for the 8MB configuration, which shares that memory with any Cortex-A78s in the cluster as well.
The larger cache is complemented by a more powerful execution core. SIMD floating-point instruction processing doubles to 4x 128 bits of bandwidth, producing the 2x machine learning uplift. The processor also boasts a 40% increase to its out-of-order execution window, which now holds 224 entries. This exposes more instruction-level parallelism, with the aim of having the processor do more at once.
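The execution-core figures can be sanity-checked with some quick arithmetic. This Python sketch assumes the 40% window increase is measured against the Cortex-A77 and that “doubles” means the previous design had two 128-bit SIMD pipes. The variable names are illustrative only.

```python
# Figures quoted in the text for the Cortex-X1.
x1_window_entries = 224   # out-of-order execution window
window_increase = 0.40    # 40% larger than the baseline (assumed: Cortex-A77)
x1_simd_pipes = 4         # 4x 128-bit SIMD floating-point pipes
simd_pipe_bits = 128

# Implied baseline window size, rounded to a whole number of entries.
baseline_window_entries = round(x1_window_entries / (1 + window_increase))

# Peak SIMD width per cycle, and the implied previous width if the X1 doubles it.
x1_simd_bits = x1_simd_pipes * simd_pipe_bits
baseline_simd_bits = x1_simd_bits // 2

print(f"Implied baseline window: {baseline_window_entries} entries")
print(f"X1 SIMD width: {x1_simd_bits} bits/cycle (baseline {baseline_simd_bits})")
```

On these assumptions the baseline window works out to 160 entries, and the X1 can process 512 bits of SIMD data per cycle versus 256 before, which lines up with the claimed 2x machine learning uplift.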
The big X1 core demands more power and silicon area.
Keeping all this fed with things to do is a 50% larger L0 branch target buffer, a 5-wide I-cache instruction fetch, and 8 micro-operation fetch from the dedicated Mop cache. That’s double the Cortex-A77’s fetching capacity and a 33% increase over the A78’s 6-wide dispatch bandwidth. In other words, the Cortex-X1 can do a lot more with each clock cycle than previous Arm CPU cores.
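For a rough feel of what the wider front end means, the sketch below compares the two widths quoted above and turns them into a theoretical upper bound on micro-operations per second. The 3.0GHz clock is a hypothetical figure used for both cores purely for comparison, and real-world throughput will sit well below these peaks.

```python
# Front-end widths quoted in the text, in micro-operations (Mops) per cycle.
a78_width = 6   # Cortex-A78's 6-wide dispatch
x1_width = 8    # Cortex-X1's 8-Mop fetch from the Mop cache

def peak_mops_per_second(width, clock_ghz):
    """Theoretical ceiling: width x clock. Ignores stalls, misses, and dependencies."""
    return width * clock_ghz * 1e9

uplift = x1_width / a78_width - 1
hypothetical_clock_ghz = 3.0  # assumption for illustration, not an Arm figure for the X1

print(f"Width uplift: {uplift:.0%}")
print(f"A78 ceiling: {peak_mops_per_second(a78_width, hypothetical_clock_ghz):.1e} Mops/s")
print(f"X1 ceiling:  {peak_mops_per_second(x1_width, hypothetical_clock_ghz):.1e} Mops/s")
```

The width ratio reproduces the 33% figure. The per-second ceilings only show how that extra width scales with clock speed.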
Arm Cortex-A78 vs Cortex-X1
The bulk of Arm’s Cortex-A78 performance gains comes from the move to 5nm, making it the most conservative generational improvement we’ve seen for a few years. Instead, area and power optimizations are the key talking points, which is, of course, good for gadget battery life. Crucially, this design choice complements the powerhouse Cortex-X1 in mixed cluster configurations.
A tri-tier SoC with a single X1, three A78s, and four A55s could deliver a great balance of performance and efficiency for smartphones, propelling Android performance up to compete with Apple’s custom CPUs. A multi-core Cortex-X1 SoC is also an exciting prospect for the Windows on Arm ecosystem, driving capabilities into the higher-end of the computing market.
We don’t know which manufacturers have the Cortex-X1 yet, but Qualcomm seems likely.
However, the nature of the CXC program creates the new prospect that not every mobile SoC designer has access to Arm’s highest-performing core. We don’t know who is in the program yet, but Qualcomm seems like a sure thing as it previously participated in Built on Arm Cortex for Kryo. This could give the next-gen Snapdragon an edge on its competitors. The Cortex-A78 scales up with larger cache configurations for those who need the extra performance, but CXC partners will have a notable advantage.
The arrival of not one, but two big Cortex-A cores marks a major shift in strategy for Arm that will drive major product differentiation in next year’s smartphones and always-connected laptops. Keep an eye on SoC announcements from the major players towards the end of 2020 to see how this pans out.