AWS and Cerebras Systems have announced a collaboration to bring the fastest AI inference performance to generative AI applications and large language model workloads. The solution, deploying in AWS data centers and accessible exclusively via Amazon Bedrock, combines AWS Trainium for prefill processing with Cerebras CS-3 systems for decode acceleration, connected by high-speed Elastic Fabric Adapter networking.
“Inference is where AI delivers real value to customers, but speed remains a critical bottleneck for demanding workloads like real-time coding assistance and interactive applications,” said David Brown, Vice President, Compute & ML Services, AWS. “What we’re building with Cerebras solves that: by splitting the inference workload across Trainium and CS-3, and connecting them with Amazon’s Elastic Fabric Adapter, each system does what it’s best at. The result will be inference that’s an order of magnitude faster and higher-performing than what’s available today.”
The partnership introduces inference disaggregation, separating the two distinct phases of AI inference: prefill (prompt processing), which is parallel and compute-intensive, and decode (output generation), which is serial, memory-bandwidth heavy, and often dominates total inference time. By assigning prefill to Trainium—optimized for scalable, cost-efficient processing—and decode to Cerebras CS-3—with thousands of times greater memory bandwidth than leading GPUs—the solution maximizes throughput for each phase.
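To make the split concrete, here is a minimal, runnable Python sketch of the disaggregation pattern. It is illustrative only: the names (KVCache, prefill, decode) and the toy arithmetic are assumptions for exposition, not an AWS, Bedrock, or Cerebras API; in the actual solution the cache would move from the Trainium tier to the CS-3 tier over EFA.

```python
# Minimal sketch of the disaggregation pattern described above. All names
# and the toy arithmetic are illustrative assumptions, not a published API.

from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value cache: produced by prefill, consumed by decode."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)


def prefill(prompt_tokens: list) -> KVCache:
    """Prompt processing: parallel and compute-bound, so it suits a
    throughput-optimized accelerator (Trainium's role in this design)."""
    cache = KVCache()
    for tok in prompt_tokens:         # in a real model this is one batched matmul
        cache.keys.append(tok)
        cache.values.append(tok * 2)  # stand-in for attention K/V projections
    return cache


def decode(cache: KVCache, max_new_tokens: int) -> list:
    """Output generation: serial and memory-bandwidth-bound, because every
    step re-reads the growing cache (the CS-3's role in this design)."""
    out = []
    for _ in range(max_new_tokens):
        next_tok = sum(cache.values) % 50_257  # stand-in for forward pass + sampling
        out.append(next_tok)
        cache.keys.append(next_tok)
        cache.values.append(next_tok * 2)
    return out


# In the disaggregated deployment, the cache is shipped from the prefill tier
# to the decode tier over the interconnect (EFA here) between these two calls.
kv = prefill([101, 2023, 2003])
print(decode(kv, max_new_tokens=4))
```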
The Cerebras CS-3, powered by the Wafer Scale Engine 3 (WSE-3), the world’s largest AI chip, excels at decode operations, especially for reasoning models and agentic workloads that generate more tokens per request. Trainium, Amazon’s purpose-built AI chip, handles prefill efficiently and already powers major customers, including Anthropic (AWS’s primary training partner) and OpenAI (which has committed to 2 gigawatts of capacity). High-bandwidth Elastic Fabric Adapter networking ensures low-latency communication between the two systems.
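A back-of-envelope calculation shows why memory bandwidth dominates decode: each generated token requires streaming the model weights through memory once, so per-stream decode throughput is roughly bandwidth divided by model size. The figures below are rough assumptions drawn from public spec sheets (about 3.3 TB/s of HBM bandwidth on a current flagship GPU; Cerebras cites roughly 21 PB/s of on-wafer SRAM bandwidth for the WSE-3), not performance claims for this solution.

```python
# Rough model of decode throughput: tokens/sec per stream is approximately
# memory bandwidth / bytes streamed per token (weights dominate for one stream).
# Both bandwidth figures below are assumptions from public spec sheets.

def decode_tokens_per_sec(bandwidth_gb_per_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode rate when weight reads dominate."""
    return bandwidth_gb_per_s / model_gb

MODEL_GB = 140  # e.g. a 70B-parameter model at 16-bit precision

gpu_hbm = decode_tokens_per_sec(3_300, MODEL_GB)         # ~3.3 TB/s HBM-class GPU
wse3_sram = decode_tokens_per_sec(21_000_000, MODEL_GB)  # ~21 PB/s on-wafer SRAM

print(f"HBM GPU: ~{gpu_hbm:,.0f} tokens/s per stream")
print(f"WSE-3:   ~{wse3_sram:,.0f} tokens/s per stream")
print(f"Ratio:   ~{wse3_sram / gpu_hbm:,.0f}x")
```

Neither device reaches these bounds in practice, but the ratio is what underlies the “thousands of times greater memory bandwidth” comparison above.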
“Partnering with AWS to build a disaggregated inference solution will bring the fastest inference to a global customer base,” said Andrew Feldman, Founder and CEO of Cerebras Systems. “Every enterprise around the world will be able to benefit from blisteringly fast inference within their existing AWS environment.”
Built on the AWS Nitro System, the solution maintains enterprise-grade security, isolation, and consistency. This pioneering approach positions AWS as the first cloud provider to offer Cerebras’s disaggregated inference technology, enabling developers and enterprises to achieve dramatically faster token generation for real-time applications without leaving the AWS ecosystem.
The collaboration addresses growing demand for high-speed inference in interactive and productivity-focused use cases. By optimizing each inference stage on specialized hardware, the Trainium + CS-3 solution delivers significantly higher performance and capacity than traditional GPU-based approaches, helping customers realize value from generative AI faster.
About Amazon Web Services
Amazon Web Services (AWS) is guided by customer obsession, pace of innovation, commitment to operational excellence, and long-term thinking. By democratizing technology for nearly two decades and making cloud computing and generative AI accessible to organizations of every size and industry, AWS has built one of the fastest-growing enterprise technology businesses in history. Millions of customers trust AWS to accelerate innovation, transform their businesses, and shape the future. With the most comprehensive AI capabilities and global infrastructure footprint, AWS empowers builders to turn big ideas into reality. Learn more at aws.amazon.com and follow @AWSNewsroom.
About Cerebras Systems
Cerebras Systems builds the fastest AI infrastructure in the world. We are a team of pioneering computer architects, computer scientists, AI researchers, and engineers of all types. We have come together to make AI blisteringly fast through innovation and invention, because we believe that when AI is fast, it will change the world. Our flagship technology, the Wafer Scale Engine 3 (WSE-3), is the world’s largest and fastest AI processor. At 56 times the size of the largest GPU, the WSE-3 uses a fraction of the power per unit of compute while delivering inference and training more than 20 times faster than the competition. Leading corporations, research institutes, and governments on four continents choose Cerebras to run their AI workloads.