Amazon Trainium3 Raises AI Stakes, Hints Nvidia Tie
Amazon Trainium3 debuts with large per-chip compute and memory gains; UltraServers and Red Hat support shift cloud AI cost and scale dynamics.

KEY TAKEAWAYS
- Trainium3 has eight NeuronCore-v4 cores, 144 GiB of device memory, and 4.9 TB/sec of device memory bandwidth.
- UltraServers with Trainium3 deliver 4.4x more compute and 3.9x higher memory bandwidth than Trainium2-based systems.
- Red Hat said its AI Inference Server can yield 30–40% better price-performance versus comparable GPU-based EC2 instances.
Amazon said Trainium3, its fourth-generation AI training chip, became generally available on Dec. 2, 2025, at re:Invent. The chip delivers significant per-device compute and memory gains, and Amazon Web Services (AWS) said it will power UltraServers for production AI workloads.
Trainium3 Hardware Advances and System Performance
Trainium3 features eight NeuronCore-v4 cores, 144 GiB of on-device memory, and 4.9 terabytes per second (TB/sec) of device memory bandwidth. This represents a 1.5x increase in memory capacity and a 1.7x boost in memory bandwidth compared with Trainium2. Direct memory access (DMA) bandwidth also improved by 1.4x to 4.9 TB/sec.
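The stated ratios are easy to sanity-check against Trainium2's commonly cited figures of 96 GiB of device memory and 2.9 TB/sec of memory bandwidth; those baselines are not stated in this article and are assumed here for illustration.

```python
# Sanity-check the generational ratios from the stated Trainium3 specs.
# Trainium2 baselines (96 GiB, 2.9 TB/sec) are assumed from AWS's previously
# published figures and are not taken from this article.
trn3_mem_gib, trn2_mem_gib = 144, 96
trn3_bw_tbs, trn2_bw_tbs = 4.9, 2.9

print(f"Memory capacity gain:  {trn3_mem_gib / trn2_mem_gib:.1f}x")  # 1.5x
print(f"Memory bandwidth gain: {trn3_bw_tbs / trn2_bw_tbs:.1f}x")    # ~1.7x
```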
The chip adds a NeuronLink-v4 interconnect providing 2.56 TB/sec per-device bandwidth for scale-out training and pooled memory. It uses 16 collective communication cores (CC-Cores) to coordinate data transfer across chips and servers.
Trainium3 delivers 2,517 trillion floating-point operations per second (TFLOPS) for FP8 and the new MXFP4 numeric format, 671 TFLOPS for BF16/FP16/TF32, and 183 TFLOPS for FP32. FP8 throughput roughly doubles that of the prior generation, while MXFP4 is introduced as a new capability.
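MXFP4 is the 4-bit member of the Open Compute Project's microscaling (MX) family, in which blocks of 32 FP4 (E2M1) values share a single 8-bit exponent scale. The article does not detail Trainium3's implementation; the sketch below only works out the format's storage cost as defined by the public MX specification.

```python
# Effective storage cost of MXFP4 under the OCP microscaling (MX) spec:
# each block of 32 FP4 (E2M1) elements shares one 8-bit (E8M0) scale.
# Based on the public MX spec, not on AWS documentation.
element_bits = 4   # FP4 (E2M1) payload per value
scale_bits = 8     # shared E8M0 exponent per block
block_size = 32    # elements per block

bits_per_element = element_bits + scale_bits / block_size
print(f"MXFP4 storage: {bits_per_element} bits/element")        # 4.25
print(f"vs. FP8 footprint: {bits_per_element / 8:.0%}")          # ~53%
```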
The architecture enhances programmability with support for dynamic shapes, control flow, user-programmable rounding modes, and custom operators implemented via GPSIMD engines, broadening model and operator compatibility.
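Custom operators on Neuron devices are typically written against the Neuron Kernel Interface (NKI) in the Neuron SDK. Below is a minimal element-wise add kernel sketched from the public NKI examples; the API names (`nki.jit`, `nl.load`, `nl.store`, `nl.ndarray`) are assumed from the Neuron SDK documentation, not from this announcement, and exact signatures may vary by SDK release.

```python
# Minimal NKI kernel sketch: element-wise add on a Neuron device.
# API names follow the public Neuron SDK docs; details may differ by release.
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def add_kernel(a_input, b_input):
    # Allocate the output tensor in device HBM.
    out = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)
    # Load operands from HBM into on-chip memory, compute, and store back.
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)
    nl.store(out, value=a_tile + b_tile)
    return out
```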
At the system level, AWS’s UltraServers powered by Trainium3 offer aggregate gains over the previous generation, including 4.4x more compute, 3.9x higher memory bandwidth, and 3.5x more tokens per megawatt, an efficiency metric for large-scale training deployments.
Ecosystem Integration and Scale
Red Hat announced that its Red Hat AI Inference Server, powered by vLLM, will run on AWS Inferentia2 and Trainium3 chips. The company said this integration delivers 30–40% better price-performance than comparable GPU-based Amazon EC2 instances, aiming to provide a common inference layer supporting any generative AI model.
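Open-source vLLM already exposes an AWS Neuron backend, which is presumably the path such an integration takes. A minimal serving sketch is shown below; the model name, device flag, and parallelism settings are illustrative assumptions, not Red Hat's actual AI Inference Server configuration.

```python
# Illustrative vLLM sketch targeting AWS Neuron devices (Trainium/Inferentia).
# Settings are assumptions for demonstration; the article does not describe
# Red Hat's configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical example model
    device="neuron",            # route execution to Neuron cores
    tensor_parallel_size=2,     # shard across NeuronCores (illustrative)
    max_num_seqs=8,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the Trainium3 announcement."], params)
print(outputs[0].outputs[0].text)
```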
AWS reported deploying more than 1 million Trainium processors, with customers increasingly using the chips for inference workloads, expanding beyond their original training-focused role.
Over the past year, AWS added 3.8 gigawatts of data-center capacity, the largest increase among its competitors, and expanded its private network backbone by 50% to more than 9 million kilometers of terrestrial and subsea cable. These infrastructure investments underpin the scaling of production AI workloads.
AWS also previewed Trainium4, targeting sixfold FP4 performance, fourfold memory bandwidth, and double the memory capacity compared with Trainium3. The company signaled plans to integrate Nvidia technology into future Trainium generations, indicating a multi-vendor silicon strategy rather than exclusive displacement.
Separately, AWS introduced "AI Factories," a managed, customer-specific infrastructure service built and scaled by AWS. This service will run on Trainium and other custom silicon, providing a route for customers to deploy production AI at scale.
Red Hat described its AI Inference Server as designed "to deliver a common inference layer that can support any gen AI model," helping customers achieve higher performance, lower latency, and cost-effective scaling for production AI deployments.





