Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y.X. Wei

DeepSeek-AI

Beijing, China

Abstract

The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.

CCS Concepts

• Computer systems organization → Architectures.

Keywords

Large Language Model, Mixture-of-Experts, Deep Learning, FP8 Mixed-Precision Training, Multi-Plane Network, Co-Design

ACM Reference Format:

Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y.X. Wei. 2025. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25), June 21–25, 2025, Tokyo, Japan. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3695053.3731412

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ISCA ’25, June 21–25, 2025, Tokyo, Japan

© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 979-8-4007-1261-6/2025/06

https://doi.org/10.1145/3695053.3731412

1   Introduction

1.1   Background

Large Language Models (LLMs) have undergone rapid evolution in recent years, driven by iterative advancements in model design, computational power, and data availability. In 2024, groundbreaking models such as GPT4o [59], LLaMa-3 [3], Claude 3.5 Sonnet [8], Grok-2 [73], Qwen2.5 [75], Gemini-2 [37] and our DeepSeek-V3 [26] have showcased remarkable progress, further narrowing the gap towards Artificial General Intelligence (AGI). As the Scaling Laws [45] show, increasing model size, training data, and computational resources leads to substantial improvements in model performance, underscoring the pivotal role of scaling in advancing AI capabilities. Collectively, these developments have ushered in an era where scaling model size and computational power is seen as the key to unlocking higher levels of intelligence.

In recent developments, reasoning models such as OpenAI's o1/o3 series [60, 61], DeepSeek-R1 [28], Claude-3.7 Sonnet [9], Gemini 2.5 Pro [38], Seed1.5-Thinking [68] and Qwen3 [71] have demonstrated not only the benefits conferred by large-scale architectures, but also the necessity of improving inference efficiency, particularly in handling longer contexts and achieving greater reasoning depth. These advancements underscore the need for faster and more efficient inference, consequently placing ever-increasing demands on computational resources.

To meet these challenges, industry leaders such as Alibaba, ByteDance, Google, xAI and Meta have deployed colossal training clusters [33, 42, 43, 56, 62, 74], featuring tens or even hundreds of thousands of GPUs or TPUs. While such massive infrastructures have enabled the development of state-of-the-art models, their exorbitant costs present significant barriers for smaller research teams and organizations. Despite these barriers, open-source startups such as DeepSeek [23–26, 28] and Mistral [41, 55] are also striving to develop state-of-the-art models. Among them, DeepSeek has especially demonstrated that effective software-hardware co-design can enable cost-efficient training of large models, leveling the playing field for smaller teams.

Building on this tradition, DeepSeek-V3 [26] represents a new milestone in cost-effective training. By leveraging just 2,048 NVIDIA H800 GPUs, DeepSeek-V3 achieves state-of-the-art performance. This achievement aligns with the commitment to advance AI through practical and scalable solutions, as previously demonstrated in the cost-effective architecture of Fire-Flyer AI-HPC [7]. The practices and insights derived from DeepSeek-V3 demonstrate how existing hardware resources can be harnessed to their fullest potential, offering valuable lessons for the broader AI and HPC communities.

Authors are listed in alphabetical order of their first names. Yuqing Wang and Liyue Zhang are the corresponding authors of this paper (e-mail: research@deepseek.com).

1.2 Objectives

This paper does not aim to reiterate the detailed architectural and algorithmic specifics of DeepSeek-V3, which are extensively documented in its technical report [26]. Instead, it adopts a dual perspective—spanning hardware architecture and model design—to explore the intricate interplay between them in achieving cost-efficient large-scale training and inference. By examining this synergy, we aim to provide actionable insights for scaling LLMs efficiently without sacrificing performance or accessibility.

Specifically, the paper focuses on:

Hardware-Driven Model Design: Analyze how hardware features, such as FP8 low-precision computation and scale-up/scale-out network properties, informed the architectural choices in DeepSeek-V3.

Mutual Dependencies Between Hardware and Models: Investigate how hardware capabilities shape model innovation and how the evolving demands of LLMs drive the need for next-generation hardware.

Future Directions for Hardware Development: Derive actionable insights from DeepSeek-V3 to guide the co-design of future hardware and model architectures, paving the way for scalable, cost-efficient AI systems.

1.3   Structure of this Paper

The remainder of this paper is organized as follows. Section 2 explores the design principles underpinning the DeepSeek-V3 model architecture, highlighting key innovations such as Multi-head Latent Attention, Mixture-of-Experts optimizations and the Multi-Token Prediction Module. Section 3 illustrates how our model architecture pursues low-precision computation and communication. Section 4 covers scale-up interconnection optimizations, discusses scale-up/scale-out convergence, and explores how hardware features influence parallelism and expert selection strategies. Section 5 focuses on scale-out network optimizations, including multi-plane network co-designs and low-latency interconnects. Beyond the current limitations and future suggestions mentioned in Sections 3–5, Section 6 elaborates on further critical insights from DeepSeek-V3 and identifies directions for future hardware and model co-design.

2 Design Principles for DeepSeek Models

The development of DeepSeek-V3 exemplifies a hardware-aware approach to scaling LLMs, where each design decision was carefully aligned with hardware constraints to optimize performance and cost efficiency.

As shown in Figure 1, DeepSeek-V3 employs the DeepSeekMoE [27] and Multi-head Latent Attention (MLA) architectures. DeepSeekMoE unlocks the potential of the MoE architecture, while MLA drastically reduces memory consumption by compressing Key-Value (KV) caches. In addition, DeepSeek-V3 incorporates FP8 mixed-precision training, significantly lowering computational costs and making large-scale training more practical without compromising model quality. To improve inference speed, DeepSeek-V3 integrates speculative decoding based on its Multi-Token Prediction Module, which significantly increases the generation speed. Beyond model architecture, we also explored cost-efficient AI infrastructure by deploying a Multi-Plane two-layer Fat-Tree network to replace a traditional three-layer Fat-Tree topology, reducing cluster networking costs.

These innovations aim to address three core challenges in scaling LLMs—memory efficiency, cost-effectiveness, and inference speed—which are explored in detail in the following subsections.

2.1   Memory Efficiency

LLMs generally require significant memory resources, with memory demands increasing by more than 1000% per year. In contrast, the growth rate of high-speed memory (e.g., HBM) capacity is much slower, typically less than 50% per year [35]. While multi-node parallelism is a viable solution to address memory limitations, optimizing memory usage at the source remains a crucial and effective strategy.

2.1.1 Low-Precision Models. Compared to models that use BF16 for weights, FP8 halves memory consumption, effectively alleviating the AI memory wall challenge. A detailed discussion of low-precision techniques is provided in Section 3, Low-Precision Driven Design.

2.1.2 Reducing KV Cache with MLA. For LLM inference, user requests often involve multi-turn conversations. The KV cache addresses this challenge by caching the Key and Value vectors of previously processed tokens, eliminating the need to recompute them for subsequent tokens. During each inference step, the model only computes the Key and Value vectors for the current token and performs attention computation by combining them with the cached Key-Value pairs from the history. This incremental computation reduces the complexity of generating each token to O(N), making it efficient when processing long sequences or multi-turn inputs. However, it introduces a memory-bound bottleneck because the computation shifts from GEMM to GEMV, which has a much lower compute-to-memory ratio. With modern hardware offering hundreds of TFLOPS, GEMV quickly becomes limited by memory bandwidth, making memory access the primary bottleneck.

To address this bottleneck, we employ Multi-head Latent Attention (MLA) [25], which compresses the KV representations of all attention heads into a smaller latent vector using a projection matrix that is jointly trained with the model. During inference, only the latent vector needs to be cached, significantly reducing memory consumption compared to storing the KV cache for all attention heads.
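To make the compression concrete, the following minimal NumPy sketch caches only one small latent vector per token and expands it into per-head keys and values at attention time. The dimensions and the names W_q, W_down, W_up_k, and W_up_v are illustrative assumptions, and details such as the decoupled RoPE path are omitted; this is a conceptual sketch, not DeepSeek-V3's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64   # hypothetical sizes

# Jointly trained projections (random stand-ins here).
W_q    = rng.standard_normal((d_model, n_heads * d_head)) * 0.02
W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # compress KV input to a latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # expand latent to per-head K
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # expand latent to per-head V

latent_cache = []  # one d_latent vector per past token: the only state kept across steps

def decode_step(x):
    """One decoding step for hidden state x of shape (d_model,)."""
    latent_cache.append(x @ W_down)                       # cache 64 values instead of 2*8*128
    C = np.stack(latent_cache)                            # (seq, d_latent)
    K = (C @ W_up_k).reshape(len(C), n_heads, d_head)     # per-head keys, expanded on the fly
    V = (C @ W_up_v).reshape(len(C), n_heads, d_head)
    q = (x @ W_q).reshape(n_heads, d_head)
    scores = np.einsum("shd,hd->hs", K, q) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                     # softmax over the sequence
    return np.einsum("hs,shd->hd", w, V).reshape(-1)      # (n_heads * d_head,)

out = decode_step(rng.standard_normal(d_model))
```

In this toy configuration, each cached token costs 64 values instead of 2 × 8 × 128 = 2,048 for a full per-head KV cache, a 32x reduction.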

In addition to MLA, several other approaches have been proposed to reduce the size of the KV cache. These methods are highly valuable and provide significant inspiration for advancements in memory-efficient attention mechanisms:

Shared KV (Grouped-Query Attention, GQA; Multi-Query Attention, MQA): Instead of maintaining separate KV pairs for each attention head, multiple heads share a single set of KV pairs, significantly compressing KV storage. Representative methods include GQA [5] and MQA [70].

Windowed KV: For long sequences, only a sliding window of KV pairs is retained in the cache, discarding results outside the window. While this reduces storage, it compromises long-context reasoning. Representative methods include Longformer [11] and related architectures.

Figure 1: Basic architecture of DeepSeek-V3. Built upon DeepSeek-V2’s MLA and DeepSeekMoE, a Multi-Token Prediction Module and FP8 mixed-precision training are introduced to enhance inference and training efficiency. The figure indicates the precision used for computations in different parts of the architecture. All components take inputs and outputs in BF16.

Quantized Compression: KV pairs are stored using low-bit  representations [40, 44, 52], further reducing memory usage. Quantization achieves significant compression with minimal  impact on model performance.

Table 1 compares the KV cache memory usage per token among DeepSeek-V3, Qwen-2.5 72B [75], and LLaMA-3.1 405B [4]. By adopting MLA, DeepSeek-V3 achieves a significant reduction in KV cache size, requiring only 70 KB per token, substantially less than LLaMA-3.1 405B's 516 KB and Qwen-2.5 72B's 327 KB. This reduction highlights the efficiency of MLA in compressing KV representations compared to GQA-based methods. The ability to achieve such a significant reduction in memory consumption makes DeepSeek-V3 particularly well-suited for scenarios involving long-context processing and resource-constrained environments, enabling more scalable and cost-effective inference.
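The per-token figures in Table 1 can be reproduced from the models' published configurations. The short calculation below assumes BF16 (2 bytes per element), an MLA cache of 512 latent plus 64 decoupled-RoPE dimensions over 61 layers for DeepSeek-V3, and 8 KV heads of dimension 128 over 80 layers (Qwen-2.5 72B) and 126 layers (LLaMA-3.1 405B); these configuration values are assumptions taken from the public model configs rather than from this paper.

```python
BYTES_BF16 = 2

def mla_kv_per_token(layers, latent_dim, rope_dim):
    # MLA caches one compressed latent (plus a decoupled RoPE key) per layer.
    return layers * (latent_dim + rope_dim) * BYTES_BF16

def gqa_kv_per_token(layers, kv_heads, head_dim):
    # GQA caches full K and V for each KV head per layer.
    return layers * kv_heads * head_dim * 2 * BYTES_BF16

sizes = {
    "DeepSeek-V3 (MLA)":    mla_kv_per_token(layers=61,  latent_dim=512, rope_dim=64),
    "Qwen-2.5 72B (GQA)":   gqa_kv_per_token(layers=80,  kv_heads=8, head_dim=128),
    "LLaMA-3.1 405B (GQA)": gqa_kv_per_token(layers=126, kv_heads=8, head_dim=128),
}
base = sizes["DeepSeek-V3 (MLA)"]
for name, b in sizes.items():
    print(f"{name:22s} {b/1000:8.3f} KB/token  {b/base:5.2f}x")
# Prints 70.272, 327.680 and 516.096 KB/token, matching Table 1.
```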

2.1.3 Future Directions and Perspectives on Resource-Efficient Techniques. While reducing the size of the KV cache is a promising method for improving memory efficiency, the quadratic complexity inherent in Transformer-based autoregressive decoding remains a formidable challenge, especially for extremely long contexts. Recent research efforts, such as Mamba-2 [21] and Lightning Attention [63], investigate linear-time alternatives that offer new possibilities for balancing computational cost and model performance. In addition, approaches such as sparse attention [76], which seek to compress and sparsely activate attention keys and values, represent another attempt at overcoming the computational challenges associated with attention. We look forward to collaborative progress with the broader community toward breakthroughs in this area.

2.2 Cost-Effectiveness of MoE Models

For sparse computing, we have developed DeepSeekMoE, an advanced Mixture of Experts (MoE) architecture, which is illustrated in the lower right part of Figure 1. The advantages of MoE models are twofold.

2.2.1 Reducing Computational Requirements for Training. The primary advantage of the MoE architecture lies in its ability to significantly reduce training costs. By selectively activating only a subset of expert parameters, MoE models allow the total parameter count to scale up dramatically while keeping computational requirements modest. For example, DeepSeek-V2 features 236B parameters, but only 21B parameters are activated per token. Similarly, DeepSeek-V3 expands to 671B parameters—nearly three times the size of V2—while keeping the activation per token at just 37B. In comparison, dense models such as Qwen2.5-72B and LLaMa3.1-405B require all parameters to be active during training.

Table 1: KV cache size comparison (BF16 precision): DeepSeek-V3 (MLA) largely reduces KV cache size compared to other models using GQA.

Model                  | KV Cache Per Token | Multiplier
DeepSeek-V3 (MLA)      | 70.272 KB          | 1x
Qwen-2.5 72B (GQA)     | 327.680 KB         | 4.66x
LLaMA-3.1 405B (GQA)   | 516.096 KB         | 7.28x

As shown in Table 2, the total computational cost for DeepSeek-V3 is approximately 250 GFLOPS per token, whereas the 72B dense model requires 394 GFLOPS and the 405B dense model requires 2448 GFLOPS. This demonstrates that MoE models achieve comparable or even superior performance to dense models while consuming an order of magnitude less computational resources.

2.2.2 Advantages for Personal Use and On-Premises Deployment. In a future where personalized LLM agents [53] become ubiquitous, MoE models offer unique advantages in single-request scenarios. Because only a subset of parameters is activated per request, memory and computational demands are greatly reduced. For example, DeepSeek-V2 (236B parameters) activates just 21B parameters during inference. This enables PCs with AI SoC chips [6, 10, 58] to achieve nearly 20 tokens per second (TPS), or even twice that speed, which is more than sufficient for personal use. In contrast, dense models of similar capability (e.g., 70B parameters) typically reach only single-digit TPS on similar hardware.

Notably, the increasingly popular KTransformers [39] inference engine allows the complete DeepSeek-V3 model to run on a low-cost server equipped with a consumer GPU (costing approximately $10,000), while still achieving nearly 20 TPS.

This efficiency makes MoE architectures suitable for local deployments and single-user scenarios, where hardware resources are often limited. By minimizing memory and computational overhead, MoE models can deliver high-quality inference performance without requiring expensive infrastructure.

2.3 Increasing Inference Speed

2.3.1 Overlapping Computation and Communication: Maximizing Throughput. Inference speed encompasses both system-wide maximum throughput and single-request latency. To maximize throughput, our model is architected from the outset to leverage dual micro-batch overlap [31, 78], intentionally overlapping communication latency with computation. As demonstrated in our online inference system and supported by open-source profiling data [31], we decouple the computation of MLA and MoE into two distinct stages. While one micro-batch executes a portion of MLA or MoE computation, the other micro-batch simultaneously performs the corresponding dispatch communication. Conversely, during the computation phase of the second micro-batch, the first micro-batch undergoes the combine communication step. This pipelined approach enables seamless overlap of all-to-all communication with ongoing computation, ensuring that the GPU remains fully utilized at all times. Moreover, in production, we adopt a prefill and decode disaggregation architecture [80], assigning large-batch-size prefill and latency-sensitive decode requests to different expert parallelism group sizes. This strategy ultimately maximizes system throughput under real-world service conditions.
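The interleaving described above can be visualized with the toy schedule below, where one micro-batch's dispatch/combine communication is always paired against the other micro-batch's MLA or MoE computation. The stage names and the strict four-slot alternation are a simplification for illustration; the production pipeline is built from CUDA streams and DeepEP kernels, not this scheduler.

```python
# Idealized per-layer schedule for dual micro-batch overlap (illustrative only).
# In every slot, one micro-batch computes while the other performs all-to-all
# communication, so dispatch/combine traffic is hidden behind MLA/MoE compute.
SCHEDULE = [
    ("MLA compute",           "dispatch (all-to-all)"),
    ("dispatch (all-to-all)", "MLA compute"),
    ("MoE expert compute",    "combine (all-to-all)"),
    ("combine (all-to-all)",  "MoE expert compute"),
]

print(f"{'micro-batch A':24s} || micro-batch B")
for a_stage, b_stage in SCHEDULE:   # this pattern repeats for every Transformer layer
    print(f"{a_stage:24s} || {b_stage}")
```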

Table 2: Comparison of computational costs for training MoE and dense models: Computational cost per token is measured, assuming a sequence length of 4096.

Model              | Size | Training Cost
DeepSeek-V2 MoE    | 236B | 155 GFLOPS/Token
DeepSeek-V3 MoE    | 671B | 250 GFLOPS/Token
Qwen-72B Dense     | 72B  | 394 GFLOPS/Token
LLaMa-405B Dense   | 405B | 2448 GFLOPS/Token

2.3.2 Inference Speed Limits. This section focuses on the decode output speed of LLM services, typically measured in Time Per Output Token (TPOT). TPOT is a critical metric for user experience, and it also directly impacts the responsiveness of reasoning models such as OpenAI's o1/o3 and DeepSeek-R1, which rely on the inference length to enhance their intelligence.

For MoE models, achieving high inference speed relies on efficiently deploying expert parameters across computing devices. To achieve the fastest possible inference speed, each device should ideally perform computations for a single expert (or multiple devices should collaboratively compute a single expert if necessary). However, Expert Parallelism (EP) requires routing tokens to the appropriate devices, which involves all-to-all communication across the network. As a result, the upper limit of MoE inference speed is dictated by interconnection bandwidth.

Consider a system where each device holds one expert's parameters and processes approximately 32 tokens at a time. This token count strikes a balance between compute-to-memory ratio and communication latency, and it ensures that each device processes an equal batch size during expert parallelism, allowing the communication time to be calculated directly.

For a system interconnected with CX7 400Gbps InfiniBand (IB) NICs, the time required for the two all-to-all communications in EP is calculated as follows:

Comm. Time = (1 Byte + 2 Bytes) × 32 × 9 × 7K / 50 GB/s = 120.96 μs

Here, dispatch uses FP8 (1 byte), while combine uses BF16 (2 bytes), and the hidden size of each token is approximately 7K. The factor 9 indicates that each token is transferred to 8 routed experts and 1 shared expert.

As discussed in Section 2.3.1, maximizing throughput necessitates the use of dual micro-batch overlap. In this strategy, our theoretical best-case analysis assumes that computation overhead is minimized, so the upper bound on performance is determined by communication latency. In practical inference workloads, however, request contexts are often much longer, and MLA computations typically dominate execution time. Thus, this analysis represents an idealized scenario under dual micro-batch overlap. Under this assumption, the total time per layer can be formulated as:

Total Time Per Layer = 2 × 120.96 μs = 241.92 μs

With 61 layers in DeepSeek-V3, the total inference time is:

Total Inference Time = 61 × 241.92 μs = 14.76 ms

Thus, the theoretical upper limit for this system is approximately 14.76 ms TPOT, equivalent to 67 tokens per second. However, in practice, factors such as communication overhead, latency, incomplete bandwidth utilization, and computational inefficiencies reduce this number.

By contrast, if a high-bandwidth interconnect like GB200 NVL72 (900GB/s unidirectional bandwidth across 72 GPUs) were used, the communication time per EP step drops to:

Comm. Time = (1 Byte + 2 Bytes) × 32 × 9 × 7K / 900 GB/s = 6.72 μs

Assuming the computation time equals the communication time, this reduces the total inference time significantly, enabling a theoretical lower bound of roughly 0.82 ms TPOT, or approximately 1,200 tokens per second. While this figure is purely theoretical and has not been empirically validated, it vividly illustrates the transformative potential of high-bandwidth scale-up networks in accelerating large-scale model inference.
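The back-of-the-envelope numbers above are easy to reproduce. The sketch below follows the same assumptions as the text: 32 tokens per device, 9 copies per token (8 routed plus 1 shared expert), a hidden size of 7K taken as 7,000, FP8 dispatch and BF16 combine, 61 layers, compute fully hidden in the IB case, and compute equal to communication in the NVL72 case. It is an idealized model, not a measurement.

```python
def ep_alltoall_time(tokens, hidden, copies, dispatch_bytes, combine_bytes, bw_bytes_per_s):
    """Time for one dispatch + one combine all-to-all, in seconds."""
    payload = (dispatch_bytes + combine_bytes) * tokens * copies * hidden
    return payload / bw_bytes_per_s

def tpot(comm_per_step, layers, compute_per_step=0.0):
    """Per layer, the two overlapped phases are each bounded by max(comm, compute)."""
    return layers * 2 * max(comm_per_step, compute_per_step)

GBPS = 1e9  # bytes per second in one GB/s

comm_ib  = ep_alltoall_time(32, 7_000, 9, 1, 2, 50 * GBPS)    # CX7 400Gbps IB ~ 50 GB/s
comm_nvl = ep_alltoall_time(32, 7_000, 9, 1, 2, 900 * GBPS)   # GB200 NVL72 scale-up

tpot_ib  = tpot(comm_ib,  layers=61)                              # compute fully hidden
tpot_nvl = tpot(comm_nvl, layers=61, compute_per_step=comm_nvl)   # compute == comm assumption

for name, t in [("IB (50 GB/s)", tpot_ib), ("NVL72 (900 GB/s)", tpot_nvl)]:
    print(f"{name:18s} TPOT ~ {t*1e3:6.2f} ms  -> ~{1.0/t:5.0f} tokens/s")
# Prints ~14.76 ms and ~0.82 ms, in line with the ~67 and ~1,200 tokens/s figures above.
```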

While MoE models exhibit good scalability, achieving high inference speeds by increasing hardware resources alone is cost-prohibitive. Therefore, software and algorithms must also contribute to improving inference efficiency.

2.3.3 Multi-Token Prediction. Inspired by Gloeckle et al. [36], DeepSeek-V3 introduces a Multi-Token Prediction (MTP) framework, which simultaneously enhances model performance and improves inference speed. During inference, traditional autoregressive models generate one token per decoding step, leading to sequential bottlenecks. MTP mitigates this issue by enabling the model to generate additional candidate tokens at a lower cost and verify them in parallel, similar to previous self-drafting-based speculative decoding approaches [14, 48]. This framework significantly accelerates inference without compromising accuracy.

As illustrated in the top part of Figure 1, each MTP module uses a single layer, which is much more lightweight than the full model, to predict additional tokens, enabling parallel verification of multiple candidate tokens. Although this slightly reduces throughput, it significantly improves end-to-end generation latency. Real-world practice data demonstrate that an MTP module achieves an acceptance rate of 80% to 90% for predicting the second subsequent token, which increases the generation TPS by 1.8x compared to the scenario without the MTP module.

Moreover, by predicting multiple tokens per step, MTP increases the inference batch size, which is crucial for boosting EP computational intensity and hardware utilization. Such algorithmic innovations are vital for fast and cost-effective inference in DeepSeek-V3.
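The sketch below shows a generic single-draft, draft-and-verify loop in the spirit of MTP-based speculative decoding, with greedy acceptance. The callables full_model_step and mtp_draft are hypothetical stand-ins for the full model and the lightweight MTP head, and the toy usage at the end is purely illustrative; it is not DeepSeek-V3's actual decoding implementation.

```python
def generate(prompt_ids, full_model_step, mtp_draft, max_new_tokens):
    """Draft one extra token with a cheap head, verify it with one full-model pass."""
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        draft = mtp_draft(tokens)                 # cheap guess for the next token
        # One full forward pass scores the next position and, in the same pass,
        # the position after the appended draft token (parallel verification).
        next_tok, after_draft = full_model_step(tokens, draft)
        tokens.append(next_tok)
        if next_tok == draft:                     # draft accepted (80-90% in practice)
            tokens.append(after_draft)            # the second token comes almost for free
    return tokens

# Toy usage: a "model" whose greedy continuation simply counts upward.
full_model = lambda toks, draft: (toks[-1] + 1, draft + 1)
draft_head = lambda toks: toks[-1] + 1            # a perfect draft head in this toy
print(generate([0], full_model, draft_head, max_new_tokens=6))  # [0, 1, 2, 3, 4, 5, 6]
```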

2.3.4 High Inference Speed for Reasoning Models and Test-Time Scaling. Test-time scaling in LLMs, exemplified by OpenAI's o1/o3 series [60, 61], has enabled significant advances in mathematical reasoning, programming, and general reasoning by dynamically adjusting computational resources during inference. Subsequent models—including DeepSeek-R1 [28], Claude-3.7 Sonnet [9], Gemini 2.5 Pro [38], Seed1.5-Thinking [68], and Qwen3 [71]—have adopted similar strategies and achieved notable improvements in these tasks.

2.4 Technique Validation Methodology

Each acceleration technique undergoes rigorous empirical validation to evaluate its accuracy impact, including MLA, FP8 mixed-precision computation, and network co-designed MoE gate routing. Given the prohibitive cost of exhaustive ablation on full-scale models, we adopt a hierarchical and resource-efficient validation pipeline. Each technique is first validated extensively on small-scale models, followed by minimal large-scale tuning, and finally integrated in a single, comprehensive training run.

For instance, we first conducted fine-grained FP8 training ablation studies on both 16B and 230B DeepSeek-V2 models before final integration. Under these controlled settings, the relative accuracy loss compared to BF16 remains below 0.25%, attributable to our use of high-precision accumulation and fine-grained quantization strategies.

3 Low-Precision Driven Design

3.1 FP8 Mixed-Precision Training

Quantization techniques such as GPTQ [32] and AWQ [51] have been widely used to reduce bit-widths to 8-bit, 4-bit, or even lower, significantly reducing memory requirements. However, these techniques are primarily applied during inference to save memory, rather than in the training phase. NVIDIA's Transformer Engine has supported FP8 mixed-precision training for some time, but prior to DeepSeek-V3, there were no open-source large models leveraging FP8 for training. Through deep collaboration between our infrastructure and algorithm teams, and after extensive experimentation and innovation, we developed an FP8-compatible training framework for MoE models. Figure 1 shows the computational components where FP8-precision forward and backward processes are utilized in the training pipeline. Fine-grained quantization is applied, i.e., tile-wise 1x128 quantization for activations and block-wise 128x128 quantization for model weights. Further technical details of our FP8 framework are documented in the DeepSeek-V3 technical report [26], and our fine-grained FP8 GEMM implementation has been open-sourced in DeepGEMM [77].
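The sketch below illustrates what the tile-wise (1x128) and block-wise (128x128) scaling means: each activation tile and each weight block receives its own scaling factor so that its largest magnitude maps to the FP8 E4M3 maximum of 448. It only demonstrates the scaling scheme; the values are kept as scaled floats rather than real FP8 encodings, and it is not DeepGEMM's actual kernel logic.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def quantize_activation_tilewise(x, tile=128):
    """Per-(row, 1x128-tile) scaling for activations. x: (rows, cols), cols % tile == 0."""
    rows, cols = x.shape
    xt = x.reshape(rows, cols // tile, tile)
    scale = np.abs(xt).max(axis=-1, keepdims=True) / FP8_E4M3_MAX   # one scale per tile
    scale = np.where(scale == 0, 1.0, scale)
    q = xt / scale                      # values now fit the FP8 range (actual cast omitted)
    return q.reshape(rows, cols), scale.squeeze(-1)

def quantize_weight_blockwise(w, block=128):
    """Per-128x128-block scaling for weights. w: (m, n), both divisible by block."""
    m, n = w.shape
    wb = w.reshape(m // block, block, n // block, block)
    scale = np.abs(wb).max(axis=(1, 3), keepdims=True) / FP8_E4M3_MAX  # one scale per block
    scale = np.where(scale == 0, 1.0, scale)
    q = wb / scale
    return q.reshape(m, n), scale.squeeze((1, 3))

x = np.random.randn(4, 256).astype(np.float32)
w = np.random.randn(256, 256).astype(np.float32)
xq, xs = quantize_activation_tilewise(x)
wq, ws = quantize_weight_blockwise(w)
print(xq.shape, xs.shape, wq.shape, ws.shape)   # (4, 256) (4, 2) (256, 256) (2, 2)
```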

3.1.1 Limitations: While FP8 has great potential for accelerating training, several hardware limitations need to be addressed to fully exploit its capabilities:

FP8 Accumulation Precision: FP8 uses constrained accumulation precision in Tensor Cores, affecting the stability of training large models, particularly on NVIDIA Hopper GPUs. After aligning 32 mantissa products by right-shifting based on the maximum exponent, the Tensor Core only maintains their highest 13 fraction bits for addition, and truncates bits exceeding this range. Addition results are accumulated to FP22 registers (1 sign bit, 8 exponent bits, and 13 mantissa bits).

Fine-Grained Quantization Challenges: Fine-grained quantization such as tile-wise and block-wise quantization introduces large dequantization overhead in transporting the partial result from Tensor Cores to CUDA Cores for scaling factor multiplication. This incurs frequent data movements, reducing computational efficiency and complicating hardware utilization.

3.1.2 Suggestions: To address the limitations of existing hardware, we have the following suggestions for future designs:

Increased Accumulation Precision: Hardware should improve the accumulation register precision to an appropriate value (e.g., FP32), or support a configurable accumulation precision, enabling a trade-off between performance and accuracy for the different requirements of training and inference in various models.

Native Support for Fine-Grained Quantization: Hardware should natively support fine-grained quantization, enabling Tensor Cores to receive scaling factors and implement matrix multiplication with group scaling. In this way, the whole partial-sum accumulation and dequantization can be completed directly inside Tensor Cores until the final result is produced, avoiding frequent data movements to reduce dequantization overhead. A notable industrial implementation of this approach is NVIDIA Blackwell's support for microscaling data formats [66], which exemplifies the practical benefits of native quantization at scale.

3.2 LogFMT: Communication Compression

In the current DeepSeek-V3 architecture, we employ low-precision compression for network communication. During EP parallelism, tokens are dispatched using fine-grained FP8 quantization, reducing communication volume by 50% compared to BF16. This significantly lowers communication time. While the combine stage still uses higher precision (e.g., BF16) due to accuracy requirements, we are actively testing FP8, custom precision formats (e.g., E5M6) and mixing FP8-BF16 for further reductions.

Besides these traditional floating-point formats, we also tried a new data type, named Logarithmic Floating-Point Format (LogFMT-nBit), where n is the number of bits and the leading bit is the sign bit S. By mapping the activations from the original linear space to the log space, the distribution of the activations becomes more uniform. Specifically, given a tile of elements [x1, ..., xm] (1x128 in our implementation), we take the absolute values, compute the logarithm of all elements, and find the minimum min = log(abs(xi)) and maximum max = log(abs(xj)). The minimum is encoded as S.00...01 and the maximum is encoded as S.11...11, with an interval of Step = (max − min) / (2^(n−1) − 2). Zero values are specially represented as S.00...00. The remaining values are rounded to the nearest level min + (K − 1) × Step for an integer code K. Decoding is simple: combine the sign bit with exp(min + Step × (K − 1)).
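A minimal NumPy sketch of the per-tile encode/decode described above, using the Step formula as reconstructed here and the linear-space rounding discussed next. It works on float arrays and integer codes rather than packed n-bit words, so it illustrates the format only, not the fused communication kernels.

```python
import numpy as np

def logfmt_encode(x, n_bits=8):
    """Encode a tile (e.g. 1x128) into LogFMT-nBit: sign bits, integer codes, and
    the per-tile (log_min, step). Code 0 is reserved for exact zeros."""
    levels = 2 ** (n_bits - 1) - 1                 # number of non-zero codes
    sign = np.signbit(x)
    mag = np.abs(x).astype(np.float64)
    nz = mag > 0
    codes = np.zeros(x.shape, dtype=np.uint16)
    if not nz.any():
        return sign, codes, 0.0, 1.0
    logs = np.log(mag[nz])
    log_max = logs.max()
    log_min = max(logs.min(), log_max - np.log(2.0 ** 32))   # clamp range, roughly E5-like
    step = ((log_max - log_min) / max(levels - 1, 1)) or 1.0  # guard the all-equal case
    # Round in *linear* space: pick the neighbouring level whose decoded value is
    # closer to |x|, which keeps the activation quantization unbiased.
    k = np.clip(np.floor((logs - log_min) / step), 0, levels - 1)
    lo = np.exp(log_min + step * k)
    hi = np.exp(log_min + step * np.minimum(k + 1, levels - 1))
    k = np.where(np.abs(mag[nz] - hi) < np.abs(mag[nz] - lo),
                 np.minimum(k + 1, levels - 1), k)
    codes[nz] = (k + 1).astype(np.uint16)          # shift by one: code 1 encodes log_min
    return sign, codes, log_min, step

def logfmt_decode(sign, codes, log_min, step):
    mag = np.where(codes == 0, 0.0,
                   np.exp(log_min + step * (codes.astype(np.float64) - 1)))
    return np.where(sign, -mag, mag)

x = np.random.default_rng(0).standard_normal(128).astype(np.float32)
s, c, m, st = logfmt_encode(x, n_bits=8)
print(np.max(np.abs(logfmt_decode(s, c, m, st) - x)))   # small quantization error
```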

By locally calculating min and Step, this data type supports a dynamic representation range for different blocks, covering larger ranges or providing more precision than static floating-point formats. Besides, we find it important to round in the original linear space, instead of the log space, for unbiased activation quantization. We also constrain min to be larger than max − log(2^32), which means that the maximum representation range is similar to E5, a floating-point format with 5 exponent bits. We validate LogFMT-nBit on dense language models with around 7 billion parameters, by quantizing the output of the residual branch to simulate the combine stage in MoE models. When setting n = 8, sharing the same bit width as FP8, LogFMT-8Bit shows superior training accuracy compared to E4M3 or E5M2. After increasing n to 10 bits, we find it comparable to the BF16 combine stage.

Figure 2: H800 node interconnection.

3.2.1 Limitations: The initial purpose of using LogFMT is to apply it to activations during transmission or near activation functions, as it offers higher precision than FP8 with the same bit width. However, subsequent computations require reconversion to BF16 or FP8 to accommodate the data types of the Hopper GPU Tensor Cores. Due to insufficient GPU bandwidth for log/exp operations and excessive register pressure during encode/decode, if encode/decode operations are fused with all-to-all communication, the overhead can be substantial (50%~100%). Therefore, although experimental results validate the effectiveness of this format, we ultimately do not employ it.

3.2.2 Suggestions: Providing native support for compression and decompression units tailored to FP8 or custom precision formats represents a viable approach for future hardware. This could help minimize bandwidth requirements and streamline communication pipelines. The reduced communication overhead is particularly helpful in bandwidth-intensive tasks like MoE training.

4 Interconnection Driven Design

4.1 Current Hardware Architecture

The NVIDIA H800 GPU SXM architecture we currently use, illustrated in Figure 2, is built on the Hopper architecture, similar to the H100 GPU. However, it features reduced FP64 computational performance and NVLink bandwidth for regulatory compliance. Specifically, the NVLink bandwidth in H800 SXM nodes is reduced from 900 GB/s to 400 GB/s. This significant reduction in intra-node scale-up bandwidth presents a challenge for high-performance workloads. To compensate, each node is equipped with eight 400G InfiniBand (IB) CX7 NICs, enhancing scale-out capabilities to mitigate the bandwidth deficit.

To address these hardware constraints, the DeepSeek-V3 model incorporates several design considerations that align with the hardware's strengths and limitations.

4.2 Hardware-Aware Parallelism

To align with the constraints of the H800 architecture, the following parallelism strategies were considered to optimize the performance of DeepSeek-V3:

Avoidance of Tensor Parallelism (TP): Tensor Parallelism is avoided during training due to its inefficiency under limited NVLink bandwidth. However, during inference, TP can still be selectively used to reduce latency and improve TPOT performance.

Enhanced Pipeline Parallelism (PP): DualPipe [29] is employed to overlap attention and MoE computation with MoE communication. This also reduces pipeline bubbles and balances memory usage across GPUs, improving overall throughput. Additional details are available in the technical report [26].

Accelerated Expert Parallelism (EP): With eight 400Gbps InfiniBand (IB) NICs, the system achieves all-to-all communication at speeds exceeding 40GB/s. Notably, our all-to-all EP implementation, DeepEP [78], is open-sourced, enabling highly efficient expert parallelism as discussed in the following subsection.

4.3 Model Co-Design: Node-Limited Routing

The bandwidth disparity between scale-up (intra-node) and scale-out (inter-node) communication in the H800 architecture is approximately 4:1. Specifically, NVLink provides 200GB/s bandwidth (of which about 160GB/s can actually be achieved), while each 400Gbps IB NIC delivers only 50GB/s bandwidth (accounting for small message sizes and latency effects, we use 40GB/s as the effective bandwidth). To balance and fully utilize the higher intra-node bandwidth, the model architecture is co-designed with hardware, particularly in the TopK Expert Selection Strategy.

Consider a setup with 8 nodes (64 GPUs in total) and 256 routed experts (4 experts per GPU). For DeepSeek-V3, each token is routed to one shared expert and 8 routed experts. If its 8 target experts are distributed across all 8 nodes, the communication time over IB would be 8t, where t represents the time to send one token over IB. However, by leveraging the higher NVLink bandwidth, tokens routed to the same node can be sent once over IB and then forwarded via NVLink to other intra-node GPUs. This NVLink forwarding enables deduplication of the IB traffic. When the target experts for a given token are distributed across M nodes, the deduplicated IB communication cost is reduced to Mt (M < 8).

Since the IB traffic depends only on M, DeepSeek-V3 introduces Node-Limited Routing for the TopK expert selection strategy. Specifically, we group 256 routed experts into 8 groups, with 32 experts per group, and deploy each group on a single node. On top of this deployment, we algorithmically ensure that each token will be routed to at most 4 nodes. This approach mitigates the bottleneck of IB communication and enhances the effective communication bandwidth during training.
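A simplified sketch of the resulting group-limited top-k selection: experts are grouped by node, only the best MAX_NODES node groups stay eligible, and the top-8 experts are then chosen within them. Scoring each node by its single best expert affinity is an assumption for illustration (the actual gate uses its own scoring and bias terms); the last line also reports the deduplicated IB cost M·t by counting distinct target nodes.

```python
import numpy as np

N_EXPERTS, N_NODES, TOP_K, MAX_NODES = 256, 8, 8, 4
EXPERTS_PER_NODE = N_EXPERTS // N_NODES   # 32 experts (one group) per node

def node_limited_topk(affinity):
    """affinity: (N_EXPERTS,) gating scores for one token.
    Returns (selected expert ids, number of distinct target nodes M)."""
    groups = affinity.reshape(N_NODES, EXPERTS_PER_NODE)
    # Score each node by its best expert (an assumption; the real gate may differ),
    # then keep only the MAX_NODES highest-scoring nodes.
    node_scores = groups.max(axis=1)
    allowed_nodes = np.argsort(node_scores)[-MAX_NODES:]
    masked = np.full_like(affinity, -np.inf)
    for node in allowed_nodes:
        lo = node * EXPERTS_PER_NODE
        masked[lo:lo + EXPERTS_PER_NODE] = affinity[lo:lo + EXPERTS_PER_NODE]
    experts = np.argsort(masked)[-TOP_K:]          # top-8 experts within allowed nodes
    m = len(set(int(e) // EXPERTS_PER_NODE for e in experts))
    return experts, m

affinity = np.random.default_rng(0).random(N_EXPERTS)
experts, m = node_limited_topk(affinity)
print(experts, f"-> deduplicated IB cost {m}t instead of up to {TOP_K}t")   # M <= 4
```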

4.4 Scale-Up and Scale-Out Convergence

4.4.1 Limitations of Current Implementations. While the Node-Limited Routing strategy reduces communication bandwidth requirements, it complicates communication pipeline kernel implementations due to the disparity in bandwidth between intra-node (NVLink) and inter-node (IB) interconnects. In practice, GPU Streaming Multiprocessor (SM) threads are used for both network message handling (e.g., filling QPs and WQEs) and data forwarding over NVLink, consuming computational resources. For example, during training, up to 20 of the SMs on the H800 GPU are allocated for communication-related operations, leaving fewer resources available for actual computation. To maximize throughput in online inference, we perform EP all-to-all communication entirely through NIC RDMA, avoiding SM resource contention and improving compute efficiency. This highlights the advantage of RDMA's asynchronous communication model in overlapping computation and communication.

The following are key tasks currently performed by SMs during EP communication, particularly for the combine stage's reduce operations and data type conversions. Offloading these tasks to dedicated communication hardware could free up SMs for computation kernels, significantly improving overall efficiency:

Forwarding Data: Aggregating IB traffic destined for multiple  GPUs within the same node between the IB and NVLink domains.

Data Transport: Moving data between RDMA buffers (registered GPU memory regions) and input/output buffers.

Reduce Operations: Executing reduce operations required for EP all-to-all combine communications.

Managing Memory Layouts: Handling fine-grained memory layouts for chunked data transfers across the IB and NVLink domains.

Data Type Cast: Converting data type before and after all-to- all communications.

4.4.2 Suggestions: To address these inefficiencies, we strongly recommend that future hardware integrate intra-node (scale-up) and inter-node (scale-out) communication into a unified framework. By incorporating dedicated co-processors for network traffic management and seamless forwarding between the NVLink and IB domains, such designs can reduce software complexity and maximize bandwidth utilization. For example, the node-limited routing strategy employed in DeepSeek-V3 can be further optimized with hardware support for dynamic traffic deduplication.

We also recognize emerging interconnect protocols such as the Ultra Ethernet Consortium (UEC) [17, 18] and Ultra Accelerator Link (UALink) [16], both of which are poised to drive advancements in scale-up and scale-out communication. More recently, Unified Bus (UB) [49] has introduced a novel approach to scale-up and scale-out convergence. Section 6 further explores several technical innovations proposed by UEC and UALink. In this section, however, our primary focus is on achieving scale-up and scale-out convergence at the programming framework level:

(1) Unified Network Adapter: Design NICs (Network Interface Cards) or I/O dies that are connected to unified scale-up and scale-out networks. These adapters should also support basic switch functionality, such as forwarding packets from the scale-out network to specific GPUs within the scale-up network. This could be achieved using a single LID (Local Identifier) or IP address with policy-based routing.

(2) Dedicated Communication Co-Processor: Introduce a dedicated co-processor or programmable component—such as an I/O die—for handling network traffic. This component would offload packet processing from GPU SMs, preventing performance degradation. Besides, it should include hardware-accelerated memory copy capabilities for efficient buffer management.

(3) Flexible Forwarding, Broadcast and Reduce Mechanisms: Hardware should support flexible forwarding, broadcast operations (for EP dispatch), and reduce operations (for EP combine) across scale-up and scale-out networks—mirroring our current GPU SM-based implementation. This would not only improve effective bandwidth but also reduce the computational complexity of network-specific operations.

(4) Hardware Synchronization Primitives: Provide fine-grained hardware synchronization instructions to handle memory consistency issues or out-of-order packet arrivals at the hardware level. This would eliminate the need for software-based synchronization mechanisms like RDMA completion events, which introduce extra latency and increase programming complexity. Memory-semantic communication with an acquire/release mechanism is a promising implementation.

By implementing these recommendations, future hardware de- signs can significantly enhance the efficiency of large-scale dis- tributed AI systems while simplifying software development.

4.5 Bandwidth Contention and Latency

4.5.1 Limitations: In addition, current hardware lacks the flexibility to dynamically allocate bandwidth between different types of traffic on NVLink and PCIe. For example, during inference, transferring KV cache data from CPU memory to the GPU can consume tens of GB/s, saturating PCIe bandwidth. If the GPU simultaneously uses IB for EP communication, this contention between KV cache transfers and EP communication can degrade overall performance and cause latency spikes.

4.5.2 Suggestions:

Dynamic NVLink/PCIe Traffic Prioritization: Hardware  should support dynamic prioritization of traffic based on its type. For example, traffic related to EP, TP, and KV cache transfers  should be assigned different priorities to maximize interconnect  efficiency. For PCIe, exposing the traffic class (TC) to user-level  programming would suffice.

I/O Die Chiplet Integration: Integrating NICs directly into the I/O die and connecting them to the compute die in the same package, rather than through conventional PCIe, would substantially reduce communication latency and alleviate PCIe bandwidth contention.

CPU–GPU Interconnects within the Scale-Up Domain: To further optimize intra-node communication, CPUs and GPUs should be interconnected using NVLink or similar dedicated high-bandwidth fabrics, rather than relying solely on PCIe. Similar to the benefits provided by integrating NICs into the I/O die, this approach can significantly improve scenarios such as offloading parameters or KV cache between GPU and CPU memory during training and inference.

5 Large Scale Network Driven Design

5.1 Network Co-Design: Multi-Plane Fat-Tree

During the training of DeepSeek-V3, we deployed a Multi-Plane Fat-Tree (MPFT) scale-out network, as shown in Figure 3. Each node is equipped with eight GPUs and eight IB NICs, with each GPU–NIC pair assigned to a distinct network plane. Additionally, each node has a 400 Gbps Ethernet RoCE NIC connected to a separate storage network plane for accessing the 3FS [30] distributed file system. In the scale-out network, we used 64-port 400G IB switches, so the topology theoretically supports up to 16,384 GPUs while retaining the cost and latency advantages of a two-layer network. However, due to policy and regulatory constraints, just over two thousand GPUs were ultimately deployed.

Figure 3: Eight-plane two-layer fat-tree scale-out network: each GPU and IB NIC pair belongs to one network plane. Cross-plane traffic must use another NIC and PCIe or NVLink for intra-node forwarding.

Furthermore, due to the current limitations of IB ConnectX-7, our deployed MPFT network does not fully realize the envisioned architecture. Ideally, as depicted in Figure 4, each NIC would feature multiple physical ports, each connected to a separate network plane, yet collectively exposed as a single logical interface to the user through port bonding. From a user perspective, a single Queue Pair (QP) could seamlessly transmit and receive messages across all available ports, akin to packet spraying. As a consequence, packets originating from the same QP may traverse distinct network paths and arrive at the receiver out of order, thereby necessitating native support for out-of-order placement within the NIC to guarantee message consistency and preserve correct ordering semantics. For example, InfiniBand ConnectX-8 natively supports four planes. It would be advantageous for future NICs to fully support advanced multi-plane capabilities, allowing two-tier fat-tree networks to scale effectively to much larger AI clusters. Overall, the multi-plane architecture offers significant advantages in fault isolation, robustness, load balancing, and large-scale system scalability.

5.1.1 Advantages of Multi-Plane Fat-Tree Network.

Subset of Multi-Rail Fat-Tree (MRFT): The MPFT topology constitutes a specific subset of the broader MRFT architecture. As a result, existing optimizations developed by NVIDIA and NCCL for multi-rail networks can be seamlessly leveraged within multi-plane network deployments. Furthermore, NCCL's support for PXN [54] technology addresses the inherent challenge of inter-plane isolation, enabling efficient communication even when direct interconnectivity between planes is absent.

Cost Efficiency: As shown in Table 3, the multi-plane network enables over 10k endpoints using a two-layer fat-tree (FT2) topology, significantly reducing network costs compared to a three-layer fat tree (FT3). The cost per endpoint is even slightly more competitive than the cost-efficient Slim Fly (SF) topology [12].

Traffic Isolation: Each plane operates independently, ensuring that congestion in one plane does not affect others. This isolation improves overall network stability and prevents cascading performance degradation.

Figure 4: Ideal Multi-Plane Network: Each NIC is equipped with multiple physical ports, each connected to a distinct network plane. A single queue pair (QP) can simultaneously utilize all available ports for transmitting and receiving packets, which necessitates native support for out-of-order placement within the NIC.

Table 3: Network topology comparison. Cost estimates are derived from the methodology in the Slim Fly (SF) paper [12]. DF denotes the canonical dragonfly topology [22,46,65].

Metric             | FT2   | MPFT   | FT3     | SF     | DF
Endpoints          | 2,048 | 16,384 | 65,536  | 32,928 | 261,632
Switches           | 96    | 768    | 5,120   | 1,568  | 16,352
Links              | 2,048 | 16,384 | 131,072 | 32,928 | 384,272
Cost [M$]          | 9     | 72     | 491     | 146    | 1,522
Cost/Endpoint [k$] | 4.39  | 4.39   | 7.5     | 4.4    | 5.8

Latency Reduction: The two-layer topology achieves lower latency than three-layer fat trees, as demonstrated in our experiments. This makes it particularly suitable for latency-sensitive applications such as MoE-based training and inference.

Robustness: As shown in Figure 4, multi-port NICs provide multiple uplinks, so single-port failures do not disrupt connectivity and rapid, transparent fault recovery is possible.

It is important to note that, due to current 400G NDR InfiniBand limitations, cross-plane communication requires intra-node forwarding, which introduces additional latency during inference. If future hardware can achieve scale-up and scale-out network convergence as discussed earlier, this latency can be significantly reduced, further enhancing the viability of multi-plane networks.

5.1.2 Performance Analysis. To verify the effectiveness of the multi-plane network design, we conducted real-world experiments on our cluster, modifying the cluster's network topology to compare the performance of the Multi-Plane Two-Layer Fat-Tree (MPFT) and the Single-Plane Multi-Rail Fat-Tree (MRFT). Below are the key findings from our experiments:

1. All-to-All Communication and EP Scenarios: As illustrated in Figure 5, the all-to-all performance of the multi-plane network is very similar to that of the single-plane multi-rail network. This performance parity can be attributed to NCCL's PXN [54] mechanism, which optimizes traffic forwarding via NVLink in multi-rail topologies; the multi-plane topology also benefits from this mechanism. As shown in Figure 6, all-to-all communication tests conducted on 16 GPUs reveal negligible differences in latency between the MPFT and MRFT topologies.

Figure 5: NCCL all-to-all performance from 32 to 128 GPUs for MRFT and MPFT networks.

To evaluate MPFT's all-to-all performance in practical training scenarios, we tested the EP communication patterns commonly used during training. As shown in Figure 7, each GPU achieves a high bandwidth exceeding 40GB/s in the multi-plane network, providing reliable performance that meets the demands of training.

2. Training Throughput for the DeepSeek-V3 Model: We also compare the training metrics of the DeepSeek-V3 model between MPFT and MRFT in Table 4. MFU (Model Flops Utilization) is calculated based on BF16 peak performance. Causal MFU only takes into account the flops of the lower triangle of the attention matrix (in line with FlashAttention [19, 20]), while non-causal MFU includes the flops of the whole attention matrix (in line with Megatron [47]). 1F, 1B, and 1W denote forward time, input backward time, and weight backward time, respectively. When training the V3 model on 2048 GPUs, the performance of MPFT is nearly identical to that of MRFT, with observed differences falling within normal fluctuations and measurement error.

5.2 Low Latency Networks

In our model inference, large-scale EP relies heavily on all-to-all communication, which is highly sensitive to both bandwidth and latency. In the typical scenario discussed in Section 2.3.2, with a network bandwidth of 50GB/s, the data transfer should ideally take approximately 120 μs. Therefore, intrinsic network latencies on the order of microseconds can critically impact system performance, making their effects non-negligible.

Figure 6: Latency comparison between MPFT and MRFT networks in NCCL all-to-all tests under different message sizes, showing that their performance is nearly identical.

Figure 7: DeepEP performance on MPFT: the EP dispatch and combine kernels communicate across 16 to 128 GPUs using all-to-all. Each GPU processes 4096 tokens. The observed throughput nearly saturates the 400Gbps NIC bandwidth.

5.2.1 IB or RoCE. As shown in Table 5, IB consistently achieves lower latency, making it the preferred choice for latency-sensitive workloads such as distributed training and inference. Although IB has superior latency performance compared to RDMA over Converged Ethernet (RoCE), it comes with certain limitations:

Cost: IB hardware is significantly more expensive than RoCE solutions, which limits its widespread adoption.

Scalability: IB switches typically support only 64 ports per switch, compared to the 128 ports commonly found in RoCE switches. This restricts the scalability of IB-based clusters, particularly for large-scale deployments.

5.2.2 Recommendations for RoCE Improvements. While RoCE has the potential to be a cost-effective alternative to IB, its current limitations in latency and scalability prevent it from fully meeting the demands of large-scale AI systems. Below, we outline specific recommendations for improving RoCE:

(1) Specialized Low-Latency RoCE Switches: We recommend that Ethernet vendors develop RoCE switches specifically optimized for RDMA workloads by removing unnecessary Ethernet features. The Slingshot architecture [22] exemplifies how Ethernet-based designs can achieve latency performance comparable to IB. Similarly, recent innovations from Broadcom [13], including the AI Forwarding Header (AIFH) and upcoming low-latency Ethernet switches, demonstrate the feasibility of high-performance Ethernet fabrics tailored for AI. We look forward to continued innovation in this direction.

Table 4: Training metric comparison between MPFT and MRFT networks.

Metric              | MPFT   | MRFT
tokens/day (B)      | 272.80 | 272.52
1F (s)              | 1.13   | 1.13
bubble (s)          | 2.06   | 2.03
1B (s)              | 1.99   | 1.99
1W (s)              | 0.48   | 0.48
1F1B (s)            | 13.95  | 14.00
opt (s)             | 0.29   | 0.31
TFLOPS (non-causal) | 432    | 432
TFLOPS (causal)     | 385    | 385
MFU (non-causal)    | 43.73% | 43.68%
MFU (causal)        | 38.94% | 38.90%

Table 5: CPU side end-to-end latency comparison between IB, RoCE, and intra-node NVLink for 64B data transmission.

Link Layer | Same Leaf | Cross Leaf
RoCE       | 3.6us     | 5.6us
InfiniBand | 2.8us     | 3.7us
NVLink     | 3.33us    | -

(2) Optimized Route Policy: As shown in Figure 8, the default Equal-Cost Multi-Path (ECMP) routing policy in RoCE struggles to distribute traffic efficiently across interconnects, leading to severe congestion and performance degradation in NCCL collective communication tests. LLM training traffic, such as in DP (Data Parallelism), tends to lack randomness, causing multiple flows to converge on the same interconnect link. In contrast, Adaptive Routing (AR) [34] can significantly enhance network performance by dynamically spraying packets across multiple paths. While static routing, based on manually configured route tables, can avoid link conflicts for specific destinations, it lacks flexibility. For large-scale all-to-all communication, adaptive routing offers superior performance and scalability.

(3) Improved Traffic Isolation or Congestion Control Mechanisms: Current RoCE switches support only a limited number of priority queues, which are insufficient for complex AI workloads involving concurrent communication patterns such as EP's all-to-all and DP's all-reduce. In such mixed workloads, all-to-all traffic can cause incast congestion due to bursty many-to-one transfers, potentially degrading overall network performance. To limit incast's influence on other traffic, one approach is to adopt virtual output queuing (VOQ), assigning a dedicated virtual queue to each QP to isolate traffic flows. Alternatively, more effective congestion control (CC) mechanisms such as RTT-based CC (RTTCC) or user-programmable CC (PCC) can be employed, enabling NIC-switch co-optimization to maintain low latency and high throughput under dynamic traffic conditions.

5.2.3 InfiniBand GPUDirect Async (IBGDA). We utilize IBGDA [2, 57] to reduce latency in network communications. Traditionally, network communication involves the creation of a CPU proxy thread: once the GPU has prepared the data, it must notify the CPU proxy, which then populates the control information for the work request (WR) and signals the NIC via a doorbell mechanism to initiate data transmission. This process introduces additional communication overhead.

Figure 8: RoCE network bandwidth of AllGather and ReduceScatter communication primitives under different routing methods (ECMP, AR, Static Routing) and TP dimensions.

IBGDA addresses this issue by allowing the GPU to directly fill the WR content and write to the RDMA doorbell MMIO address. By managing the entire control plane within the GPU, IBGDA eliminates the significant latency overhead associated with GPU-CPU communication. Moreover, when sending a large number of small packets, the control plane processor can easily become a bottleneck. Since GPUs have multiple parallel threads, the sender can leverage these threads to distribute the workload, thereby avoiding such bottlenecks. A range of works, including our DeepEP [78], have leveraged IBGDA and reported substantial performance gains [1, 15, 79]. We therefore advocate for such capabilities to be widely supported across accelerator devices.

6 Discussion and Insights for Future Hardware Architecture Design

Building on the previous sections, we summarize key architectural insights and outline future directions for hardware design tailored to large-scale AI workloads.

Section 2.3.2 highlighted the importance of large-scale scale-up networks for accelerating model inference. Section 3 discussed the necessity of efficient support for low-precision computation and communication. Section 4 explored the convergence of scale-up and scale-out architectures, along with several proposed enhancements. Section 5 focused on multi-plane network topologies and identified key improvements needed for Ethernet-based interconnects.

Together, these sections identify hardware limitations in concrete application contexts and offer corresponding suggestions. Building on that foundation, this section expands the discussion to broader considerations and proposes forward-looking directions for future hardware architecture design.

6.1 Robustness Challenges

6.1.1 Limitations:

Interconnect Failures: High-performance interconnects (e.g., IB and NVLink) are prone to intermittent disconnections, which can disrupt node-to-node communication. This is especially harmful in communication-heavy workloads like EP, where even brief interruptions may lead to significant performance drops or job failures.

Single Hardware Failures: Node crashes, GPU failures, or ECC (Error-Correcting Code) memory errors can compromise long-running training jobs, often requiring costly restarts. The impact of such failures escalates in large-scale deployments, where the probability of a single-point failure increases proportionally with system size.

Silent Data Corruption: Errors undetected by ECC mechanisms, such as multi-bit memory flips or computational inaccuracies, pose a significant risk to model quality. These errors are particularly insidious in long-running tasks, as they can propagate undetected and corrupt downstream computations. Current mitigation strategies rely on application-level heuristics, which are insufficient for ensuring system-wide robustness.

6.1.2 Suggestions for Advanced Error Detection and Correction. To mitigate risks associated with silent corruption, hardware must incorporate advanced error detection mechanisms beyond traditional ECC. Techniques such as checksum-based validation or hardware-accelerated redundancy checks can provide higher reliability for large-scale deployments.
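As a concrete software-level analogue of checksum-based validation, the sketch below digests a transferred buffer in fixed-size chunks and flags any chunk whose checksum no longer matches the sender's. CRC32 and the 1 MiB chunk size are illustrative assumptions, not a description of any particular hardware scheme.

```python
# Minimal software analogue of checksum-based validation for detecting silent
# data corruption in transferred buffers. CRC32 and the 1 MiB chunk size are
# illustrative choices; real hardware would compute such digests inline.
import zlib

CHUNK_BYTES = 1 << 20  # 1 MiB (assumed granularity)

def chunk_checksums(buf: bytes) -> list[int]:
    return [zlib.crc32(buf[i:i + CHUNK_BYTES]) for i in range(0, len(buf), CHUNK_BYTES)]

def validate(received: bytes, sender_digests: list[int]) -> list[int]:
    """Return indices of chunks whose checksum no longer matches the sender's."""
    return [i for i, (crc, expected) in
            enumerate(zip(chunk_checksums(received), sender_digests))
            if crc != expected]

# Example: flip one bit "in transit" and observe that only that chunk is flagged.
payload = bytes(8 * CHUNK_BYTES)
digests = chunk_checksums(payload)
corrupted = bytearray(payload)
corrupted[3 * CHUNK_BYTES + 17] ^= 0x01
print(validate(bytes(corrupted), digests))  # -> [3]
```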

Furthermore, hardware vendors should deliver comprehensive diagnostic toolkits to end users, empowering them to rigorously verify the integrity of their systems and proactively identify any latent silent data corruption. Such toolkits, when embedded as part of the standard hardware package, foster transparency and enable continuous validation throughout the operational lifecycle, thereby bolstering overall system trustworthiness.

6.2 CPU Bottlenecks and Interconnects

While accelerator design often takes center stage, CPUs remain essential for coordinating computation, managing I/O, and sustaining system throughput. However, current architectures face several critical bottlenecks:

First, as discussed in Section 4.5, the PCIe interface between CPUs and GPUs often becomes a bandwidth bottleneck, particularly during large-scale parameter, gradient, or KV cache transfers. To mitigate this, future systems should adopt direct CPU-GPU interconnects, such as NVLink or Infinity Fabric, or integrate both CPUs and GPUs into the scale-up domain, thereby eliminating intra-node bottlenecks.

In addition to PCIe limitations, sustaining such high data transfer rates also requires exceptionally high memory bandwidth. For example, saturating 160 lanes of PCIe 5.0 demands over 640 GB/s per node, translating to a memory bandwidth requirement of approximately 1 TB/s per node, posing a significant challenge for conventional DRAM architectures.
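As a rough sanity check on these figures (back-of-the-envelope only, ignoring protocol overheads beyond line coding):

\[
160 \ \text{lanes} \times \frac{32\,\text{GT/s per lane}}{8\,\text{bit/byte}} \approx 160 \times 4\,\text{GB/s} = 640\,\text{GB/s (unidirectional)},
\]

with 128b/130b line coding leaving roughly \(630\,\text{GB/s}\) usable. Once this I/O stream is staged through host memory while the CPU also serves its own compute and caching traffic, the aggregate DRAM bandwidth demand plausibly approaches \(1\,\text{TB/s}\) per node.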

Lastly, latency-sensitive tasks such as kernel launches and network processing demand high single-core CPU performance, typically requiring base frequencies above 4 GHz. Furthermore, modern AI workloads require sufficient CPU cores per GPU to prevent control-side bottlenecks. For chiplet-based architectures, additional cores are needed to support cache-aware workload partitioning and isolation.

6.3 Toward Intelligent Networks for AI

To meet the demands of latency-sensitive workloads, future interconnects must deliver low latency and build intelligence into the network itself:

Co-Packaged Optics: Incorporating silicon photonics enables higher bandwidth scalability and improved energy efficiency, both of which are critical for large-scale distributed systems.

Lossless Network: Credit-Based Flow Control (CBFC) mechanisms ensure lossless data transmission, yet naively triggering flow control can induce severe head-of-line blocking. Therefore, it is imperative to deploy advanced, endpoint-driven congestion control (CC) algorithms that proactively regulate injection rates and avert pathological congestion scenarios.
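For intuition, a minimal credit-based flow-control loop is sketched below: the sender injects a packet only while it holds a credit, and the receiver returns a credit each time it drains its buffer, so no packet is ever dropped. Buffer depth and drain rate are arbitrary; real CBFC operates per virtual channel at the link layer.

```python
# Minimal credit-based flow control model: the sender may only inject a packet
# when it holds a credit; the receiver returns one credit per drained packet.
# Buffer depth and drain rate are arbitrary illustrative values.
from collections import deque

RX_BUFFER_SLOTS = 4          # credits initially granted to the sender

def run(num_packets: int, drain_per_step: int = 1) -> int:
    credits, rx_buffer, sent, steps = RX_BUFFER_SLOTS, deque(), 0, 0
    while sent < num_packets or rx_buffer:
        # Sender side: inject as long as credits remain (no packet is ever dropped).
        while credits > 0 and sent < num_packets:
            credits -= 1
            rx_buffer.append(sent)
            sent += 1
        # Receiver side: drain packets and return one credit per drained packet.
        for _ in range(min(drain_per_step, len(rx_buffer))):
            rx_buffer.popleft()
            credits += 1
        steps += 1
    return steps

print("steps to deliver 20 packets losslessly:", run(20))
```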

Adaptive Routing: As underscored in Section 5.2.2, future networks should standardize the adoption of dynamic routing schemes, such as packet spraying and congestion-aware path selection, that continuously monitor real-time network conditions and intelligently redistribute traffic. These adaptive strategies are particularly effective in alleviating hotspots and mitigating bottlenecks during collective communication workloads, including all-to-all and reduce-scatter operations.

Efficient Fault-Tolerant Protocols: Robustness against failures can be significantly enhanced through the deployment of self-healing protocols, redundant ports, and rapid failover techniques. For instance, link-layer retry mechanisms and selective retransmission protocols prove indispensable in scaling reliability across large networks, minimizing downtime and ensuring seamless operation despite intermittent failures.
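The value of selective retransmission over replaying an entire window can be seen in a toy count such as the one below; the sequence numbers and loss pattern are made up.

```python
# Toy comparison of retransmission strategies after packet loss (illustrative only).
# Selective retransmission resends only the missing sequence numbers, whereas a
# go-back-N style link replays everything from the first gap onward.
def selective_retransmit(sent: range, received: set[int]) -> list[int]:
    return [seq for seq in sent if seq not in received]

def go_back_n_retransmit(sent: range, received: set[int]) -> list[int]:
    first_gap = next(seq for seq in sent if seq not in received)
    return list(range(first_gap, sent.stop))

sent = range(0, 100)
received = set(sent) - {7, 42, 43}                 # three packets lost in transit
print(len(selective_retransmit(sent, received)))   # 3 packets resent
print(len(go_back_n_retransmit(sent, received)))   # 93 packets resent
```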

Dynamic Resource Management: To handle mixed workloads  effectively, future hardware should enable dynamic bandwidth  allocation and traffic prioritization. For example, inference tasks  should be isolated from training traffic in unified clusters, ensur- ing responsiveness for latency-sensitive applications.

6.4 Discussion on Memory-Semantic Communication and Ordering Issue

Inter-node communication using load/store memory semantics is efficient and programmer-friendly, but current implementations are hampered by memory ordering challenges. For example, after writing data, the sender must issue an explicit memory barrier (fence) before updating a flag to notify the receiver, ensuring data consistency. This strict ordering introduces additional round-trip time (RTT) latency and can stall the issuing thread, impeding in-flight stores and reducing throughput. Similar out-of-order synchronization issues arise in message-semantic RDMA; for instance, performing RDMA atomic add operations with packet spraying after regular RDMA writes on InfiniBand or NVIDIA BlueField-3 can incur additional RTT latency.
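The fence-then-flag pattern described above can be sketched as a producer/consumer pair. Note that Python's runtime hides real memory-ordering hazards, so the barrier in step (2) appears only as a comment marking where a hardware fence would be required.

```python
# Producer/consumer sketch of the "write data, fence, then set flag" pattern.
# Python's GIL hides real memory-ordering hazards, so the fence here is only a
# stand-in comment for the hardware barrier a sender must actually issue.
import threading

data = [0] * 1024
flag = threading.Event()

def sender():
    for i in range(len(data)):
        data[i] = i                   # (1) write the payload
    # (2) on real hardware, a store fence goes here so the flag update below
    #     cannot become visible before the payload stores complete.
    flag.set()                        # (3) notify the receiver

def receiver():
    flag.wait()                       # blocks until the flag is visible
    assert data[-1] == len(data) - 1  # safe only because of the ordering above

t1, t2 = threading.Thread(target=sender), threading.Thread(target=receiver)
t1.start(); t2.start(); t1.join(); t2.join()
print("payload observed consistently after flag")
```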

To address these, we advocate for hardware support that offers built-in ordering guarantees for memory-semantic communication. Such consistency should be enforced both at the programming level  (e.g., via acquire/release semantics) and by hardware at the receiver, enabling in-order delivery without added overhead.

Several approaches are possible. For instance, the receiver could buffer atomic messages and use packet sequence numbers for in-order processing. However, an acquire/release mechanism is both more elegant and efficient. We suggest a simple conceptual mechanism, Region Acquire/Release (RAR), wherein receiver hardware maintains a bitmap to track the state of the RNR memory region, and acquire/release operations are scoped to the RAR address range. With minimal bitmap overhead, this enables efficient, hardware-enforced ordering, eliminating explicit sender-side fences and delegating ordering to hardware, ideally on the NIC or I/O die. Importantly, the RAR mechanism benefits not only memory-semantic operations but also message-semantic RDMA primitives, thus broadening its practical applicability.
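One way to reason about this proposal is a small software model of a RAR-style receiver, given below as our own illustrative interpretation rather than a hardware specification: a per-region bitmap records which slots have arrived, and the release is surfaced to the consumer only once every slot in the scoped range is present, so the sender needs no explicit fence.

```python
# Minimal software model of a RAR-style receiver (illustrative interpretation,
# not a hardware specification): a bitmap tracks which slots of a registered
# region have arrived; the release notification is delivered only once every
# slot in the scoped range is present, so the sender needs no explicit fence.
class RarRegion:
    def __init__(self, num_slots: int):
        self.arrived = [False] * num_slots   # the per-region bitmap
        self.data = [None] * num_slots
        self.pending_release = None          # (start, end) scope awaiting completion

    def on_store(self, slot: int, value) -> bool:
        """Called per incoming (possibly out-of-order) store packet."""
        self.data[slot] = value
        self.arrived[slot] = True
        return self._maybe_complete_release()

    def on_release(self, start: int, end: int) -> bool:
        """Sender's release, scoped to [start, end); may arrive before all stores."""
        self.pending_release = (start, end)
        return self._maybe_complete_release()

    def _maybe_complete_release(self) -> bool:
        if self.pending_release is None:
            return False
        start, end = self.pending_release
        if all(self.arrived[start:end]):
            self.pending_release = None
            return True                      # notify the consumer: region is consistent
        return False

region = RarRegion(num_slots=8)
region.on_release(0, 4)                                      # release raced ahead of the data
ready = [region.on_store(s, s * 10) for s in (2, 0, 3, 1)]   # out-of-order arrival
print(ready)                                                 # [False, False, False, True]
```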

6.5 In-Network Computation and Compression

EP involves two critical all-to-all stages, dispatch and combine, which present significant opportunities for in-network optimization. The dispatch stage resembles a small-scale multicast operation, where a single message must be forwarded to multiple target devices. A hardware-level protocol enabling automatic packet replication and forwarding to multiple destinations could drastically reduce communication overhead and improve efficiency.

The combine stage, acting as a small-scale reduction operation, could benefit from in-network aggregation techniques. However, due to the small reduction scope and imbalanced workload in EP combine, implementing in-network aggregation in a flexible manner is challenging.
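To make the potential savings concrete, a rough counting argument follows (the per-GPU token count and fan-out are invented): with sender-side replication, each token routed to several experts behind the same remote switch crosses the inter-switch link once per expert, whereas switch-level replication or aggregation crosses it only once.

```python
# Rough counting argument for in-network replication/aggregation in EP
# (token count and fan-out are invented for illustration). We count how many
# times a token's payload crosses the link toward one remote leaf switch.
TOKENS_PER_GPU = 4096
EXPERTS_BEHIND_SWITCH = 4   # selected experts that happen to sit behind the same remote switch (assumed)

# Dispatch: without hardware multicast, the sender pushes one copy per expert.
crossings_baseline = TOKENS_PER_GPU * EXPERTS_BEHIND_SWITCH
# With switch-level replication, one copy crosses the link and is fanned out locally.
crossings_multicast = TOKENS_PER_GPU * 1

# Combine is the mirror image: in-network aggregation forwards one partial sum
# toward the token's home GPU instead of one message per contributing expert.
crossings_combine_baseline = TOKENS_PER_GPU * EXPERTS_BEHIND_SWITCH
crossings_combine_aggregated = TOKENS_PER_GPU * 1

print("dispatch traffic reduction: %.1fx" % (crossings_baseline / crossings_multicast))
print("combine traffic reduction:  %.1fx" % (crossings_combine_baseline / crossings_combine_aggregated))
```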

Moreover, as highlighted in Section 3.2, LogFMT enables low-precision token transmission with minimal impact on model performance. Incorporating LogFMT natively within network hardware could further optimize communication by increasing entropy density and reducing bandwidth usage. Hardware-accelerated compression and decompression would allow seamless integration of LogFMT into distributed systems, enhancing overall throughput.

6.6 Memory-Centric Innovations

6.6.1 Limitations of Memory Bandwidth. The exponential growth in model sizes has outpaced advancements in high-bandwidth memory (HBM) technology. This disparity creates a memory bottleneck, particularly in attention-heavy architectures like Transformers.

6.6.2 Suggestions:

DRAM-Stacked Accelerators: Leveraging advanced 3D stacking technologies, DRAM dies can be vertically integrated atop a logic die, thereby enabling exceptionally high memory bandwidth, ultra-low latency, and a practical memory capacity (though stack-limited). This architectural paradigm proves remarkably advantageous for ultra-fast inference in MoE models, where memory throughput is a critical bottleneck. Architectures such as SeDRAM [72] exemplify the potential of this approach, delivering unprecedented performance for memory-bound workloads.

System-on-Wafer (SoW): Wafer-scale integration [50] can maximize computational density and memory bandwidth, addressing the needs of ultra-large-scale models.

7 Conclusion

DeepSeek-V3 exemplifies the transformative potential of hardware-software co-design in advancing the scalability, efficiency, and robustness of large-scale AI systems. By addressing the limitations of current hardware architectures and proposing actionable recommendations, this paper provides a roadmap for the next generation of AI-optimized hardware. These innovations will be critical as AI workloads continue to grow in complexity and scale, driving the future of intelligent systems.

 

References

[1] Elena Agostini, Davide Rossetti, and Sreeram Potluri. 2017. Offloading Commu- nication Control Logic in GPU Accelerated Applications. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). 248–257. https://doi.org/10.1109/CCGRID.2017.29

[2] E. Agostini, D. Rossetti, and S. Potluri. 2018. GPUDirect Async: Exploring GPU synchronous communication techniques for InfiniBand clusters. J. Parallel and Distrib. Comput. 114 (2018), 28–45.  https://doi.org/10.1016/j.jpdc.2017.12.007

[3] AI@Meta. 2024. Llama 3 Model Card.  https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

[4] AI@Meta. 2024. Llama 3.1 Model Card.  https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md

[5]Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico  Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Trans- former Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245  (2023).

[6] AMD. 2025. AMD Ryzen AI Max+ PRO 395: Designed to power a new generation of compact Copilot+ PC workstations.  https://www.amd.com/en/products/processors/laptop/ryzen-pro/ai-max-pro-300-series/amd-ryzen-ai-max-plus-pro-395.html

[7] Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Wenjun Gao, Kang Guan, Jianzhong Guo, Yongqiang Guo, Zhe Fu, Ying He, Panpan Huang, Jiashi Li, Wenfeng Liang, Xiaodong Liu, Xin Liu, Yiyuan Liu, Yuxuan Liu, Shanghao Lu, Xuan Lu, Xiaotao Nie, Tian Pei, Junjie Qiu, Hui Qu, Zehui Ren, Zhangli Sha, Xuecheng Su, Xiaowen Sun, Yixuan Tan, Minghui Tang, Shiyu Wang, Yaohui Wang, Yongji Wang, Ziwei Xie, Yiliang Xiong, Yanhong Xu, Shengfeng Ye, Shuiping Yu, Yukun Zha, Liyue Zhang, Haowei Zhang, Mingchuan Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, and Yuheng Zou. 2024. Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–23.  https://doi.org/10.1109/SC41406.2024.00089

[8] Anthropic. 2024. Claude 3.5 Sonnet.  https://www.anthropic.com/news/claude-3-5-sonnet

[9] Anthropic. 2025. Claude 3.7 Sonnet and Claude Code.  https://www.anthropic.com/news/claude-3-7-sonnet

[10] Apple. 2024. Apple introduces M4 Pro and M4 Max.  https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/

[11] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long- Document Transformer. arXiv:2004.05150 (2020).

[12] Nils Blach, Maciej Besta, Daniele De Sensi, Jens Domke, Hussein Harake, Shigang Li, Patrick Iff, Marek Konieczny, Kartik Lakhotia, Ales Kubicek, Marcel Ferrari, Fabrizio Petrini, and Torsten Hoefler. 2025. A high-performance design, implementation, deployment, and evaluation of the slim fly network. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (Santa Clara, CA, USA) (NSDI '24). USENIX Association, USA, Article 57, 20 pages.

[13] Broadcom. 2025. Scale Up Ethernet Framework.  https://docs.broadcom.com/ doc/scale-up-ethernet-framework

[14] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming  Chen, and Tri Dao. 2024. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.  https://openreview.net/forum?id=PEpbUobfJv

[15] Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, and Yongwei Wu. 2025. Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv:2405.01814 [cs.LG]  https://arxiv.org/abs/2405.01814

[16] ULTRA ACCELERATOR LINK CONSORTIUM. 2025. Introducing UALink 200G 1.0 Specification.  https://ualinkconsortium.org/wp-content/uploads/2025/04/UALink-1.0-White_Paper_FINAL.pdf

[17] Ultra Ethernet Consortium. 2023. Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification.  https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf

[18] Ultra Ethernet Consortium. 2024.  UEC Progresses Towards v1.0 Set of Spec- ifications.https://ultraethernet.org/uec-progresses-towards-v1-0-set-of- specifications/

[19] Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.

[20] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems.

[21] Tri Dao and Albert Gu. 2024. Transformers are SSMs: generalized models and  efficient algorithms through structured state space duality. In Proceedings of the  41st International Conference on Machine Learning (Vienna, Austria) (ICML’24). JMLR.org, Article 399, 31 pages.

[22] Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, and Torsten Hoefler. 2020. An In-Depth Analysis of the Slingshot Interconnect. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.  https://doi.org/10.1109/SC41405.2020.00039

[23] DeepSeek-AI. 2024. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence.   CoRR abs/2406.11931 (2024).    https://doi.org/ 10.48550/arXiv.2406.11931

[24] DeepSeek-AI. 2024.   DeepSeek LLM: Scaling Open-Source Language Models with Longtermism.   CoRR abs/2401.02954  (2024).   https://doi.org/10.48550/ arXiv.2401.02954

[25] DeepSeek-AI. 2024. DeepSeek-V2: A Strong,Economical, and Efficient Mixture-of- Experts Language Model.  CoRR abs/2405.04434 (2024).  https://doi.org/10.48550/ arXiv.2405.04434

[26] DeepSeek-AI.    2024.             DeepSeek-V3    Technical    Report.             (2024). arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

[27] DeepSeek-AI. 2024.   DeepSeekMoE: Towards Ultimate Expert Specialization  in Mixture-of-Experts Language Models.  CoRR abs/2401.06066 (2024).   https: //doi.org/10.48550/arXiv.2401.06066

[28] DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.   arXiv:2501.12948 [cs.CL] https://arxiv.org/abs/ 2501.12948

[29] DeepSeek-AI. 2025. DualPipe: A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.  https://github.com/deepseek-ai/dualpipe

[30] DeepSeek-AI. 2025. Fire-Flyer File System.  https://github.com/deepseek-ai/3FS

[31] DeepSeek-AI. 2025.  Profiling Data in DeepSeek Infra.    https://github.com/ deepseek-ai/profile-data?tab=readme-ov-file#inference

[32] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022.  Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022).

[33] Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, and Hongyi Zeng. 2024. RDMA over Ethernet for Distributed Training at Meta Scale. In Proceedings of the ACM SIGCOMM 2024 Conference (Sydney, NSW, Australia) (ACM SIGCOMM '24). Association for Computing Machinery, New York, NY, USA, 57–70.  https://doi.org/10.1145/3651890.3672233

[34] Patrick Geoffray and Torsten Hoefler. 2008. Adaptive Routing Strategies for Modern High Performance Networks. In 2008 16th IEEE Symposium on High Performance Interconnects. 165–172.  https://doi.org/10.1109/HOTI.2008.21

[35] Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, and Kurt Keutzer. 2024.  AI and Memory Wall .  IEEE Micro 44, 03 (May 2024), 33–39.  https://doi.org/10.1109/MM.2024.3373763

[36] Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. 2024. Better & Faster Large Language Models via Multi-token Prediction. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.  https://openreview.net/forum?id=pEWAcejiU2

[37] Google.  2024.     Introducing  Gemini  2.0:  our  new  AI  model  for  the  agen- tic era.   https://blog.google/technology/google-deepmind/google-gemini-ai- update-december-2024

[38] Google. 2025. Gemini 2.5: Our most intelligent AI model.  https://blog.google/ technology/google-deepmind/gemini-model-thinking-updates-march-2025/

[39] MADSys group and Approaching.AI. 2025. A Flexible Framework for Experienc- ing Cutting-edge LLM Inference Optimizations.  https://github.com/kvcache- ai/ktransformers

[40] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. arXiv preprint arXiv:2401.18079 (2024).

[41] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).

[42] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, and Xin Liu. 2024. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. arXiv:2402.15627 [cs].  http://arxiv.org/abs/2402.15627

[43] Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Clifford Young, Xiang Zhou, Zongwei Zhou, and David A Patterson. 2023. TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA ’23). Association for Computing Machinery, New York, NY, USA, Article 82, 14 pages.  https://doi.org/10.1145/3579371.3589350

[44]Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. 2024. GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. arXiv:2403.05527 [cs.LG]

[45]Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs/2001.08361 (2020). arXiv:2001.08361 https://arxiv.org/abs/2001.08361

[46]John Kim, Wiliam J. Dally, Steve Scott, and Dennis Abts. 2008. TechnologyDriven, Highly-Scalable Dragonfly Topology. In 2008 International Symposium on Computer Architecture. 77–88. https://doi.org/10.1109/ISCA.2008.19

[47]Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems 5 (2023).

[48]Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. https://openreview.net/forum?id=1NdN7eXyb4

[49]Heng Liao, Bingyang Liu, Xianping Chen, Zhigang Guo, Chuanning Cheng, Jianbing Wang, Xiangyu Chen, Peng Dong, Rui Meng, Wenjie Liu, Zhe Zhou, Ziyang Zhang, Yuhang Gai, Cunle Qian, Yi Xiong, Zhongwu Cheng, Jing Xia, Yuli Ma, Xi Chen, Wenhua Du, Shizhong Xiao, Chungang Li, Yong Qin, Liudong Xiong, Zhou Yu, Lv Chen, Lei Chen, Buyun Wang, Pei Wu, Junen Gao, Xiaochu Li, Jian He, Shizhuan Yan, and Bill McColl. 2025. UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture. arXiv:2503.20377 [cs.AR] https: //arxiv.org/abs/2503.20377

[50]Sean Lie. 2022. Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning : Cerebras Systems. In 2022 IEEE Hot Chips 34 Symposium (HCS). 1–34. https://doi.org/10.1109/HCS55958.2022.9895479

[51]Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. In MLSys.

[52]Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. arXiv preprint arXiv:2402.02750 (2024).

[53]Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. 2025. Large Language Model Agent: A Survey on Methodology, Applications and Challenges. arXiv preprint arXiv:2503.21460 (2025).

[54] Karthik Mandakolathur and Sylvain Jeaugey. 2022. Doubling all2all Performance with NVIDIA Collective Communication Library 2.12.  https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/

[55]Mistral. 2024. Cheaper, Better, Faster, Stronger: Continuing to push the frontier of AI and making it accessible to all. https://mistral.ai/news/mixtral-8x22b

[56] Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie Amy Yang, LeonGao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian,Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala, K. R. Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov, Maxim Naumov, Ajit Mathews, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, and Vijay Rao. 2023. Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models.http://arxiv.org/abs/2104.05158 arXiv:2104.05158 [cs].

[57] NVIDIA. 2022. Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async.  https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/

[58]NVIDIA. 2025. NVIDIA DGX Spark: A Grace Blackwell AI supercomputer on your desk. https://www.nvidia.com/en-us/products/workstations/dgx-spark/

[59] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o

[60] OpenAI. 2024. Introducing OpenAI o1. https://openai.com/o1/

[61] OpenAI. 2025. Introducing OpenAI o3 and o4-mini.  https://openai.com/index/introducing-o3-and-o4-mini/

[62] Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, and Dennis Cai. 2024. Alibaba HPN: A Data Center Network for Large Language Model Training. In Proceedings of the ACM SIGCOMM 2024 Conference (Sydney, NSW, Australia) (ACM SIGCOMM '24). Association for Computing Machinery, New York, NY, USA, 691–706.  https://doi.org/10.1145/3651890.3672265

[63]Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. 2024. Various lengths, constant speed: efficient language modeling with lightning attention. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML’24). JMLR.org, Article 1688, 19 pages.

[64]Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290 [cs.LG] https://arxiv.org/ abs/2305.18290

[65]Md Shafayat Rahman, Saptarshi Bhowmik, Yevgeniy Ryasnianskiy, Xin Yuan, and Michael Lang. 2019. Topology-custom UGAL routing on dragonfly. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 17, 15 pages. https://doi.org/10.1145/ 3295500.3356208

[66]Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, and Eric Chung. 2023. Microscaling Data Formats for Deep Learning. arXiv:2310.10537 [cs.LG] https://arxiv.org/abs/2310.10537

[67]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG] https://arxiv.org/abs/1707.06347

[68]ByteDance Seed. 2025. Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning. arXiv:2504.13914 [cs.CL] https://arxiv.org/abs/2504.13914

[69]Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 [cs.CL] https://arxiv.org/abs/2402.03300

[70]Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. CoRR abs/1911.02150 (2019). http://arxiv.org/abs/1911.02150

[71]Qwen Team. 2025. Qwen3: Think Deeper, Act Faster. https://github.com/ QwenLM/Qwen3

[72]Song Wang, Bing Yu, Wenwu Xiao, Fujun Bai, Xiaodong Long, Liang Bai, Xuerong Jia, Fengguo Zuo, Jie Tan, Yixin Guo, Peng Sun, Jun Zhou, Qiong Zhan, Sheng Hu, Yu Zhou, Yi Kang, Qiwei Ren, and Xiping Jiang. 2023. A 135 GBps/Gbit 0.66 pJ/bit Stacked Embedded DRAM with Multilayer Arrays by Fine Pitch Hybrid Bonding and Mini-TSV. In 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). 1–2. https://doi.org/10.23919/ VLSITechnologyandCir57934.2023.10185427

[73]xAI. 2024. Grok-2 Beta Release. https://x.ai/news/grok-2.

[74]xAI. 2024. Our Gigafactory of Compute:Colossus. https://x.ai/colossus.

[75]An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2024. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115 (2024).

[76]Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. 2025. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. https://arxiv.org/abs/2502.11089

[77]Chenggang Zhao, Liang Zhao, Jiashi Li, and Zhean Xu. 2025. DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling. https://github.com/deepseek-ai/DeepGEMM.

[78]Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library. https://github.com/deepseek-ai/DeepEP.

[79]Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, and Xin Liu. 2025. TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives. arXiv:2503.20313 [cs.DC] https://arxiv.org/abs/ 2503.20313

[80] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193–210.  https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin