Guolian Minsheng: Super nodes boost computing efficiency; domestic computing power may get a chance to "overtake on the curve"
In the AI Agent era, nonlinear growth in token demand may directly push AI computing power requirements beyond expectations, and super nodes are expected to become an important trend in AI computing power development. Domestic super nodes are developing rapidly and are poised to become an important opportunity for domestic computing power to "overtake on the curve."
Guolian Minsheng Securities released a research report stating that token demand in the AI Agent era is growing nonlinearly, which may directly push AI computing power demand beyond expectations; super nodes are expected to become an important trend in AI computing power development, and domestic super nodes, now developing rapidly, may become an important opportunity for domestic computing power to "overtake on the curve." The report recommends focusing on: 1) leading domestic super node vendors: Inspur Electronic Information Industry (000977.SZ), Dawning Information Industry (603019.SH), etc.; 2) the Huawei super node industry chain: iSoftStone Information Technology (301236.SZ), Digital China Group (000034.SZ), China Greatwall Technology Group (000066.SZ), Hydsoft Technology (301316.SZ), Talkweb Information System (002261.SZ), etc.; 3) domestic AI chips/CPUs: Cambricon (688256.SH), Hygon Information Technology (688041.SH), China Greatwall Technology Group, Horizon Robotics (688343.SH), Loongson Technology Corporation (688047.SH), etc.; 4) cloud computing: KINGSOFT CLOUD (03896), Wangsu Science & Technology (300017.SZ), UCloud (688158.SH), QingCloud Technology (688316.SH), etc.
Guolian Minsheng Securities' main points are as follows:
AI development drives architecture innovation in computing power; super nodes help improve computing efficiency
AI development drives architecture innovation in computing power: AI computing differs from traditional data center computing in that it is a continuously online intelligence production system. Its core performance depends on reasoning, context processing, and data movement efficiency, rather than just peak server compute. AI workloads must perform multi-step reasoning over very long contexts, putting pressure on platform capabilities at every level; at scale, tiny efficiency losses can severely affect cost, throughput, and competitiveness. Progress in AI computing can be described by three scaling laws: pre-training scaling lets models learn inherent knowledge; post-training scaling gives models reasoning ability through fine-tuning and reinforcement learning; and test-time scaling generates more tokens during inference to achieve deep reasoning.
Autoregressive inference in large models involves two stages with conflicting resource profiles: Prefill (compute-intensive) and Decode (memory-bandwidth-intensive). Super nodes can become an important foundation for Prefill/Decode (P/D) separation, and are therefore expected to become the core form of the next-generation AI computing power architecture. In the Decode stage, the key factor determining performance is no longer the GPU's peak compute, but how much data it can read from or write to memory per unit time. This directly affects a core user-experience metric, single-token generation latency, which determines the fluency of generated text. Hence the P/D-separated architecture: the new super node server architecture physically separates Prefill and Decode tasks internally and links them with a powerful internal interconnect network. Meanwhile, on the interconnect protocol side, technical advances can release physical bandwidth more effectively; for example, the OISA protocol promoted by China Mobile has moved beyond being a mere "data pipeline" and is evolving into an active participant in system management.
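Why Decode is bandwidth-bound can be illustrated with a back-of-envelope roofline model. All numbers below (model size, sharding, per-chip bandwidth) are illustrative assumptions, not figures from the report:

```python
# Roofline sketch of the Decode stage: generating one token requires
# streaming the (sharded) model weights plus KV cache from memory, so
# per-token latency is bounded by memory bandwidth, not peak FLOPS.

def decode_latency_ms(bytes_per_token: float, bandwidth_bytes_s: float) -> float:
    """Lower bound on single-token generation latency, in milliseconds."""
    return bytes_per_token / bandwidth_bytes_s * 1e3

# Assumed example: 70 GB of FP8 weights sharded over 8 chips, ~1 GB of
# KV cache read per chip, and 3.35 TB/s of HBM bandwidth per chip.
per_chip_bytes = 70e9 / 8 + 1e9
latency = decode_latency_ms(per_chip_bytes, 3.35e12)
print(f"{latency:.2f} ms per token")   # roughly 2.9 ms at these assumptions
```

Doubling the bandwidth in this model halves the per-token latency, which is why Decode-stage experience tracks memory bandwidth rather than peak compute.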
Taking NVIDIA's Rubin platform as an example: co-design is the foundation of the platform. GPU, CPU, network, security, software, power delivery, and cooling are built as one unified system rather than optimized independently. In this way, the Rubin platform treats the entire data center, not just individual GPU servers, as the computing unit. This approach lays a new foundation for efficient, secure, and predictable large-scale intelligence generation, ensuring that performance and efficiency hold up in actual production deployments, not just in isolated component benchmarks.
The flagship product of the Rubin platform is the Vera Rubin NVL72 rack-scale system, designed to run an entire rack as one coordinated machine inside larger AI factories. The NVL72 is optimized not only for peak performance but for continuous intelligence production: predictable latency, high utilization across heterogeneous execution stages, and the ability to efficiently convert power into usable intelligence.
The Rubin platform is built from six new chips, each designed for a specific role in the AI factory and intended from the start to work together as part of a unified rack-scale system. 1) NVIDIA Vera CPU: 88 NVIDIA custom-designed Olympus cores, fully Arm-compatible and optimized for the new generation of AI factories. 2) NVIDIA Rubin GPU: equipped with HBM4 and the new NVIDIA Transformer Engine for high-performance AI computing. 3) NVIDIA NVLink 6 switch: the sixth-generation scale-up network, providing up to 3.6 TB/s of GPU-to-GPU bandwidth. 4) NVIDIA ConnectX-9: a high-throughput, low-latency endpoint network interface supporting large-scale scale-out of AI applications. 5) NVIDIA BlueField-4 data processing unit (DPU): a dual-die package integrating a 64-core NVIDIA Grace CPU for infrastructure offload and security processing, with a built-in NVIDIA ConnectX-9 high-speed network chip for efficient data movement. 6) NVIDIA Spectrum-6 Ethernet switch: uses co-packaged optics to improve the efficiency and reliability of scale-out connections.
Domestic super nodes are developing at an accelerating pace and are expected to become an important opportunity for domestic computing power to "overtake on the curve."
Inspur Electronic Information Industry: the MetaSD200 is among the strongest domestic AI super node products for large-model inference. Running the DeepSeek R1 671B model on 64 domestic AI chips, the MetaSD200 achieves a single-user generation speed of 112 tokens/s with an input length of 4096 and an output length of 1024, and a single-token generation latency as low as 8.9 ms. It is the first domestic super node product to break the 10 ms mark, leading the industry in end-to-end large-model inference experience.
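The two reported figures are consistent with each other: for a steady single-user stream, inter-token latency is simply the reciprocal of throughput. A quick arithmetic check:

```python
# Sanity check: 112 tokens/s for a single user implies ~8.9 ms between
# tokens, matching the reported single-token generation latency.
throughput_tps = 112
inter_token_ms = 1000 / throughput_tps
print(f"{inter_token_ms:.1f} ms")   # prints "8.9 ms"
```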
Native innovation in hardware architecture: a self-developed multi-host, low-latency, memory-semantic communication architecture uses a 3D Mesh high-performance interconnect to build a highly scalable system supporting dense expansion to 64 domestic AI chips, with up to 4 TB of system memory and 64 TB of total memory. An innovative three-layer simplified interconnect protocol achieves message data utilization above 96% and a physical-layer bit error rate as low as 10^-12; pioneering global unified memory addressing and shadow-device technology enable cross-host GPU P2P direct access.
Optimizing communication and strengthening GPU-to-GPU interaction. 1) Simplified interconnect protocol: a three-layer stack consisting of transaction layer, data link layer, and physical layer. The transaction layer natively supports Load/Store memory semantics; the data link layer provides credit-based flow control and link-level error retransmission; the physical layer establishes a highly reliable channel with a bit error rate as low as 10^-12, achieving effective data utilization above 96%. 2) Global unified addressing: to solve communication across host domains, an independent switch-domain global address space gives GPUs in different host domains unified memory addressing within the switch domain, providing the basis for GPU mutual access. 3) Global address mapping and data routing: innovative shadow-device technology maps remote GPUs into the local host domain through shadow devices, so that every independent host can access global GPU memory, with cross-host P2P access achieved through efficient port forwarding.
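The global unified addressing idea above can be sketched as a flat switch-domain address space in which each GPU owns a fixed window; the window size and layout here are illustrative assumptions, not Inspur's actual design:

```python
# Hedged sketch of global unified GPU addressing: every GPU in the
# switch domain owns a fixed window of one flat address space, so a
# (host, gpu, offset) triple maps to a unique global address and back.

WINDOW = 1 << 40  # assumed 1 TiB address window per GPU (illustrative)

def to_global(host_id: int, gpus_per_host: int, gpu_id: int, offset: int) -> int:
    """Map a GPU-local offset into the flat switch-domain address space."""
    assert offset < WINDOW
    return (host_id * gpus_per_host + gpu_id) * WINDOW + offset

def from_global(addr: int, gpus_per_host: int):
    """Route a global address back to (host_id, gpu_id, local offset)."""
    slot, offset = divmod(addr, WINDOW)
    host_id, gpu_id = divmod(slot, gpus_per_host)
    return host_id, gpu_id, offset

# Round trip: GPU 3 on host 5, offset 0x1000, with 8 GPUs per host.
addr = to_global(5, 8, 3, 0x1000)
assert from_global(addr, 8) == (5, 3, 0x1000)
```

A switch routing on such a flat address space is what lets a remote GPU appear as a local "shadow device": the local host issues ordinary Load/Store operations and the fabric resolves which host and GPU actually owns the target window.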
Dawning Information Industry: a full-coverage product matrix for all scenarios and an open ecosystem to advance domestic computing power. Dawning has launched the world's first cable-free box-type super node, the scaleX40. It adopts an orthogonal, cable-free, single-tier interconnect architecture in which computing nodes plug directly into the switch nodes, fundamentally eliminating the performance loss and operational risks caused by cabling.
A single scaleX40 node integrates 40 GPUs, with total compute exceeding 28 PFLOPS (FP8), total HBM capacity exceeding 5 TB, and total memory access bandwidth exceeding 80 TB/s, forming a high-density computing unit for training and inference of trillion-parameter large models.
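Dividing the node totals by the 40 GPUs gives the implied per-GPU figures (assuming an even split across GPUs):

```python
# Implied per-GPU resources in a scaleX40 node (even split assumed).
gpus = 40
print(28 / gpus)        # PFLOPS (FP8) per GPU      -> 0.7
print(5 * 1024 / gpus)  # GB of HBM per GPU         -> 128.0
print(80 / gpus)        # TB/s of bandwidth per GPU -> 2.0
```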
At the deployment level, the product adopts a standard 19-inch box design that decouples the computing unit from the cabinet, shortening the deployment cycle from months to hours and markedly improving delivery efficiency. System reliability reaches 99.99%, and signal loss and overall energy consumption are optimized for high-density scenarios, effectively reducing long-term operating costs.
Huawei Ascend: from a 384-card super node to a million-card cluster, the domestic computing leader is building a solid compute foundation.
Atlas 900 AI super node: equipped with 384 Ascend 910C AI chips for a total of 300 PFLOPS (FP8), using the self-developed LingQuan 1.0 all-optical interconnect protocol; it is the mainstream compute product deployed in domestic intelligent computing centers and large-model training scenarios.
Atlas 950 AI super node: a flagship super node for trillion-parameter large-model training and inference, equipped with 8192 Ascend 950DT AI chips; total compute reaches 8 EFLOPS at FP8 and 16 EFLOPS at FP4, with 1152 TB of total memory; the interconnect is upgraded to the LingQuan 2.0 protocol, with 16.3 PB/s of total interconnect bandwidth.
Atlas 960 AI super node: a large-scale flagship super node designed for AGI scenarios, equipped with 15488 Ascend 960 AI chips; total compute reaches 30 EFLOPS at FP8 and 60 EFLOPS at FP4, with 4460 TB of total memory; using the LingQuan 2.0 interconnect protocol, total interconnect bandwidth is upgraded to 34 PB/s. It can form the largest million-card SuperCluster, with total compute of 2 ZFLOPS at FP8 and 4 ZFLOPS at FP4.
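Dividing the quoted totals by the chip counts is a useful sanity check on the scaling from super node to cluster (even distribution assumed; the per-chip figures are derived here, not quoted by the report):

```python
# Per-chip FP8 compute implied by the quoted totals, in PFLOPS.
atlas_950 = 8e18 / 8192 / 1e15    # Atlas 950 -> ~0.98 PFLOPS per 950DT
atlas_960 = 30e18 / 15488 / 1e15  # Atlas 960 -> ~1.94 PFLOPS per 960
cluster = 2e21 / 1e6 / 1e15       # million-card cluster -> 2.0 PFLOPS/card
print(round(atlas_950, 2), round(atlas_960, 2), round(cluster, 2))
# prints "0.98 1.94 2.0"
```

The per-card figure implied by the million-card cluster total (~2 PFLOPS) is close to the per-chip figure implied by the Atlas 960 (~1.94 PFLOPS), so the cluster number is roughly a linear scale-up of the super node building block.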
TaiShan 950 general-purpose computing super node: a super node product designed for general computing scenarios such as finance and government, supporting up to 32 Kunpeng 950 general-purpose processors, up to 48 TB of system memory, and memory/SSD/DPU pooling.