hguyue1@gmail.com
I am currently an engineer at NVIDIA working on GPU architecture and systems for deep learning. Previously, I obtained my Ph.D. and Master's degrees from the University of California, Santa Barbara, where I worked with Prof. Zheng Zhang, Prof. Yufei Ding, and Prof. Yuan Xie. My Ph.D. research was on deep learning systems and architecture, with a particular focus on DL compilers and DL sparsity. I received my B.E. from the Department of Electronic Engineering at Tsinghua University in Beijing, China.
[MICRO’23] Guyue Huang, Zhengyang Wang, Po-An Tsai, Chen Zhang, Yufei Ding, Yuan Xie. RM-STC: Row-Merge Dataflow Inspired GPU Sparse Tensor Core for Energy-Efficient Sparse Acceleration. To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO-56), 2023.
[MLSys’23] Guyue Huang, Yang Bai, Liu Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie. ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs. Machine Learning and Systems (MLSys), 2023. [preprint][code]
[DAC’22] Guyue Huang, Haoran Li, Minghai Qin, Fei Sun, Yufei Ding, Yuan Xie. Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning. Design Automation Conference (DAC), 2022. [preprint][code][bibtex]
[ACM-SRC’21 Poster] Guyue Huang, Guohao Dai, Yu Wang, Yufei Ding, Yuan Xie. Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction. ACM Student Research Competition (SRC), 2021. Graduate 3rd Place. (https://src.acm.org)
[SC’20] Guyue Huang, Guohao Dai, Yu Wang, Huazhong Yang. GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks. The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2020. [preprint][code][bibtex]
My Ph.D. research is about supporting sparsity in AI/DL workloads on GPUs. Sparsity is a fascinating feature of modern deep learning: it holds great performance potential but is extremely difficult for hardware to exploit. I investigate software and architecture techniques to support many forms of sparsity, including weight sparsity, activation sparsity, graphs, embedding layers, and MoE. See RM-STC (to appear, MICRO 2023), Shfl-BW (DAC’22), DA-SpMM (DAC’22), and GE-SpMM (SC’20).
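To make the kernel-level challenge concrete, below is a minimal CUDA sketch of SpMM (multiplying a CSR-format sparse matrix by a dense matrix), the primitive underlying GNN aggregation that GE-SpMM and DA-SpMM accelerate. The kernel name, launch mapping, and one-thread-per-output scheme are illustrative assumptions, not the optimized designs from those papers.

```cuda
// Minimal CSR SpMM sketch: C[M x N] = A[M x K] (sparse, CSR) * B[K x N] (dense).
// Illustrative only; the one-thread-per-output mapping is an assumption, not the
// scheme used in GE-SpMM / DA-SpMM.
__global__ void csr_spmm_naive(int M, int N,
                               const int* __restrict__ row_ptr,   // CSR row offsets, size M+1
                               const int* __restrict__ col_idx,   // CSR column indices
                               const float* __restrict__ values,  // CSR nonzero values
                               const float* __restrict__ B,       // dense input,  K x N, row-major
                               float* __restrict__ C) {           // dense output, M x N, row-major
  int row = blockIdx.y;                              // one grid row per sparse-matrix row
  int col = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output column
  if (row >= M || col >= N) return;

  float acc = 0.0f;
  // Accumulate over the nonzeros of this row; neighboring threads read
  // consecutive columns of B, so the dense loads are coalesced.
  for (int p = row_ptr[row]; p < row_ptr[row + 1]; ++p) {
    acc += values[p] * B[col_idx[p] * N + col];
  }
  C[row * N + col] = acc;
}
```

A production kernel additionally has to balance work across rows with very different numbers of nonzeros and reuse the dense matrix through shared memory, which is exactly where adaptive workload balancing and parallel reduction come in.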
I also do research on deep learning compilers. I am interested in integrating advanced hardware features and analytical performance models into DL compilers to close the gap between compiler-generated and manually developed kernels on DL accelerators. I mainly work on the TVM stack. My recent work ALCOP (MLSys’23) studies how to realize load-compute pipelining through compiler automation.
ALCOP is short for Automatic Load-COmpute Pipelining. This project presents a TVM-based compiler pass that pipelines the data movement and computation in GPU kernels. Code released at this repo. Paper at this link.
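As a rough illustration of what the pass automates, the fragment below hand-writes a two-stage pipeline with double buffering in shared memory, overlapping the load of the next tile with computation on the current one. The kernel, names, and tile size are illustrative assumptions rather than ALCOP's generated code; on recent GPUs the load side would typically use asynchronous copy (cp.async), omitted here for brevity.

```cuda
#define TILE 256

// Toy double-buffered pipeline: while the current shared-memory tile is being
// consumed, the next tile is prefetched into the other buffer, so global-memory
// latency overlaps with computation. Launch with one block of TILE threads and
// an input of n_tiles * TILE floats. (Illustrative skeleton only.)
__global__ void pipelined_scaled_sum(const float* __restrict__ x,
                                     float* __restrict__ out,
                                     float scale, int n_tiles) {
  __shared__ float buf[2][TILE];
  int tid = threadIdx.x;

  // Prologue: stage the first tile before the main loop.
  buf[0][tid] = x[tid];
  __syncthreads();

  float acc = 0.0f;
  for (int t = 0; t < n_tiles; ++t) {
    int cur = t & 1;      // buffer holding tile t
    int nxt = cur ^ 1;    // buffer that will hold tile t + 1
    // Issue the prefetch of the next tile first ...
    if (t + 1 < n_tiles) {
      buf[nxt][tid] = x[(t + 1) * TILE + tid];
    }
    // ... then compute on the current tile; the load and the math can overlap.
    acc += scale * buf[cur][tid];
    __syncthreads();  // the prefetched tile must be complete before it is consumed
  }
  out[tid] = acc;  // per-thread result accumulated across all tiles
}
```

Writing such prologues, buffer rotation, and synchronization by hand for every kernel is tedious and error-prone, which is the gap the compiler pass is meant to close.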
Shfl-BW is a sparse NN kernel library together with a pattern pruning method. Code released at this repo. Paper at this link.
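For a flavor of pattern pruning, here is a generic block-wise magnitude-pruning kernel that zeroes any weight tile whose L1 norm falls below a threshold. The block shape, threshold rule, and names are assumptions for illustration only, not the tensor-core-aware pattern proposed in the Shfl-BW paper.

```cuda
#define BS 16  // tile is BS x BS; launch with BS*BS threads per block

// Score each BS x BS tile of W (rows x cols, row-major) and zero it if its
// L1 norm is below `threshold`. Grid: (cols/BS, rows/BS) blocks of BS*BS threads.
// Assumes rows and cols are multiples of BS. (Illustrative sketch only.)
__global__ void prune_blocks(float* W, int rows, int cols, float threshold) {
  int block_row = blockIdx.y * BS;   // top-left corner of this block's tile
  int block_col = blockIdx.x * BS;
  if (block_row >= rows || block_col >= cols) return;

  // Thread 0 scores the tile, then every thread helps zero it if pruned.
  __shared__ float l1;
  if (threadIdx.x == 0) {
    float s = 0.0f;
    for (int i = 0; i < BS; ++i)
      for (int j = 0; j < BS; ++j)
        s += fabsf(W[(block_row + i) * cols + (block_col + j)]);
    l1 = s;
  }
  __syncthreads();

  if (l1 < threshold) {
    int i = threadIdx.x / BS;  // each thread zeroes one element of the tile
    int j = threadIdx.x % BS;
    W[(block_row + i) * cols + (block_col + j)] = 0.0f;
  }
}
```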
The dgSPARSE project contains high-performance GPU kernels for sparse matrix primitives. We provide an interface to easily replace cuSPARSE in your existing applications. It contains