hguyue1@gmail.com
I am currently an engineer at NVIDIA working on GPU architecture and systems for deep learning. Previously, I obtained my Ph.D. and Master's degrees from the University of California, Santa Barbara, where I worked with Prof. Zheng Zhang, Prof. Yufei Ding, and Prof. Yuan Xie. My Ph.D. research was on deep learning systems and architecture, with a particular focus on DL compilers and DL sparsity. I received my B.E. from the Department of Electronic Engineering at Tsinghua University in Beijing, China.
[MICRO’23] Guyue Huang, Zhengyang Wang, Po-An Tsai, Chen Zhang, Yufei Ding, Yuan Xie. RM-STC: Row-Merge Dataflow Inspired GPU Sparse Tensor Core for Energy-Efficient Sparse Acceleration. To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO-56), 2023.
[MLSys’23] Guyue Huang, Yang Bai, Liu Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie. ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs. Machine Learning and Systems (MLSys), 2023. [preprint][code]
[DAC’22] Guyue Huang, Haoran Li, Minghai Qin, Fei Sun, Yufei Ding, Yuan Xie. Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning. Design Automation Conference (DAC), 2022. [preprint][code][bibtex]
[ACM-SRC’21 Poster] Guyue Huang, Guohao Dai, Yu Wang, Yufei Ding, Yuan Xie. Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction. ACM Student Research Competition (SRC), 2021. Graduate 3rd Place. (https://src.acm.org)
[SC’20] Guyue Huang, Guohao Dai, Yu Wang, Huazhong Yang. GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks. The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2020. [preprint][code][bibtex]
My Ph.D. research is about supporting sparsity in AI/DL workloads on GPUs. Sparsity is a fascinating feature of modern deep learning: it holds great performance potential but is extremely difficult for hardware to exploit. I investigate software and architecture techniques to support many forms of sparsity, including weight sparsity, activation sparsity, graphs, embedding layers, and MoE. See RM-STC (to appear, MICRO 2023), Shfl-BW (DAC’22), DA-SpMM (DAC’22), and GE-SpMM (SC’20).
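To make the kernel-level challenge concrete, below is a minimal CUDA sketch of SpMM (multiplying a CSR-format sparse matrix by a dense matrix), the primitive underlying GNN aggregation that GE-SpMM and DA-SpMM accelerate. The kernel name, launch mapping, and one-thread-per-output scheme are illustrative assumptions, not the optimized designs from those papers.

```cuda
// Minimal CSR SpMM sketch: C[M x N] = A[M x K] (sparse, CSR) * B[K x N] (dense).
// Illustrative only; the one-thread-per-output mapping is an assumption, not the
// scheme used in GE-SpMM / DA-SpMM.
__global__ void csr_spmm_naive(int M, int N,
                               const int* __restrict__ row_ptr,   // CSR row offsets, size M+1
                               const int* __restrict__ col_idx,   // CSR column indices
                               const float* __restrict__ values,  // CSR nonzero values
                               const float* __restrict__ B,       // dense input,  K x N, row-major
                               float* __restrict__ C) {           // dense output, M x N, row-major
  int row = blockIdx.y;                              // one grid row per sparse-matrix row
  int col = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output column
  if (row >= M || col >= N) return;

  float acc = 0.0f;
  // Accumulate over the nonzeros of this row; neighboring threads read
  // consecutive columns of B, so the dense loads are coalesced.
  for (int p = row_ptr[row]; p < row_ptr[row + 1]; ++p) {
    acc += values[p] * B[col_idx[p] * N + col];
  }
  C[row * N + col] = acc;
}
```

A production kernel additionally has to balance work across rows with very different numbers of nonzeros and reuse the dense matrix through shared memory, which is exactly where adaptive workload balancing and parallel reduction come in.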
I also do research on deep learning compilers. I am interested in integrating advanced hardware features and analytical performance models into DL compilers to close the gap between compiler-generated and manually developed kernels on DL accelerators. I mainly work on the TVM stack. My recent work ALCOP (MLSys’23) studies how to realize load-compute pipelining through compiler automation.
ALCOP is short for Automatic Load-COmpute Pipelining. This project presents a TVM-based compiler pass that pipelines the data movement and computation in GPU kernels. Code released at this repo. Paper at this link.
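As a rough illustration of what the pass automates, the fragment below hand-writes a two-stage pipeline with double buffering in shared memory, overlapping the load of the next tile with computation on the current one. The kernel, names, and tile size are illustrative assumptions rather than ALCOP's generated code; on recent GPUs the load side would typically use asynchronous copy (cp.async), omitted here for brevity.

```cuda
#define TILE 256

// Toy double-buffered pipeline: while the current shared-memory tile is being
// consumed, the next tile is prefetched into the other buffer, so global-memory
// latency overlaps with computation. Launch with one block of TILE threads and
// an input of n_tiles * TILE floats. (Illustrative skeleton only.)
__global__ void pipelined_scaled_sum(const float* __restrict__ x,
                                     float* __restrict__ out,
                                     float scale, int n_tiles) {
  __shared__ float buf[2][TILE];
  int tid = threadIdx.x;

  // Prologue: stage the first tile before the main loop.
  buf[0][tid] = x[tid];
  __syncthreads();

  float acc = 0.0f;
  for (int t = 0; t < n_tiles; ++t) {
    int cur = t & 1;      // buffer holding tile t
    int nxt = cur ^ 1;    // buffer that will hold tile t + 1
    // Issue the prefetch of the next tile first ...
    if (t + 1 < n_tiles) {
      buf[nxt][tid] = x[(t + 1) * TILE + tid];
    }
    // ... then compute on the current tile; the load and the math can overlap.
    acc += scale * buf[cur][tid];
    __syncthreads();  // the prefetched tile must be complete before it is consumed
  }
  out[tid] = acc;  // per-thread result accumulated across all tiles
}
```

Writing such prologues, buffer rotation, and synchronization by hand for every kernel is tedious and error-prone, which is the gap the compiler pass is meant to close.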
Shfl-BW is a sparse NN kernel library together with a pattern pruning method. Code released at this repo. Paper at this link.
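For a flavor of pattern pruning, here is a generic block-wise magnitude-pruning kernel that zeroes any weight tile whose L1 norm falls below a threshold. The block shape, threshold rule, and names are assumptions for illustration only, not the tensor-core-aware pattern proposed in the Shfl-BW paper.

```cuda
#define BS 16  // tile is BS x BS; launch with BS*BS threads per block

// Score each BS x BS tile of W (rows x cols, row-major) and zero it if its
// L1 norm is below `threshold`. Grid: (cols/BS, rows/BS) blocks of BS*BS threads.
// Assumes rows and cols are multiples of BS. (Illustrative sketch only.)
__global__ void prune_blocks(float* W, int rows, int cols, float threshold) {
  int block_row = blockIdx.y * BS;   // top-left corner of this block's tile
  int block_col = blockIdx.x * BS;
  if (block_row >= rows || block_col >= cols) return;

  // Thread 0 scores the tile, then every thread helps zero it if pruned.
  __shared__ float l1;
  if (threadIdx.x == 0) {
    float s = 0.0f;
    for (int i = 0; i < BS; ++i)
      for (int j = 0; j < BS; ++j)
        s += fabsf(W[(block_row + i) * cols + (block_col + j)]);
    l1 = s;
  }
  __syncthreads();

  if (l1 < threshold) {
    int i = threadIdx.x / BS;  // each thread zeroes one element of the tile
    int j = threadIdx.x % BS;
    W[(block_row + i) * cols + (block_col + j)] = 0.0f;
  }
}
```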
The dgSPARSE project contains high-performance GPU kernels for sparse matrix primitives. We provide an interface to easily replace cuSPARSE in your existing applications. It contains