Senior Director, Research & Distinguished Engineer @ RBC Borealis
I build AI solutions that generate value for others, and I conduct fundamental research and write classical-style Chinese poems as creative outlets for myself.
At RBC Borealis, my job is “simple”: 1) find ways AI can create or optimize value, 2) scale teams, technology, and business to make it happen.
The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
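The abstract turns on the observation that the exponent bits of floating-point weights carry little entropy. As a rough illustration of that compressibility (not the paper's implementation, which uses a fast entropy coder inside the training loop), the sketch below splits bfloat16 weights into their two bytes and compresses the exponent-carrying bytes with zlib; the tensor is a random placeholder.

```python
# Toy illustration only: zlib stands in for the fast entropy coder used in the
# paper, and the random tensor stands in for real layer weights.
import zlib
import torch

torch.manual_seed(0)
w = torch.randn(1024, 1024, dtype=torch.bfloat16)        # pretend layer weights

raw = w.view(torch.uint8).reshape(-1, 2)                 # two bytes per bf16 value
# bfloat16 layout is [sign | 8-bit exponent | 7-bit mantissa]; on little-endian
# machines byte 1 holds the sign and most of the exponent, byte 0 the mantissa bits.
exp_bytes = raw[:, 1].contiguous().numpy().tobytes()
man_bytes = raw[:, 0].contiguous().numpy().tobytes()

print("exponent-side bytes:", len(exp_bytes), "->", len(zlib.compress(exp_bytes, 9)))
print("mantissa-side bytes:", len(man_bytes), "->", len(zlib.compress(man_bytes, 9)))

# The coding is lossless, so the weights are reconstructed exactly on decompression.
assert zlib.decompress(zlib.compress(exp_bytes, 9)) == exp_bytes
```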
@article{hao2024neuzip,title={NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks},author={Hao, Yongchang and Cao, Yanshuai and Mou, Lili},journal={arXiv preprint arXiv:2410.20650},year={2024},}
NeurIPS
Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models
Sadegh Mahdavi, Raquel Aoki, Keyi Tang, and Yanshuai Cao
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
Large Language Models (LLMs) have shown remarkable performance in various natural language tasks, but they often struggle with planning problems that require structured reasoning. To address this limitation, the conversion of planning problems into the Planning Domain Definition Language (PDDL) has been proposed as a potential solution, enabling the use of automated planners. However, generating accurate PDDL files typically demands human inputs or correction, which can be time-consuming and costly. In this paper, we propose a novel approach that leverages LLMs and environment feedback to automatically generate PDDL domain and problem description files without the need for human intervention. Our method introduces an iterative refinement process that generates multiple problem PDDL candidates and progressively refines the domain PDDL based on feedback obtained from interacting with the environment. To guide the refinement process, we develop an Exploration Walk (EW) metric, which provides rich feedback signals for LLMs to update the PDDL file. We evaluate our approach on PDDL environments. We achieve an average task solve rate of 66% compared to a 29% solve rate by GPT-4’s intrinsic planning with chain-of-thought prompting. Our work enables the automated modeling of planning environments using LLMs and environment feedback, eliminating the need for human intervention in the PDDL translation process and paving the way for more reliable LLM agents in challenging problems. Our code is available at https://github.com/BorealisAI/llm-pddl-planning.
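As a purely schematic view of the loop the abstract describes, the sketch below wires together hypothetical stubs for the LLM calls, the planner, and the Exploration Walk score; none of these helpers are the paper's actual functions (those live in the linked repository).

```python
# Schematic of the generate-and-refine loop; every helper below is a
# hypothetical stub returning placeholder values so the outer loop runs.

def llm_generate_domain(nl_description: str, feedback: str = "") -> str:
    return "(define (domain stub) ...)"            # placeholder PDDL text

def llm_generate_problems(nl_description: str, domain_pddl: str, k: int = 4) -> list[str]:
    return ["(define (problem stub) ...)"] * k     # k candidate problem files

def exploration_walk_score(domain_pddl: str, problem_pddl: str, env) -> float:
    return 0.0                                     # stand-in for the paper's EW feedback signal

def plan_and_execute(domain_pddl: str, problem_pddl: str, env) -> bool:
    return False                                   # run an off-the-shelf planner, execute in env

def refine(nl_description, env, n_rounds: int = 5):
    domain, best_problem = llm_generate_domain(nl_description), None
    for _ in range(n_rounds):
        problems = llm_generate_problems(nl_description, domain)
        scored = [(exploration_walk_score(domain, p, env), p) for p in problems]
        best_score, best_problem = max(scored)     # keep the candidate closest to the environment
        if any(plan_and_execute(domain, p, env) for _, p in scored):
            break                                  # a task was solved; stop refining
        domain = llm_generate_domain(nl_description, feedback=f"EW score {best_score:.2f}")
    return domain, best_problem
```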
@inproceedings{mahdavi2024leveragingenvironmentinteractionautomated,title={Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models},author={Mahdavi, Sadegh and Aoki, Raquel and Tang, Keyi and Cao, Yanshuai},booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},year={2024},archiveprefix={arXiv},primaryclass={cs.LG},url={https://arxiv.org/abs/2407.12979},}
NeurIPS
Do LLMs Build World Representations? Probing Through the Lens of State Abstraction
Zichao Li, Yanshuai Cao, and Jackie CK Cheung
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
How do large language models (LLMs) encode the state of the world, including the status of entities and their relations, as described by a text? While existing work directly probes for a complete state of the world, our research explores whether and how LLMs abstract this world state in their internal representations. We propose a new framework for probing for world representations through the lens of state abstraction theory from reinforcement learning, which emphasizes different levels of abstraction, distinguishing between general abstractions that facilitate predicting future states and goal-oriented abstractions that guide the subsequent actions to accomplish tasks. To instantiate this framework, we design a text-based planning task, where an LLM acts as an agent in an environment and interacts with objects in containers to achieve a specified goal state. Our experiments reveal that fine-tuning as well as advanced pre-training strengthens LLM-built representations’ tendency of maintaining goal-oriented abstractions during decoding, prioritizing task completion over recovery of the world’s state and dynamics.
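For readers unfamiliar with probing, the snippet below shows the generic linear-probe setup such experiments build on: a logistic-regression classifier trained on frozen hidden states to predict an abstracted world-state property. The features and labels are random stand-ins, and the paper's actual probe tasks and abstraction levels differ.

```python
# Generic linear probe on frozen LLM hidden states (illustrative data only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2000, 768))        # decoder hidden states, one row per example
labels = rng.integers(0, 2, size=2000)       # abstracted property, e.g. "goal object in container"

X_tr, X_te, y_tr, y_te = train_test_split(hidden, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # ~chance here; above chance on real states
```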
@inproceedings{lillms,title={Do LLMs Build World Representations? Probing Through the Lens of State Abstraction},author={Li, Zichao and Cao, Yanshuai and Cheung, Jackie CK},booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},year={2024},}
EMNLP
Jump Starting Bandits with LLM-Generated Prior Knowledge
Parand A. Alamdari, Yanshuai Cao, and Kevin H. Wilson
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
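To make the jump start concrete, here is a minimal sketch under simplifying assumptions: a standard disjoint LinUCB bandit is first updated on offline (context, arm, reward) tuples that stand in for LLM-simulated preferences, then used online. The simulated data and dimensions below are illustrative, not the paper's setup.

```python
# Warm-starting a linear contextual bandit with offline, LLM-simulated data.
import numpy as np

class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm ridge matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

rng = np.random.default_rng(0)
bandit = LinUCB(n_arms=3, dim=8)

# Offline warm start: pretend these tuples came from prompting an LLM to role-play users.
for _ in range(500):
    x = rng.normal(size=8)
    arm = int(rng.integers(3))
    reward = float(rng.random() < 0.5)           # stand-in for an LLM-simulated preference
    bandit.update(arm, x, reward)

# Online phase: the warm-started parameters reduce early regret.
x = rng.normal(size=8)
print("chosen arm for a new user context:", bandit.select(x))
```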
@inproceedings{alamdari-etal-2024-jump,title={Jump Starting Bandits with {LLM}-Generated Prior Knowledge},author={Alamdari, Parand A. and Cao, Yanshuai and Wilson, Kevin H.},editor={Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},month=nov,year={2024},address={Miami, Florida, USA},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2024.emnlp-main.1107},doi={10.18653/v1/2024.emnlp-main.1107},pages={19821--19833},}
ICML
Flora: Low-Rank Adapters Are Secretly Gradient Compressors
Yongchang Hao, Yanshuai Cao, and Lili Mou
In Forty-first International Conference on Machine Learning, Jul 2024
Despite large neural networks demonstrating remarkable abilities to complete different tasks, they require excessive memory usage to store the optimization states for training. To alleviate this, low-rank adaptation (LoRA) has been proposed to reduce the optimization states by training fewer parameters. However, LoRA restricts overall weight update matrices to be low-rank, limiting the model performance. In this work, we investigate the dynamics of LoRA and identify that it can be approximated by a random projection. Based on this observation, we propose Flora, which is able to achieve high-rank updates by resampling the projection matrices while enjoying the sublinear space complexity of optimization states. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach.
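A toy numpy rendering of the mechanism the abstract describes may help: gradients are compressed with a seeded random projection, momentum is accumulated in the compressed space, and the projection is resampled periodically so updates are not confined to a single low-rank subspace. Shapes, rank, and the resampling interval are arbitrary illustrative choices, not the paper's settings.

```python
# Random-projection gradient compression with periodic resampling (toy scale).
import numpy as np

rng = np.random.default_rng(0)
m, n, r, tau = 256, 512, 16, 10          # weight shape (m x n), rank r, resample interval tau
W = rng.normal(size=(m, n)) * 0.02
momentum_c = np.zeros((m, r))            # momentum stored compressed: m*r instead of m*n
beta, lr = 0.9, 1e-2

def projection(step):
    # Regenerating P from a seed means only the seed needs to be stored.
    return np.random.default_rng(step // tau).normal(size=(n, r)) / np.sqrt(r)

for step in range(100):
    P = projection(step)
    if step % tau == 0 and step > 0:
        # Move existing momentum into the new subspace: decompress with the old
        # projection, then re-compress with the new one.
        momentum_c = (momentum_c @ projection(step - 1).T) @ P
    grad = rng.normal(size=(m, n))                      # stand-in for the true gradient
    momentum_c = beta * momentum_c + grad @ P           # accumulate in compressed space
    W -= lr * (momentum_c @ P.T)                        # decompress only to apply the update
```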
@inproceedings{hao2024flora,title={Flora: Low-Rank Adapters Are Secretly Gradient Compressors},author={Hao, Yongchang and Cao, Yanshuai and Mou, Lili},booktitle={Forty-first International Conference on Machine Learning},url={https://arxiv.org/abs/2402.03293},year={2024},}
ACL
Code Generation from Natural Language with Less Prior Knowledge and More Monolingual Data
Sajad Norouzi, Keyi Tang, and Yanshuai Cao
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Aug 2021
Training datasets for semantic parsing are typically small due to the higher expertise required for annotation than most other NLP tasks. As a result, models for this application usually need additional prior knowledge to be built into the architecture or algorithm. The increased dependency on human experts hinders automation and raises the development and maintenance costs in practice. This work investigates whether a generic transformer-based seq2seq model can achieve competitive performance with minimal code-generation-specific inductive bias design. By exploiting a relatively sizeable monolingual corpus of the target programming language, which is cheap to mine from the web, we achieved 81.03% exact match accuracy on Django and 32.57 BLEU score on CoNaLa. Both are SOTA to the best of our knowledge. This positive evidence highlights a potentially easier path toward building accurate semantic parsers in practice.
@inproceedings{norouzi-etal-2021-code,title={Code Generation from Natural Language with Less Prior Knowledge and More Monolingual Data},author={Norouzi, Sajad and Tang, Keyi and Cao, Yanshuai},booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},month=aug,year={2021},address={Online},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2021.acl-short.98},doi={10.18653/v1/2021.acl-short.98},pages={776--785},}
ACL
Optimizing Deeper Transformers on Small Datasets
Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon J.D. Prince, and Yanshuai Cao
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Aug 2021
It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
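The sketch below shows only the general shape of a data-dependent initialization: estimate an input statistic from a real batch, then shrink the residual-branch weights of the from-scratch layers by a factor that depends on that statistic and on the depth. The specific scale used here is a simplified placeholder, not the DT-Fixup factor derived in the paper.

```python
# General shape of a data-dependent initialization on a toy transformer stack.
import torch
import torch.nn as nn

N, d = 24, 256                                    # from-scratch layers, model dimension
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True) for _ in range(N)
)

with torch.no_grad():
    batch = torch.randn(32, 64, d)                # stand-in for pre-trained encoder outputs
    mu = batch.norm(dim=-1).max()                 # data-dependent statistic: max input norm
    scale = 1.0 / (mu * N ** 0.5)                 # placeholder factor, NOT the paper's formula
    for layer in layers:
        # Shrink the weights feeding each residual branch: the attention output
        # projection and the feed-forward block.
        for w in (layer.self_attn.out_proj.weight, layer.linear1.weight, layer.linear2.weight):
            w.mul_(scale)
```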
@inproceedings{xu-etal-2021-optimizing,title={Optimizing Deeper Transformers on Small Datasets},author={Xu, Peng and Kumar, Dhruv and Yang, Wei and Zi, Wenjie and Tang, Keyi and Huang, Chenyang and Cheung, Jackie Chi Kit and Prince, Simon J.D. and Cao, Yanshuai},booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},month=aug,year={2021},address={Online},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2021.acl-long.163},doi={10.18653/v1/2021.acl-long.163},pages={2089--2102},}
AISTATS
Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer
Yanshuai Cao* and Peng Xu*
In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 26–28 Aug 2020
In this work, we develop a novel regularizer to improve the learning of long-range dependency of sequence data. Applied on language modelling, our regularizer expresses the inductive bias that sequence variables should have high mutual information even though the model might not see abundant observations for complex long-range dependency. We show how the "next sentence prediction (classification)" heuristic can be derived in a principled way from our mutual information estimation framework, and be further extended to maximize the mutual information of sequence variables. The proposed approach is not only effective at increasing the mutual information of segments under the learned model but, more importantly, also leads to a higher likelihood on holdout data and improved generation quality. Code is released at https://github.com/BorealisAI/BMI.
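For intuition, here is a generic contrastive (InfoNCE-style) lower bound on the mutual information between adjacent segments, in the spirit of the regularizer described above; the toy GRU encoder, bilinear critic, and random inputs are stand-ins, and the paper's estimator and training procedure differ (see the linked repository).

```python
# InfoNCE-style mutual information lower bound between adjacent text segments.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, d = 32, 128
encode = torch.nn.GRU(input_size=d, hidden_size=d, batch_first=True)   # toy segment encoder
W = torch.nn.Parameter(torch.eye(d))                                    # bilinear critic

seg_a = torch.randn(B, 20, d)       # first segment of each document (stand-in embeddings)
seg_b = torch.randn(B, 20, d)       # the segment that follows it

_, ha = encode(seg_a)
_, hb = encode(seg_b)
ha, hb = ha.squeeze(0), hb.squeeze(0)               # (B, d) segment summaries

scores = ha @ W @ hb.t()                            # (B, B): true pairs on the diagonal
labels = torch.arange(B)
mi_lower_bound = torch.log(torch.tensor(float(B))) - F.cross_entropy(scores, labels)
loss = -mi_lower_bound                              # maximize the bound during training
loss.backward()
```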
@inproceedings{pmlr-v108-cao20a,title={Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer},author={Cao, Yanshuai and Xu, Peng},booktitle={Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics},pages={3991--4001},year={2020},editor={Chiappa, Silvia and Calandra, Roberto},volume={108},series={Proceedings of Machine Learning Research},address={Online},month={26--28 Aug},publisher={PMLR},}
ICML
Evaluating Lossy Compression Rates of Deep Generative Models
Sicong Huang*, Alireza Makhzani*, Yanshuai Cao, and Roger Grosse
In Proceedings of the 37th International Conference on Machine Learning, 13–18 Jul 2020
The field of deep generative modeling has succeeded in producing astonishingly realistic-seeming images and audio, but quantitative evaluation remains a challenge. Log-likelihood is an appealing metric due to its grounding in statistics and information theory, but it can be challenging to estimate for implicit generative models, and scalar-valued metrics give an incomplete picture of a model’s quality. In this work, we propose to use rate distortion (RD) curves to evaluate and compare deep generative models. While estimating RD curves is seemingly even more computationally demanding than log-likelihood estimation, we show that we can approximate the entire RD curve using nearly the same computations as were previously used to achieve a single log-likelihood estimate. We evaluate lossy compression rates of VAEs, GANs, and adversarial autoencoders (AAEs) on the MNIST and CIFAR10 datasets. Measuring the entire RD curve gives a more complete picture than scalar-valued metrics, and we arrive at a number of insights not obtainable from log-likelihoods alone.
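To unpack what a single point on such a curve measures, the toy snippet below computes rate (KL of a Gaussian posterior from the prior) and distortion (expected reconstruction error) for one input under an untrained stand-in VAE. The paper's contribution is estimating entire curves for arbitrary, including implicit, generative models via annealed importance sampling, which this sketch does not attempt.

```python
# One (rate, distortion) point for a Gaussian-posterior VAE (illustrative only).
import torch

torch.manual_seed(0)
d_x, d_z = 784, 32
encoder = torch.nn.Linear(d_x, 2 * d_z)              # outputs mean and log-variance
decoder = torch.nn.Linear(d_z, d_x)

x = torch.rand(1, d_x)                                # a test image (stand-in)
mu, logvar = encoder(x).chunk(2, dim=-1)

# Rate: KL between the diagonal-Gaussian posterior and a standard normal prior (in nats).
rate = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum()

# Distortion: Monte-Carlo estimate of the expected squared reconstruction error.
z = mu + logvar.mul(0.5).exp() * torch.randn(64, d_z)
distortion = (decoder(z) - x).pow(2).sum(dim=-1).mean()

print(f"rate = {rate.item():.1f} nats, distortion = {distortion.item():.1f}")
```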
@inproceedings{pmlr-v119-huang20c,title={Evaluating Lossy Compression Rates of Deep Generative Models},author={Huang, Sicong and Makhzani, Alireza and Cao, Yanshuai and Grosse, Roger},booktitle={Proceedings of the 37th International Conference on Machine Learning},pages={4444--4454},year={2020},editor={III, Hal Daumé and Singh, Aarti},volume={119},series={Proceedings of Machine Learning Research},address={Virtual},month={13--18 Jul},publisher={PMLR},}
ICML
On Variational Learning of Controllable Representations for Text without Supervision
Peng Xu, Jackie Chi Kit Cheung, and Yanshuai Cao
In Proceedings of the 37th International Conference on Machine Learning, 13–18 Jul 2020
The variational autoencoder (VAE) can learn the manifold of natural images on certain datasets, as evidenced by meaningful interpolating or extrapolating in the continuous latent space. However, on discrete data such as text, it is unclear if unsupervised learning can discover similar latent space that allows controllable manipulation. In this work, we find that sequence VAEs trained on text fail to properly decode when the latent codes are manipulated, because the modified codes often land in holes or vacant regions in the aggregated posterior latent space, where the decoding network fails to generalize. Both as a validation of the explanation and as a fix to the problem, we propose to constrain the posterior mean to a learned probability simplex, and perform manipulation within this simplex. Our proposed method mitigates the latent vacancy problem and achieves the first success in unsupervised learning of controllable representations for text. Empirically, our method outperforms unsupervised baselines and strong supervised approaches on text style transfer, and is capable of performing more flexible fine-grained control over text generation than existing methods.
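A minimal sketch of the constraint, under assumed dimensions and a stand-in encoder: the posterior mean is expressed as a convex combination of K learned basis vectors, so manipulation happens in the simplex weights and the edited code stays in a region the decoder has seen.

```python
# Constraining the posterior mean to a learned probability simplex (toy version).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_z, K = 64, 10
basis = torch.nn.Parameter(torch.randn(K, d_z))       # learned vertices of the simplex

raw_mean = torch.randn(8, d_z)                         # unconstrained encoder means (stand-in)
weights = F.softmax(raw_mean @ basis.t(), dim=-1)      # (B, K) points on the simplex
z_mean = weights @ basis                               # constrained posterior mean, (B, d_z)

# Manipulation happens in weight space: pushing mass toward one vertex moves the
# code along a learned direction while keeping it on the simplex.
edited = weights.detach().clone()
edited[:, 0] += 0.5
z_edit = F.normalize(edited, p=1, dim=-1) @ basis
```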
@inproceedings{pmlr-v119-xu20a,title={On Variational Learning of Controllable Representations for Text without Supervision},author={Xu, Peng and Cheung, Jackie Chi Kit and Cao, Yanshuai},booktitle={Proceedings of the 37th International Conference on Machine Learning},pages={10534--10543},year={2020},editor={III, Hal Daumé and Singh, Aarti},volume={119},series={Proceedings of Machine Learning Research},address={Virtual},month={13--18 Jul},publisher={PMLR},url={http://proceedings.mlr.press/v119/xu20a.html},}
ICLR
Improving GAN Training via Binarized Representation Entropy (BRE) Regularization
Yanshuai Cao, Gavin Weiguang Ding, Kry Yik-Chau Lui, and Ruitong Huang
In International Conference on Learning Representations, 2018
We propose a novel regularizer to improve the training of Generative Adversarial Networks (GANs). The motivation is that when the discriminator D spreads out its model capacity in the right way, the learning signals given to the generator G are more informative and diverse. These in turn help G to explore better and discover the real data manifold while avoiding large unstable jumps due to the erroneous extrapolation made by D. Our regularizer guides the rectifier discriminator D to better allocate its model capacity, by encouraging the binary activation patterns on selected internal layers of D to have a high joint entropy. Experimental results on both synthetic data and real datasets demonstrate improvements in stability and convergence speed of the GAN training, as well as higher sample quality. The approach also leads to higher classification accuracies in semi-supervised learning.
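A simplified rendering of such a regularizer (faithful in spirit to BRE, but not a line-by-line reproduction of the paper's formulation): softly binarize the activations of a chosen internal layer of D, then penalize units that are always on or off across the batch and pairs of samples whose binary activation patterns are highly correlated.

```python
# Simplified binarized-representation-entropy penalty for a discriminator layer.
import torch

def bre_penalty(h: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # h: (batch, units) pre-activations of an internal rectifier layer of D.
    s = h / (h.abs() + eps)                          # soft sign in (-1, 1)
    marginal = s.mean(dim=0).abs().mean()            # each unit should be on roughly half the time
    corr = (s @ s.t()) / s.shape[1]                  # (batch, batch) pattern similarity
    off_diag = corr - torch.diag_embed(torch.diagonal(corr))
    pairwise = off_diag.abs().sum() / (s.shape[0] * (s.shape[0] - 1))
    return marginal + pairwise                       # add to the discriminator loss

h = torch.randn(64, 512, requires_grad=True)         # stand-in activations
bre_penalty(h).backward()
```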
@article{Cao2018Improving,title={Improving GAN Training via Binarized Representation Entropy (BRE) Regularization},author={Cao, Yanshuai and Ding, Gavin Weiguang and Lui, Kry Yik-Chau and Huang, Ruitong},journal={International Conference on Learning Representations},year={2018},}
Thesis
Scaling Gaussian Processes
Yanshuai Cao
PhD thesis, University of Toronto, 2018
We explore ways to scale Gaussian processes (GP) to large datasets. Two methods with different theoretical and practical motivations are proposed. The first method solves the open problem of efficient discrete inducing set selection in the context of inducing point based approximation to full GPs. When inducing points need to be chosen from the training set, the proposed method is the only principled approach to date for joint tuning of inducing set and GP hyperparameters while scaling linearly in the training set size during learning. Empirically, it achieves a trade-off between speed and accuracy that is comparable to other state-of-the-art inducing point GP methods. The second method is a novel framework for building flexible probabilistic prediction models based on GPs that is simple to parallelize and highly scalable. Referred to as transductive fusion, this second approach learns separate GP experts whose predictions are combined in ways that depend on test point locations. A number of new models are proposed in this new framework. Learning and inference in these new models are straightforwardly parallel, and predictive accuracy is shown to be satisfactory empirically.
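As a small illustration of the transductive-fusion idea (with one simple precision-based weighting chosen for the example, not the specific fusion rules developed in the thesis), the sketch below trains independent GP experts on chunks of the data and combines their predictions with test-dependent weights.

```python
# Independent GP experts combined with test-location-dependent weights.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(600, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=600)

# Train one expert per chunk of the training set (embarrassingly parallel).
chunks = np.array_split(rng.permutation(600), 4)
experts = [
    GaussianProcessRegressor(RBF(1.0) + WhiteKernel(0.01)).fit(X[idx], y[idx])
    for idx in chunks
]

X_test = np.linspace(-4, 4, 50).reshape(-1, 1)
means, stds = zip(*(e.predict(X_test, return_std=True) for e in experts))
means, variances = np.stack(means), np.stack(stds) ** 2

weights = (1.0 / variances) / (1.0 / variances).sum(axis=0)   # precision-weighted, per test point
fused_mean = (weights * means).sum(axis=0)
print("fused prediction near x=0:", fused_mean[25])
```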
@phdthesis{cao_yanshuai_gp_thesis,author={Cao, Yanshuai},title={Scaling Gaussian Processes},school={University of Toronto},year={2018},}
ICLR
Adversarial Manipulation of Deep Representations
Sara Sabour*, Yanshuai Cao*, Fartash Faghri, and David J. Fleet
In International Conference on Learning Representations, 2016
We show that the image representations in a deep neural network (DNN) can be manipulated to mimic those of other natural images, with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image perturbations designed to produce erroneous class labels. Here we instead concentrate on the internal layers of DNN representations, to produce a new class of adversarial images that differs qualitatively from others. While the adversary is perceptually similar to one image, its internal representation appears remarkably similar to a different image, from a different class and bearing little if any apparent similarity to the input. Further, they appear generic and consistent with the space of natural images. This phenomenon demonstrates the possibility to trick a DNN to confound almost any image with any other chosen image, and raises questions about DNN representations, as well as the properties of natural images themselves.
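The attack is easy to state as an optimization problem, and the hedged sketch below re-implements that statement with a generic torchvision ResNet-18 and an arbitrary feature layer (the paper's experiments use different networks and layers): drive an internal representation toward that of a guide image while keeping the perturbation inside a small L-infinity ball.

```python
# Feature-space adversarial perturbation under an L-infinity constraint (generic setup).
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()                 # untrained here to avoid a download;
                                                      # in practice the attack targets a trained net
feature_extractor = torch.nn.Sequential(*list(model.children())[:-2])   # up to the last conv block

source = torch.rand(1, 3, 224, 224)                   # stand-ins for real images
guide = torch.rand(1, 3, 224, 224)
eps, steps, lr = 8 / 255, 50, 1e-2

delta = torch.zeros_like(source, requires_grad=True)
target_feat = feature_extractor(guide).detach()
opt = torch.optim.Adam([delta], lr=lr)

for _ in range(steps):
    opt.zero_grad()
    loss = (feature_extractor(source + delta) - target_feat).pow(2).sum()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                               # stay inside the L-infinity ball
        delta.copy_((source + delta).clamp(0, 1) - source)    # keep pixels in a valid range

print("perturbation max:", delta.abs().max().item())
```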
@inproceedings{sabour15,title={Adversarial Manipulation of Deep Representations},author={Sabour, Sara and Cao, Yanshuai and Faghri, Fartash and Fleet, David J.},booktitle={4th International Conference on Learning Representations},year={2016},url={http://arxiv.org/abs/1511.05122},}
NeurIPS
Efficient Optimization for Sparse Gaussian Process Regression
Yanshuai Cao, Marcus A Brubaker, David J Fleet, and Aaron Hertzmann
In Advances in Neural Information Processing Systems, Dec 2013
We propose an efficient discrete optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates this inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in the training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art performance in the discrete case and competitive results in the continuous case.
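The snippet below is a brute-force illustration of the objective being optimized, not the paper's algorithm (which scales linearly in the training size and uses efficient swap updates): inducing points are greedily added from the training set so as to maximize a subset-of-regressors approximation to the log marginal likelihood.

```python
# Greedy discrete inducing-set selection under an approximate GP marginal likelihood.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=120)
noise, jitter = 0.01, 1e-8

def rbf(A, B, ell=1.0):
    return np.exp(-0.5 * ((A[:, None, 0] - B[None, :, 0]) / ell) ** 2)

def approx_log_marginal(inducing_idx):
    Knm = rbf(X, X[inducing_idx])
    Kmm = rbf(X[inducing_idx], X[inducing_idx]) + jitter * np.eye(len(inducing_idx))
    Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)               # subset-of-regressors covariance
    return multivariate_normal(mean=np.zeros(len(X)), cov=Qnn + noise * np.eye(len(X)),
                               allow_singular=True).logpdf(y)

inducing = []
for _ in range(8):                                        # grow the inducing set one point at a time
    candidates = [i for i in range(len(X)) if i not in inducing]
    best = max(candidates, key=lambda i: approx_log_marginal(inducing + [i]))
    inducing.append(best)
print("selected inducing points:", np.sort(X[inducing, 0]).round(2))
```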
@inproceedings{NIPS2013_46922a08,author={Cao, Yanshuai and Brubaker, Marcus A and Fleet, David J and Hertzmann, Aaron},booktitle={Advances in Neural Information Processing Systems},editor={Burges, C. J. C. and Bottou, L. and Welling, M. and Ghahramani, Z. and Weinberger, K. Q.},pages={1097--1105},publisher={Curran Associates, Inc.},title={Efficient Optimization for Sparse Gaussian Process Regression},volume={26},year={2013},}