publications | Yanshuai Cao 曹颜帅

2025

Preprint

Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization

Yuqiao Wen, Yanshuai Cao, and Lili Mou

2025

Large language models have been increasing in size due to their success in a wide range of applications. This calls for a pressing need to reduce memory usage to make them more accessible. Post-training quantization is a popular technique which uses fewer bits (e.g., 4–8 bits) to represent the model without retraining it. However, it remains a challenging task to perform quantization in an ultra-low-bit setup (e.g., 2 bits). In this paper, we propose InvarExplore, a unified framework that systematically explores different model invariance at the same time, allowing us to take advantage of the synergy between each type of invariance. Importantly, InvarExplore features a discrete search algorithm that enables us to explore permutation invariance, which is under-studied as it cannot be optimized with gradient-based methods. Results show that InvarExplore is compatible with existing state-of-the-art methods, achieving an add-on performance improvement over strong competing methods.
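As a minimal sketch of the permutation invariance mentioned in the abstract above: reordering the hidden units of a small MLP (rows of the first weight matrix together with the matching columns of the second) leaves the output unchanged, even though the stored matrices differ. The toy network, its sizes, and the helper names are assumptions for illustration only; this is not the InvarExplore search procedure.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def mlp(w1, w2, x):
    """Two-layer MLP without biases: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))

def permute_hidden(w1, w2, perm):
    """Reorder hidden units: rows of w1 and the matching columns of w2."""
    w1_p = [w1[j] for j in perm]
    w2_p = [[row[j] for j in perm] for row in w2]
    return w1_p, w2_p

# Hypothetical toy sizes and random weights.
rng = random.Random(0)
d_in, d_hidden, d_out = 3, 4, 2
w1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
w2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

perm = [2, 0, 3, 1]
w1_p, w2_p = permute_hidden(w1, w2, perm)
y = mlp(w1, w2, x)        # original network
y_p = mlp(w1_p, w2_p, x)  # permuted network, same function
```

Because a permutation is a discrete choice, there is no gradient with respect to `perm`, which is consistent with the abstract's point that this invariance calls for a discrete search.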
AAAI
EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine Translation

Yuqiao Wen, Behzad Shayegh, Chenyang Huang, Yanshuai Cao, and Lili Mou

The 39th Annual AAAI Conference on Artificial Intelligence, 2025

The ability of zero-shot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but they are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations as well as existing ensemble techniques. Further, we can distill the ensemble’s knowledge back to the multilingual model to improve inference efficiency; profoundly, our EBBS-based distillation does not sacrifice, or even improves, the translation quality.
@article{wen2024ebbsensemblebilevelbeam,
  title = {EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine Translation},
  author = {Wen, Yuqiao and Shayegh, Behzad and Huang, Chenyang and Cao, Yanshuai and Mou, Lili},
  year = {2025},
  journal = {The 39th Annual AAAI Conference on Artificial Intelligence},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2403.00144},
}
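The "soft voting" step described in the EBBS abstract can be pictured with a small sketch. It assumes, purely for illustration, that each ensemble component reports a probability for every candidate in its own beam, that votes are summed across components, and that a candidate missing from a component's beam receives 0 from it. The function name `soft_vote` and the toy beams are hypothetical; the paper's actual scoring rule may differ.

```python
from collections import defaultdict

def soft_vote(component_beams, beam_size):
    """Sum each candidate's probability across components; keep the top few.

    A candidate missing from a component's beam gets 0 from that component.
    """
    votes = defaultdict(float)
    for beam in component_beams:
        for candidate, prob in beam.items():
            votes[candidate] += prob
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    return [candidate for candidate, _ in ranked[:beam_size]]

# Hypothetical lower-level beams from three ensemble components.
beams = [
    {"the cat": 0.5, "a cat": 0.3},
    {"the cat": 0.4, "the dog": 0.45},
    {"a cat": 0.35, "the cat": 0.3},
]
shared = soft_vote(beams, beam_size=2)
```

In this toy input, "the dog" is the top choice of one component but is dropped from the shared beam because the other components give it no support.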

2024

Preprint
NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Yongchang Hao, Yanshuai Cao, and Lili Mou

arXiv preprint arXiv:2410.20650, 2024

The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
@article{hao2024neuzip,
  title = {NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks},
  author = {Hao, Yongchang and Cao, Yanshuai and Mou, Lili},
  journal = {arXiv preprint arXiv:2410.20650},
  year = {2024},
}
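To make "the entropy of floating-point numbers" in the NeuZip abstract more concrete, the sketch below measures the empirical Shannon entropy of the 8-bit exponent field of float32 values. The weights are synthetic Gaussian samples with an assumed standard deviation of 0.02, not weights from any real model, and this is not the NeuZip implementation.

```python
import math
import random
import struct
from collections import Counter

def float32_exponent(x):
    """Return the 8-bit biased exponent field of x stored as IEEE 754 float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical weights: small zero-mean Gaussian values.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10000)]
exponent_entropy = entropy_bits([float32_exponent(w) for w in weights])
```

If `exponent_entropy` falls below the 8 bits the field occupies, an entropy coder could in principle store the exponents in less space without losing information.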
NeurIPS
Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models

Sadegh Mahdavi, Raquel Aoki, Keyi Tang, and Yanshuai Cao

In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

Large Language Models (LLMs) have shown remarkable performance in various natural language tasks, but they often struggle with planning problems that require structured reasoning. To address this limitation, the conversion of planning problems into the Planning Domain Definition Language (PDDL) has been proposed as a potential solution, enabling the use of automated planners. However, generating accurate PDDL files typically demands human inputs or correction, which can be time-consuming and costly. In this paper, we propose a novel approach that leverages LLMs and environment feedback to automatically generate PDDL domain and problem description files without the need for human intervention. Our method introduces an iterative refinement process that generates multiple problem PDDL candidates and progressively refines the domain PDDL based on feedback obtained from interacting with the environment. To guide the refinement process, we develop an Exploration Walk (EW) metric, which provides rich feedback signals for LLMs to update the PDDL file. We evaluate our approach on PDDL environments. We achieve an average task solve rate of 66% compared to a 29% solve rate by GPT-4’s intrinsic planning with chain-of-thought prompting. Our work enables the automated modeling of planning environments using LLMs and environment feedback, eliminating the need for human intervention in the PDDL translation process and paving the way for more reliable LLM agents in challenging problems. Our code is available at https://github.com/BorealisAI/llm-pddl-planning
@inproceedings{mahdavi2024leveragingenvironmentinteractionautomated,
  title = {Leveraging Environment Interaction for Automated PDDL Translation and Planning with Large Language Models},
  author = {Mahdavi, Sadegh and Aoki, Raquel and Tang, Keyi and Cao, Yanshuai},
  booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year = {2024},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2407.12979},
}
NeurIPS
Do LLMs Build World Representations? Probing Through the Lens of State Abstraction

Zichao Li, Yanshuai Cao, and Jackie CK Cheung

In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

How do large language models (LLMs) encode the state of the world, including the status of entities and their relations, as described by a text? While existing work directly probes for a complete state of the world, our research explores whether and how LLMs abstract this world state in their internal representations. We propose a new framework for probing for world representations through the lens of state abstraction theory from reinforcement learning, which emphasizes different levels of abstraction, distinguishing between general abstractions that facilitate predicting future states and goal-oriented abstractions that guide the subsequent actions to accomplish tasks. To instantiate this framework, we design a text-based planning task, where an LLM acts as an agent in an environment and interacts with objects in containers to achieve a specified goal state. Our experiments reveal that fine-tuning as well as advanced pre-training strengthens LLM-built representations’ tendency of maintaining goal-oriented abstractions during decoding, prioritizing task completion over recovery of the world’s state and dynamics.
@inproceedings{lillms,
  title = {Do LLMs Build World Representations? Probing Through the Lens of State Abstraction},
  author = {Li, Zichao and Cao, Yanshuai and Cheung, Jackie CK},
  booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year = {2024},
}
ACL
Jump Starting Bandits with LLM-Generated Prior Knowledge

Parand A. Alamdari, Yanshuai Cao, and Kevin H. Wilson

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
@inproceedings{alamdari-etal-2024-jump,
  title = {Jump Starting Bandits with {LLM}-Generated Prior Knowledge},
  author = {Alamdari, Parand A. and Cao, Yanshuai and Wilson, Kevin H.},
  editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  month = nov,
  year = {2024},
  address = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2024.emnlp-main.1107},
  doi = {10.18653/v1/2024.emnlp-main.1107},
  pages = {19821--19833},
}
ICML
Flora: Low-Rank Adapters Are Secretly Gradient Compressors

Yongchang Hao, Yanshuai Cao, and Lili Mou

In Forty-first International Conference on Machine Learning, 2024

Despite large neural networks demonstrating remarkable abilities to complete different tasks, they require excessive memory usage to store the optimization states for training. To alleviate this, the low-rank adaptation (LoRA) is proposed to reduce the optimization states by training fewer parameters. However, LoRA restricts overall weight update matrices to be low-rank, limiting the model performance. In this work, we investigate the dynamics of LoRA and identify that it can be approximated by a random projection. Based on this observation, we propose Flora, which is able to achieve high-rank updates by resampling the projection matrices while enjoying the sublinear space complexity of optimization states. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach.
@inproceedings{hao2024flora,
  title = {Flora: Low-Rank Adapters Are Secretly Gradient Compressors},
  author = {Hao, Yongchang and Cao, Yanshuai and Mou, Lili},
  booktitle = {Forty-first International Conference on Machine Learning},
  url = {https://arxiv.org/abs/2402.03293},
  year = {2024},
}

Preprint

Ginger: An Efficient Curvature Approximation with Linear Complexity for General Neural Networks

Yongchang Hao, Yanshuai Cao, and Lili Mou

arXiv preprint arXiv:2402.03295, 2024

@article{hao2024gingerefficientcurvatureapproximation,
  title = {Ginger: An Efficient Curvature Approximation with Linear Complexity for General Neural Networks},
  author = {Hao, Yongchang and Cao, Yanshuai and Mou, Lili},
  journal = {arXiv preprint arXiv:2402.03295},
  year = {2024},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2402.03295}
}

ICLR
Ensemble distillation for unsupervised constituency parsing

Behzad Shayegh, Yanshuai Cao, Xiaodan Zhu, Jackie CK Cheung, and Lili Mou

International Conference on Learning Representations, 2024

We investigate the unsupervised constituency parsing task, which organizes words and phrases of a sentence into a hierarchical structure without using linguistically annotated data. We observe that existing unsupervised parsers capture different aspects of parsing structures, which can be leveraged to enhance unsupervised parsing performance. To this end, we propose a notion of "tree averaging," based on which we further propose a novel ensemble method for unsupervised parsing. To improve inference efficiency, we further distill the ensemble knowledge into a student model; such an ensemble-then-distill process is an effective approach to mitigate the over-smoothing problem existing in common multi-teacher distilling methods. Experiments show that our method surpasses all previous approaches, consistently demonstrating its effectiveness and robustness across various runs, with different ensemble components, and under domain-shift conditions.
@article{shayegh2023ensemble,
  title = {Ensemble distillation for unsupervised constituency parsing},
  author = {Shayegh, Behzad and Cao, Yanshuai and Zhu, Xiaodan and Cheung, Jackie CK and Mou, Lili},
  journal = {International Conference on Learning Representations},
  year = {2024}
}

Patent

System and method for improved neural network training

CAO Yanshuai, Yik Chau Lui, Weiguang Ding, and Ruitong Huang

Aug 2024

US Patent 12,056,605

@misc{yanshuai2024system,
  title = {System and method for improved neural network training},
  author = {Cao, Yanshuai and Lui, Yik Chau and Ding, Weiguang and Huang, Ruitong},
  year = {2024},
  month = aug,
  note = {US Patent 12,056,605}
}

Patent

System and method for machine learning architecture for partially-observed multimodal data

Yu Gong, Jiawei He, Thibaut Durand, Megha Nawhal, CAO Yanshuai, MORI Gregory, and Seyed Hossein Hajimirsadeghi

Jul 2024

US Patent 12,033,083

@misc{gong2024system,
  title = {System and method for machine learning architecture for partially-observed multimodal data},
  author = {Gong, Yu and He, Jiawei and Durand, Thibaut and Nawhal, Megha and Cao, Yanshuai and Mori, Gregory and Hajimirsadeghi, Seyed Hossein},
  year = {2024},
  month = jul,
  note = {US Patent 12,033,083}
}

Patent

System and method for machine learning architecture with variational autoencoder pooling

Teng Long, CAO Yanshuai, and Jackie CK Cheung

Feb 2024

US Patent 11,914,955

@misc{long2024system,
  title = {System and method for machine learning architecture with variational autoencoder pooling},
  author = {Long, Teng and Cao, Yanshuai and Cheung, Jackie CK},
  year = {2024},
  month = feb,
  note = {US Patent 11,914,955}
}

Patent

Transformer-based architecture for density ratio estimation

TANG Keyi, and CAO Yanshuai

May 2024

US Patent App. 18/491,417

@misc{keyi2024transformer,
  title = {Transformer-based architecture for density ratio estimation},
  author = {Tang, Keyi and Cao, Yanshuai},
  year = {2024},
  month = may,
  note = {US Patent App. 18/491,417}
}

2023

ICLR

An Equal-Size Hard EM Algorithm for Diverse Dialogue Generation

Yuqiao Wen, Yongchang Hao, Yanshuai Cao, and Lili Mou

International Conference on Learning Representations, May 2023

@article{wen2023equalsizehardemalgorithm,
  title = {An Equal-Size Hard EM Algorithm for Diverse Dialogue Generation},
  author = {Wen, Yuqiao and Hao, Yongchang and Cao, Yanshuai and Mou, Lili},
  year = {2023},
  journal = {International Conference on Learning Representations},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2209.14627},
}

Patent

Method and device for conducting measurements for an N-dimensional data structure

Weiguang Ding, Ruitong Huang, Luyu Wang, and CAO Yanshuai

Jan 2023

US Patent 11,551,041

@misc{ding2023method,
  title = {Method and device for conducting measurements for an N-dimensional data structure},
  author = {Ding, Weiguang and Huang, Ruitong and Wang, Luyu and Cao, Yanshuai},
  year = {2023},
  month = jan,
  note = {US Patent 11,551,041}
}

Patent

Robust pruned neural networks via adversarial training

Luyu Wang, Weiguang Ding, Ruitong Huang, CAO Yanshuai, and Yik Chau Lui

Jan 2023

US Patent 11,562,244

@misc{wang2023robust,
  title = {Robust pruned neural networks via adversarial training},
  author = {Wang, Luyu and Ding, Weiguang and Huang, Ruitong and Cao, Yanshuai and Lui, Yik Chau},
  year = {2023},
  month = jan,
  note = {US Patent 11,562,244}
}

Patent

System and method for improving deep neural network performance

CAO Yanshuai, Ruitong Huang, and Junfeng Wen

Sep 2023

US Patent 11,755,916

@misc{yanshuai2023system,
  title = {System and method for improving deep neural network performance},
  author = {Cao, Yanshuai and Huang, Ruitong and Wen, Junfeng},
  year = {2023},
  month = sep,
  note = {US Patent 11,755,916}
}

Patent

System and method for machine learning with long-range dependency

CAO Yanshuai, and Peng Xu

Sep 2023

US Patent 11,763,129

@misc{yanshuai2023systen,
  title = {System and method for machine learning with long-range dependency},
  author = {Cao, Yanshuai and Xu, Peng},
  year = {2023},
  month = sep,
  note = {US Patent 11,763,129}
}

Patent

System and method for controllable machine text generation architecture

Peng Xu, CAO Yanshuai, and Jackie CK Cheung

Sep 2023

US Patent 11,763,100

@misc{xu2023system,
  title = {System and method for controllable machine text generation architecture},
  author = {Xu, Peng and Cao, Yanshuai and Cheung, Jackie CK},
  year = {2023},
  month = sep,
  note = {US Patent 11,763,100}
}

Patent

System and method for machine learning architecture with variational hyper-RNN

DENG Ruizhi, CAO Yanshuai, Bo Chang, and Marcus Brubaker

Mar 2023

US Patent 11,615,305

@misc{ruizhi2023system,
  title = {System and method for machine learning architecture with variational hyper-RNN},
  author = {Deng, Ruizhi and Cao, Yanshuai and Chang, Bo and Brubaker, Marcus},
  year = {2023},
  month = mar,
  note = {US Patent 11,615,305}
}

2022

Patent

System and method for cross-domain transferable neural coherence model

CAO Yanshuai, Hamidreza SAGHIR, Jin Sung KANG, Teng Long, Jackie CK CHEUNG, and others

Mar 2022

US Patent 11,270,072

@misc{yanshuai2022system,
  title = {System and method for cross-domain transferable neural coherence model},
  author = {Cao, Yanshuai and Saghir, Hamidreza and Kang, Jin Sung and Long, Teng and Cheung, Jackie CK and others},
  year = {2022},
  month = mar,
  note = {US Patent 11,270,072}
}

Patent

System and method for transferable natural language interface

CAO Yanshuai, Peng Xu, TANG Keyi, Wei Yang, ZI Wenjie, Teng Long, Jackie Chi Kit Cheung, Chenyang Huang, MOU Lili, Hamidreza Shahidi, and others

Apr 2022

US Patent App. 17/508,914

@misc{yanshuai2022systen,
  title = {System and method for transferable natural language interface},
  author = {Cao, Yanshuai and Xu, Peng and Tang, Keyi and Yang, Wei and Zi, Wenjie and Long, Teng and Cheung, Jackie Chi Kit and Huang, Chenyang and Mou, Lili and Shahidi, Hamidreza and others},
  year = {2022},
  month = apr,
  note = {US Patent App. 17/508,914}
}

2021

Preprint
Hierarchical Neural Data Synthesis for Semantic Parsing

Wei Yang, Peng Xu, and Yanshuai Cao

arXiv preprint arXiv:2112.02212, 2021

Semantic parsing datasets are expensive to collect. Moreover, even the questions pertinent to a given domain, which are the input of a semantic parsing system, might not be readily available, especially in cross-domain semantic parsing. This makes data augmentation even more challenging. Existing methods to synthesize new data use hand-crafted or induced rules, requiring substantial engineering effort and linguistic expertise to achieve good coverage and precision, which limits the scalability. In this work, we propose a purely neural approach of data augmentation for semantic parsing that completely removes the need for grammar engineering while achieving higher semantic parsing accuracy. Furthermore, our method can synthesize in the zero-shot setting, where only a new domain schema is available without any input-output examples of the new domain. On the Spider cross-domain text-to-SQL semantic parsing benchmark, we achieve the state-of-the-art performance on the development set (77.2% accuracy) using our zero-shot augmentation.
@article{yang2021hierarchicalneuraldatasynthesis,
  title = {Hierarchical Neural Data Synthesis for Semantic Parsing},
  author = {Yang, Wei and Xu, Peng and Cao, Yanshuai},
  journal = {arXiv preprint arXiv:2112.02212},
  year = {2021},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2112.02212},
}
ACL
Code Generation from Natural Language with Less Prior Knowledge and More Monolingual Data

Sajad Norouzi, Keyi Tang, and Yanshuai Cao

In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Aug 2021

Training datasets for semantic parsing are typically small due to the higher expertise required for annotation than most other NLP tasks. As a result, models for this application usually need additional prior knowledge to be built into the architecture or algorithm. The increased dependency on human experts hinders automation and raises the development and maintenance costs in practice. This work investigates whether a generic transformer-based seq2seq model can achieve competitive performance with minimal code-generation-specific inductive bias design. By exploiting a relatively sizeable monolingual corpus of the target programming language, which is cheap to mine from the web, we achieved 81.03% exact match accuracy on Django and 32.57 BLEU score on CoNaLa. Both are SOTA to the best of our knowledge. This positive evidence highlights a potentially easier path toward building accurate semantic parsers in practice.
@inproceedings{norouzi-etal-2021-code,
  title = {Code Generation from Natural Language with Less Prior Knowledge and More Monolingual Data},
  author = {Norouzi, Sajad and Tang, Keyi and Cao, Yanshuai},
  booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
  month = aug,
  year = {2021},
  address = {Online},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2021.acl-short.98},
  doi = {10.18653/v1/2021.acl-short.98},
  pages = {776--785},
}
ACL
Optimizing Deeper Transformers on Small Datasets

Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon J.D. Prince, and Yanshuai Cao

In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Aug 2021

It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain the state of the art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
@inproceedings{xu-etal-2021-optimizing,
  title = {Optimizing Deeper Transformers on Small Datasets},
  author = {Xu, Peng and Kumar, Dhruv and Yang, Wei and Zi, Wenjie and Tang, Keyi and Huang, Chenyang and Cheung, Jackie Chi Kit and Prince, Simon J.D. and Cao, Yanshuai},
  booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
  month = aug,
  year = {2021},
  address = {Online},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2021.acl-long.163},
  doi = {10.18653/v1/2021.acl-long.163},
  pages = {2089--2102},
}
ACL-Demo
TURING: an Accurate and Interpretable Multi-Hypothesis Cross-Domain Natural Language Database Interface

Peng Xu, Wenjie Zi, Hamidreza Shahidi, Ákos Kádár, Keyi Tang, Wei Yang, Jawad Ateeq, Harsh Barot, Meidan Alon, and Yanshuai Cao

In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Aug 2021

A natural language database interface (NLDB) can democratize data-driven insights for non-technical users. However, existing Text-to-SQL semantic parsers cannot achieve high enough accuracy in the cross-database setting to allow good usability in practice. This work presents TURING, a NLDB system toward bridging this gap. The cross-domain semantic parser of TURING with our novel value prediction method achieves 75.1% execution accuracy, and 78.3% top-5 beam execution accuracy on the Spider validation set (Yu et al., 2018b). To benefit from the higher beam accuracy, we design an interactive system where the SQL hypotheses in the beam are explained step-by-step in natural language, with their differences highlighted. The user can then compare and judge the hypotheses to select which one reflects their intention if any. The English explanations of SQL queries in TURING are produced by our high-precision natural language generation system based on synchronous grammars.
@inproceedings{xu-etal-2021-turing,
  title = {{TURING}: an Accurate and Interpretable Multi-Hypothesis Cross-Domain Natural Language Database Interface},
  author = {Xu, Peng and Zi, Wenjie and Shahidi, Hamidreza and K{\'a}d{\'a}r, {\'A}kos and Tang, Keyi and Yang, Wei and Ateeq, Jawad and Barot, Harsh and Alon, Meidan and Cao, Yanshuai},
  booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  month = aug,
  year = {2021},
  address = {Online},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2021.acl-demo.36},
  doi = {10.18653/v1/2021.acl-demo.36},
  pages = {298--305},
}
Workshop
A Globally Normalized Neural Model for Semantic Parsing

Chenyang Huang, Wei Yang, Yanshuai Cao, Osmar Zaïane, and Lili Mou

In Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021), Aug 2021

In this paper, we propose a globally normalized model for context-free grammar (CFG)-based semantic parsing. Instead of predicting a probability, our model predicts a real-valued score at each step and does not suffer from the label bias problem. Experiments show that our approach outperforms locally normalized models on small datasets, but it does not yield improvement on a large dataset.
@inproceedings{huang-etal-2021-globally,
  title = {A Globally Normalized Neural Model for Semantic Parsing},
  author = {Huang, Chenyang and Yang, Wei and Cao, Yanshuai and Za{\"\i}ane, Osmar and Mou, Lili},
  booktitle = {Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021)},
  month = aug,
  year = {2021},
  address = {Online},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2021.spnlp-1.7},
  doi = {10.18653/v1/2021.spnlp-1.7},
  pages = {61--66},
}
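The contrast between local and global normalization in the abstract above can be illustrated with a toy example. The two-step, two-label setup and the score table `SCORES` are made up for illustration and are not taken from the paper. Local normalization applies a softmax at every step, so the large score gap after prefix `(1,)` is squashed into a probability near 1. Global normalization keeps the raw scores and normalizes once over whole sequences.

```python
import math
from itertools import product

# Hypothetical real-valued scores for the next label (0 or 1) given the prefix.
SCORES = {(): [1.0, 0.0], (0,): [0.0, 0.0], (1,): [3.0, -3.0]}
SEQUENCES = list(product([0, 1], repeat=2))

def total_score(seq):
    """Sum of unnormalized step scores along the sequence."""
    return sum(SCORES[seq[:t]][y] for t, y in enumerate(seq))

def local_prob(seq):
    """Locally normalized: softmax at each step, then multiply."""
    p = 1.0
    for t, y in enumerate(seq):
        step = SCORES[seq[:t]]
        p *= math.exp(step[y]) / sum(math.exp(s) for s in step)
    return p

def global_prob(seq):
    """Globally normalized: one softmax over all complete sequences."""
    z = sum(math.exp(total_score(s)) for s in SEQUENCES)
    return math.exp(total_score(seq)) / z

best_local = max(SEQUENCES, key=local_prob)
best_global = max(SEQUENCES, key=global_prob)
```

With these scores the locally normalized model prefers a sequence starting with label 0, while the globally normalized model prefers `(1, 0)`, the sequence with the highest total score. This is a small instance of the label bias problem the abstract refers to.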

Patent

Method and device for generative adversarial network training

BOSE Avishek, and CAO Yanshuai

Jul 2021

US Patent 11,062,179

@misc{avishek2021method,
  title = {Method and device for generative adversarial network training},
  author = {Bose, Avishek and Cao, Yanshuai},
  year = {2021},
  month = jul,
  note = {US Patent 11,062,179}
}

Patent

System, methods, and devices for visual construction of operations for data querying

CAO Yanshuai, and Luyu Wang

Aug 2021

US Patent 11,080,292

@misc{yanshuai2021system,
  title = {System, methods, and devices for visual construction of operations for data querying},
  author = {Cao, Yanshuai and Wang, Luyu},
  year = {2021},
  month = aug,
  note = {US Patent 11,080,292}
}

Patent

System and method for testing machine learning

Yik Chau Lui, and CAO Yanshuai

Oct 2021

US Patent App. 17/227,086

2020

AISTATS
Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer

Yanshuai Cao* and Peng Xu*

In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 26–28 Aug 2020

In this work, we develop a novel regularizer to improve the learning of long-range dependency of sequence data. Applied on language modelling, our regularizer expresses the inductive bias that sequence variables should have high mutual information even though the model might not see abundant observations for complex long-range dependency. We show how the "next sentence prediction (classification)" heuristic can be derived in a principled way from our mutual information estimation framework, and be further extended to maximize the mutual information of sequence variables. The proposed approach not only is effective at increasing the mutual information of segments under the learned model but more importantly, leads to a higher likelihood on holdout data, and improved generation quality. Code is released at https://github.com/BorealisAI/BMI.
@inproceedings{pmlr-v108-cao20a,
  title = {Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer},
  author = {Cao, Yanshuai and Xu, Peng},
  booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics},
  pages = {3991--4001},
  year = {2020},
  editor = {Chiappa, Silvia and Calandra, Roberto},
  volume = {108},
  series = {Proceedings of Machine Learning Research},
  address = {Online},
  month = {26--28 Aug},
  publisher = {PMLR},
}
ICML
Evaluating Lossy Compression Rates of Deep Generative Models

Sicong Huang*, Alireza Makhzani*, Yanshuai Cao, and Roger Grosse

In Proceedings of the 37th International Conference on Machine Learning, 13–18 Jul 2020

The field of deep generative modeling has succeeded in producing astonishingly realistic-seeming images and audio, but quantitative evaluation remains a challenge. Log-likelihood is an appealing metric due to its grounding in statistics and information theory, but it can be challenging to estimate for implicit generative models, and scalar-valued metrics give an incomplete picture of a model’s quality. In this work, we propose to use rate distortion (RD) curves to evaluate and compare deep generative models. While estimating RD curves is seemingly even more computationally demanding than log-likelihood estimation, we show that we can approximate the entire RD curve using nearly the same computations as were previously used to achieve a single log-likelihood estimate. We evaluate lossy compression rates of VAEs, GANs, and adversarial autoencoders (AAEs) on the MNIST and CIFAR10 datasets. Measuring the entire RD curve gives a more complete picture than scalar-valued metrics, and we arrive at a number of insights not obtainable from log-likelihoods alone.
@inproceedings{pmlr-v119-huang20c,
  title = {Evaluating Lossy Compression Rates of Deep Generative Models},
  author = {Huang, Sicong and Makhzani, Alireza and Cao, Yanshuai and Grosse, Roger},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {4444--4454},
  year = {2020},
  editor = {Daum{\'e} III, Hal and Singh, Aarti},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  address = {Virtual},
  month = {13--18 Jul},
  publisher = {PMLR},
}
ICML
On Variational Learning of Controllable Representations for Text without Supervision

Peng Xu, Jackie Chi Kit Cheung, and Yanshuai Cao

In Proceedings of the 37th International Conference on Machine Learning, 13–18 Jul 2020

Abs arXiv Bib HTML PDF Code

The variational autoencoder (VAE) can learn the manifold of natural images on certain datasets, as evidenced by meaningful interpolation or extrapolation in the continuous latent space. However, on discrete data such as text, it is unclear if unsupervised learning can discover a similar latent space that allows controllable manipulation. In this work, we find that sequence VAEs trained on text fail to properly decode when the latent codes are manipulated, because the modified codes often land in holes or vacant regions in the aggregated posterior latent space, where the decoding network fails to generalize. Both as a validation of the explanation and as a fix to the problem, we propose to constrain the posterior mean to a learned probability simplex, and perform manipulation within this simplex. Our proposed method mitigates the latent vacancy problem and achieves the first success in unsupervised learning of controllable representations for text. Empirically, our method outperforms unsupervised baselines and strong supervised approaches on text style transfer, and is capable of performing more flexible fine-grained control over text generation than existing methods.
@inproceedings{pmlr-v119-xu20a, title = {On Variational Learning of Controllable Representations for Text without Supervision}, author = {Xu, Peng and Cheung, Jackie Chi Kit and Cao, Yanshuai}, booktitle = {Proceedings of the 37th International Conference on Machine Learning}, pages = {10534--10543}, year = {2020}, editor = {Daumé III, Hal and Singh, Aarti}, volume = {119}, series = {Proceedings of Machine Learning Research}, address = {Virtual}, month = {13--18 Jul}, publisher = {PMLR}, url = {http://proceedings.mlr.press/v119/xu20a.html}, }
Preprint
Variational Hyper RNN for Sequence Modeling

Ruizhi Deng, Yanshuai Cao, Bo Chang, Leonid Sigal, Greg Mori, and Marcus A Brubaker

arXiv preprint arXiv:2002.10501, 2020

Abs arXiv Bib

In this work, we propose a novel probabilistic sequence model that excels at capturing high variability in time series data, both across sequences and within an individual sequence. Our method uses temporal latent variables to capture information about the underlying data pattern and dynamically decodes the latent information into modifications of weights of the base decoder and recurrent model. The efficacy of the proposed method is demonstrated on a range of synthetic and real-world sequential data that exhibit large scale variations, regime shifts, and complex dynamics.
@article{deng2020variational, title = {Variational Hyper RNN for Sequence Modeling}, author = {Deng, Ruizhi and Cao, Yanshuai and Chang, Bo and Sigal, Leonid and Mori, Greg and Brubaker, Marcus A}, journal = {arXiv preprint arXiv:2002.10501}, year = {2020}, }

Patent

System and method for adaptive data visualization

Luyu Wang and Yanshuai Cao

Aug 2020

US Patent 10,739,955

Bib

Patent

Systems and methods for cyberbot network detection

Ashkan Amiri, Bryce Croll, Cory Fong, Athinthra Krishnaswamy Sethurajan, Vikash Yadav, Sylvester King Chun Chiang, Zhengyi Qin, Cathal Smyth, Yik Chau Lui, Yanshuai Cao, and others

Oct 2020

US Patent 10,819,724

Bib

@misc{amiri2020systems,
  title = {Systems and methods for cyberbot network detection},
  author = {Amiri, Ashkan and Croll, Bryce and Fong, Cory and Sethurajan, Athinthra Krishnaswamy and Yadav, Vikash and Chiang, Sylvester King Chun and Qin, Zhengyi and Smyth, Cathal and Lui, Yik Chau and Cao, Yanshuai and others},
  year = {2020},
  month = oct,
  note = {US Patent 10,819,724}
}

Patent

Systems and methods for malicious code detection

Cathal Smyth, Cory Fong, Yik Chau Lui, and Yanshuai Cao

Jun 2020

US Patent 10,685,284

Bib

@misc{smyth2020systems,
  title = {Systems and methods for malicious code detection},
  author = {Smyth, Cathal and Fong, Cory and Lui, Yik Chau and Cao, Yanshuai},
  year = {2020},
  month = jun,
  note = {US Patent 10,685,284}
}

Patent

System and method for reproducible machine learning

Weiguang Ding and Yanshuai Cao

Oct 2020

US Patent 10,802,822

Bib

@misc{ding2020system,
  title = {System and method for reproducible machine learning},
  author = {Ding, Weiguang and Cao, Yanshuai},
  year = {2020},
  month = oct,
  note = {US Patent 10,802,822}
}

2019

Preprint
Preventing Posterior Collapse in Sequence VAEs with Pooling

Teng Long, Yanshuai Cao, and Jackie Chi Kit Cheung

arXiv preprint arXiv:1911.03976, Oct 2019

Abs arXiv Bib

Variational autoencoders (VAEs) hold great potential for modelling text, as they could in theory separate high-level semantic and syntactic properties from local regularities of natural language. Practically, however, VAEs with autoregressive decoders often suffer from posterior collapse, a phenomenon where the model learns to ignore the latent variables, causing the sequence VAE to degenerate into a language model. In this paper, we argue that posterior collapse is in part caused by the lack of dispersion in encoder features. We provide empirical evidence to verify this hypothesis, and propose a straightforward fix using pooling. This simple technique effectively prevents posterior collapse, allowing the model to achieve significantly better data log-likelihood than standard sequence VAEs. Compared to existing work, our proposed method achieves comparable or superior performance while being more computationally efficient.
@article{long2019preventing, title = {Preventing Posterior Collapse in Sequence VAEs with Pooling}, author = {Long, Teng and Cao, Yanshuai and Cheung, Jackie Chi Kit}, journal = {arXiv preprint arXiv:1911.03976}, year = {2019}, }
ACL
A Cross-Domain Transferable Neural Coherence Model

Peng Xu, Hamidreza Saghir, Jin Sung Kang, Teng Long, Avishek Joey Bose, Yanshuai Cao, and Jackie Chi Kit Cheung

In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Jul 2019

Abs DOI arXiv Bib Code

Coherence is an important aspect of text quality and is crucial for ensuring its readability. One important limitation of existing coherence models is that training on one domain does not easily generalize to unseen categories of text. Previous work advocates for generative models for cross-domain generalization, because for discriminative models, the space of incoherent sentence orderings to discriminate against during training is prohibitively large. In this work, we propose a local discriminative neural model with a much smaller negative sampling space that can efficiently learn against incorrect orderings. The proposed coherence model is simple in structure, yet it significantly outperforms previous state-of-the-art methods on a standard benchmark dataset on the Wall Street Journal corpus, as well as in multiple new challenging settings of transfer to unseen categories of discourse on Wikipedia articles.
@inproceedings{xu-etal-2019-cross, title = {A Cross-Domain Transferable Neural Coherence Model}, author = {Xu, Peng and Saghir, Hamidreza and Kang, Jin Sung and Long, Teng and Bose, Avishek Joey and Cao, Yanshuai and Cheung, Jackie Chi Kit}, booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, month = jul, year = {2019}, address = {Florence, Italy}, publisher = {Association for Computational Linguistics}, url = {https://www.aclweb.org/anthology/P19-1067}, doi = {10.18653/v1/P19-1067}, pages = {678--687}, }

2018

Preprint
Few-shot Self-Reminder to Overcome Catastrophic Forgetting

Junfeng Wen, Yanshuai Cao, and Ruitong Huang

NeurIPS 2018 Workshop on Continual Learning, 2018

Abs arXiv Bib

Deep neural networks are known to suffer from catastrophic forgetting: they tend to forget knowledge from previous tasks when sequentially learning new tasks. This failure hinders the application of deep-learning-based vision systems in continual learning settings. In this work, we present a simple yet surprisingly effective way of preventing catastrophic forgetting. Our method, called Few-shot Self Reminder (FSR), regularizes the neural net against changing its learned behaviour by performing logit matching on selected samples kept in episodic memory from the old tasks. Surprisingly, this simple approach requires retraining on only a small amount of data to outperform previous methods in knowledge retention. We demonstrate the superiority of our method over previous ones in two different continual learning settings on popular benchmarks, as well as in a new continual learning problem where tasks are designed to be more dissimilar.
@article{wen2018few, title = {Few-shot Self-Reminder to Overcome Catastrophic Forgetting}, author = {Wen, Junfeng and Cao, Yanshuai and Huang, Ruitong}, journal = {NeurIPS 2018 Workshop on Continual Learning}, year = {2018}, }
Workshop
Compositional Hard Negatives for Visual Semantic Embeddings via an Adversary

A. Bose, Huan Ling, and Yanshuai Cao

NeurIPS 2018 Workshop on ViGIL, 2018

Abs Bib HTML

Learning high-quality representations for multi-modal data with a shared underlying meaning is a key building block for cross-modal information retrieval. Further, hard negative mining has been shown to be effective in forcing models to learn discriminative features for accurate retrieval. In this paper, we present a new technique for hard negative mining for learning visual-semantic embeddings, with an adversary that is learned in a min-max game with the cross-modal embedding model. The adversary exploits the compositionality of images and texts and is able to compose harder negatives through novel combinations of objects and regions across different images for a given caption. We find that our approach leads to higher scores across the board for all R@K-based metrics over the previous state of the art.
@article{bose2018compositional, title = {Compositional Hard Negatives for Visual Semantic Embeddings via an Adversary}, author = {Bose, A. and Ling, Huan and Cao, Yanshuai}, year = {2018}, journal = {NeurIPS 2018 Workshop on ViGIL}, }
ICLR
Improving GAN Training via Binarized Representation Entropy (BRE) Regularization

Yanshuai Cao, Gavin Weiguang Ding, Kry Yik-Chau Lui, and Ruitong Huang

International Conference on Learning Representations, 2018

Abs arXiv Bib HTML Code

We propose a novel regularizer to improve the training of Generative Adversarial Networks (GANs). The motivation is that when the discriminator D spreads out its model capacity in the right way, the learning signals given to the generator G are more informative and diverse. These in turn help G to explore better and discover the real data manifold while avoiding large unstable jumps due to the erroneous extrapolation made by D. Our regularizer guides the rectifier discriminator D to better allocate its model capacity, by encouraging the binary activation patterns on selected internal layers of D to have a high joint entropy. Experimental results on both synthetic data and real datasets demonstrate improvements in stability and convergence speed of the GAN training, as well as higher sample quality. The approach also leads to higher classification accuracies in semi-supervised learning.
@inproceedings{Cao2018Improving, title = {Improving GAN Training via Binarized Representation Entropy (BRE) Regularization}, author = {Cao, Yanshuai and Ding, Gavin Weiguang and Lui, Kry Yik-Chau and Huang, Ruitong}, booktitle = {International Conference on Learning Representations}, year = {2018}, }
Preprint
Adversarial Robustness of Pruned Neural Networks

Luyu Wang, Gavin Weiguang Ding, Ruitong Huang, Yanshuai Cao, and Yik Chau Lui

Jul 2018

Abs Bib HTML

Deep neural network pruning forms a compressed network by discarding "unimportant" weights or filters. Standard evaluation metrics show that pruned networks achieve remarkable speedup and prediction accuracy at test time, but their adversarial robustness remains unexplored even though it is an important security feature in deployment. We study the robustness of pruned neural networks under adversarial attacks. We discover that although pruned models maintain the original accuracy, they are more vulnerable to such attacks. We further show that adversarial training improves the robustness of pruned networks. However, we observe trade-offs among compression rate, accuracy and robustness in adversarially trained pruned neural networks. Our analysis suggests that we should pay additional attention to robustness in neural network pruning rather than just maintaining the classification accuracy.
@article{wang2018adversarial, title = {Adversarial Robustness of Pruned Neural Networks}, author = {Wang, Luyu and Ding, Gavin Weiguang and Huang, Ruitong and Cao, Yanshuai and Lui, Yik Chau}, year = {2018}, }
ACL
Adversarial Contrastive Estimation

Avishek Joey Bose*, Huan Ling*, and Yanshuai Cao*

In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2018

Abs DOI arXiv Bib HTML

Learning by contrasting positive and negative samples is a general strategy adopted by many methods. Noise contrastive estimation (NCE) for word embeddings and translating embeddings for knowledge graphs are examples in NLP employing this approach. In this work, we view contrastive learning as an abstraction of all such methods and augment the negative sampler into a mixture distribution containing an adversarially learned sampler. The resulting adaptive sampler finds harder negative examples, which forces the main model to learn a better representation of the data. We evaluate our proposal on learning word embeddings, order embeddings and knowledge graph embeddings and observe both faster convergence and improved results on multiple metrics.
@inproceedings{bose-etal-2018-adversarial, title = {Adversarial Contrastive Estimation}, author = {Bose, Avishek Joey and Ling, Huan and Cao, Yanshuai}, booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = jul, year = {2018}, address = {Melbourne, Australia}, publisher = {Association for Computational Linguistics}, url = {https://www.aclweb.org/anthology/P18-1094}, doi = {10.18653/v1/P18-1094}, pages = {1021--1032}, }
PhD Thesis
Scaling Gaussian Processes

Yanshuai Cao

Jul 2018

Abs Bib HTML

We explore ways to scale Gaussian processes (GP) to large datasets. Two methods with different theoretical and practical motivations are proposed. The first method solves the open problem of efficient discrete inducing set selection in the context of inducing point based approximation to full GPs. When inducing points need to be chosen from the training set, the proposed method is the only principled approach to date for joint tuning of the inducing set and GP hyperparameters while scaling linearly in the training set size during learning. Empirically it achieves a trade-off between speed and accuracy that is comparable to other state-of-the-art inducing point GP methods. The second method is a novel framework for building flexible probabilistic prediction models based on GPs that is simple to parallelize and highly scalable. Referred to as transductive fusion, this second approach learns separate GP experts whose predictions are combined in ways that depend on test point locations. A number of new models are proposed in this new framework. Learning and inference in these new models are straightforwardly parallel, and predictive accuracy is shown to be satisfactory empirically.
@phdthesis{cao_yanshuai_gp_thesis, author = {Cao, Yanshuai}, title = {Scaling Gaussian Processes}, school = {University of Toronto}, year = {2018}, }

2017

Workshop

Implicit Manifold Learning on Generative Adversarial Networks

Kry Yik Chau Lui, Yanshuai Cao, Maxime Gazeau, and Kelvin Shuangjian Zhang

ICML 2017 Workshop on Implicit Models, 2017

arXiv Bib

@article{lui2017implicit,
  title = {Implicit Manifold Learning on Generative Adversarial Networks},
  author = {Lui, Kry Yik Chau and Cao, Yanshuai and Gazeau, Maxime and Zhang, Kelvin Shuangjian},
  year = {2017},
  journal = {ICML 2017 Workshop on Implicit Models}
}

Workshop
Automatic Selection of t-SNE Perplexity

Yanshuai Cao and Luyu Wang

ICML 2017 Workshop on AutoML, 2017

Abs arXiv Bib

t-Distributed Stochastic Neighbor Embedding (t-SNE) is one of the most widely used dimensionality reduction methods for data visualization, but it has a perplexity hyperparameter that requires manual selection. In practice, proper tuning of t-SNE perplexity requires users to understand the inner workings of the method as well as to have hands-on experience. We propose a model selection objective for t-SNE perplexity that requires negligible extra computation beyond that of t-SNE itself. We empirically validate that the perplexity settings found by our approach are consistent with preferences elicited from human experts across a number of datasets. The similarities of our approach to Bayesian information criteria (BIC) and minimum description length (MDL) are also analyzed.
@article{CaoAST, title = {Automatic Selection of t-SNE Perplexity}, author = {Cao, Yanshuai and Wang, Luyu}, year = {2017}, journal = {ICML 2017 Workshop on AutoML}, url = {http://arxiv.org/abs/1708.03229}, }

2016

ICLR
Adversarial Manipulation of Deep Representations

Sara Sabour*, Yanshuai Cao*, Fartash Faghri, and David J. Fleet

4th International Conference on Learning Representations, 2016

Abs arXiv Bib Code

We show that the image representations in a deep neural network (DNN) can be manipulated to mimic those of other natural images, with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image perturbations designed to produce erroneous class labels. Here we instead concentrate on the internal layers of DNN representations, to produce a new class of adversarial images that differs qualitatively from others. While the adversary is perceptually similar to one image, its internal representation appears remarkably similar to a different image, from a different class and bearing little if any apparent similarity to the input. Further, they appear generic and consistent with the space of natural images. This phenomenon demonstrates the possibility to trick a DNN to confound almost any image with any other chosen image, and raises questions about DNN representations, as well as the properties of natural images themselves.
@inproceedings{sabour15, title = {Adversarial Manipulation of Deep Representations}, author = {Sabour, Sara and Cao, Yanshuai and Faghri, Fartash and Fleet, David J.}, booktitle = {4th International Conference on Learning Representations}, year = {2016}, url = {http://arxiv.org/abs/1511.05122}, }

2015

TPAMI
Efficient Optimization for Sparse Gaussian Process Regression

Yanshuai Cao, Marcus A Brubaker, David J Fleet, and Aaron Hertzmann

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015

Abs DOI Bib HTML Code

We propose an efficient optimization algorithm to select a subset of training data as the inducing set for sparse Gaussian process regression. Previous methods either use different objective functions for inducing set and hyperparameter selection, or else optimize the inducing set by gradient-based continuous optimization. The former approaches are harder to interpret and suboptimal, whereas the latter cannot be applied to discrete input domains or to kernel functions that are not differentiable with respect to the input. The algorithm proposed in this work estimates an inducing set and the hyperparameters using a single objective. It can be used to optimize either the marginal likelihood or a variational free energy. Space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art performance in discrete cases, and competitive prediction results as well as a favorable trade-off between training and test time in continuous cases.
@article{tpami15_caoy, author = {Cao, Yanshuai and Brubaker, Marcus A and Fleet, David J and Hertzmann, Aaron}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, title = {Efficient Optimization for Sparse Gaussian Process Regression}, year = {2015}, volume = {37}, number = {12}, pages = {2415--2427}, doi = {10.1109/TPAMI.2015.2424873}, }
Workshop
Transductive Log Opinion Pool of Gaussian Process Experts

Yanshuai Cao and David J Fleet

NIPS 2015 Workshop on Nonparametric Methods for Large Scale Representation Learning, 2015

Abs arXiv Bib

We introduce a framework for analyzing transductive combination of Gaussian process (GP) experts, where independently trained GP experts are combined in a way that depends on test point location, in order to scale GPs to big data. The framework provides some theoretical justification for the generalized product of GP experts (gPoE-GP) which was previously shown to work well in practice [2, 3] but lacks theoretical basis. Based on the proposed framework, an improvement over gPoE-GP is introduced and empirically validated.
@article{cao2015transductive, title = {Transductive Log Opinion Pool of Gaussian Process Experts}, author = {Cao, Yanshuai and Fleet, David J}, journal = {NIPS 2015 Workshop on Nonparametric Methods for Large Scale Representation Learning}, year = {2015}, }

2014

Workshop
Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions

Yanshuai Cao and David J Fleet

Modern Nonparametrics 3: Automating the Learning Pipeline Workshop at NIPS, 2014

Abs arXiv Bib

In this work, we propose a generalized product of experts (gPoE) framework for combining the predictions of multiple probabilistic models. We identify four desirable properties that are important for scalability, expressiveness and robustness, when learning and inferring with a combination of multiple models. Through analysis and experiments, we show that gPoE of Gaussian processes (GP) has these qualities, while no other existing combination scheme satisfies all of them at the same time. The resulting GP-gPoE is highly scalable, as individual GP experts can be independently learned in parallel; very expressive, as the way experts are combined depends on the input rather than being fixed; still a valid probabilistic model with a natural interpretation; and robust to unreliable predictions from individual experts.
@article{cao2014generalized, title = {Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions}, author = {Cao, Yanshuai and Fleet, David J}, journal = {Modern Nonparametrics 3: Automating the Learning Pipeline Workshop at NIPS}, year = {2014}, }

2013

NeurIPS
Efficient Optimization for Sparse Gaussian Process Regression

Yanshuai Cao, Marcus A Brubaker, David J Fleet, and Aaron Hertzmann

In Advances in Neural Information Processing Systems, 2013

Abs arXiv Bib HTML Supp Code

We propose an efficient discrete optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates this inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in the training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art performance in the discrete case and competitive results in the continuous case.
@inproceedings{NIPS2013_46922a08, author = {Cao, Yanshuai and Brubaker, Marcus A and Fleet, David J and Hertzmann, Aaron}, booktitle = {Advances in Neural Information Processing Systems}, editor = {Burges, C. J. C. and Bottou, L. and Welling, M. and Ghahramani, Z. and Weinberger, K. Q.}, pages = {1097--1105}, publisher = {Curran Associates, Inc.}, title = {Efficient Optimization for Sparse Gaussian Process Regression}, volume = {26}, year = {2013}, }