I am a research scientist at Fundamental AI Research (FAIR), Meta. I am passionate about building general-purpose intelligent systems that process information at scale and assist humans with a wide range of knowledge-intensive tasks.
Please refer to my CV for a comprehensive overview of my experience.
Project Highlights
Large-scale causal language models have demonstrated impressive few-shot learning capabilities. These models have been primarily built for English and a handful of other high-resource languages. Given there are over 7,000 languages in the world, developing a separate language model for each of them is expensive and neglects the positive transfer between related languages. We address this problem by training multilingual language models (XGLMs) on a mixture of diverse languages, where a significant presence of the lower-resourced languages is achieved via up-sampling. Our largest model, with 7.5 billion parameters, enables few-shot learning in 20+ languages on text completion and language inference tasks. It also demonstrates strong cross-lingual transfer and sets a new state of the art in few-shot machine translation in the lower-resourced regime. [ArXiv'21]
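To make the up-sampling step concrete, here is a minimal sketch of temperature-based sampling over per-language token counts, a standard recipe for boosting lower-resourced languages in a multilingual training mixture; the counts and the exponent value below are illustrative assumptions, not the actual XGLM configuration.

# Hypothetical per-language token counts (illustrative only).
token_counts = {"en": 1_000_000_000, "sw": 10_000_000, "ur": 5_000_000}

def sampling_probs(counts, alpha=0.3):
    """Exponentiate the empirical language distribution by alpha < 1 and
    renormalize; smaller alpha up-samples rare languages more strongly."""
    total = sum(counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in counts.items()}
    z = sum(scaled.values())
    return {lang: p / z for lang, p in scaled.items()}

# English still dominates the mixture, but Swahili and Urdu get a far
# larger share than their raw token counts alone would give them.
print(sampling_probs(token_counts))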
Tellina is an end-user scripting assistant that can be queried via natural language. It translates a natural language sentence typed by the user into a short, executable script. The underlying models are neural encoder-decoders trained on NL-script pairs collected by programming experts from online tutorials and question-answering forums. We instantiate the prototype in Bash.
This work poses several challenges, including scalable data collection, never-ending learning, and personalization, most of which are central to all practical semantic parsing systems.
[LREC'18, UW-CSE-TR'17]
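For illustration, a sketch of the query interface such a system exposes. Tellina's original models were custom RNN encoder-decoders; this sketch instead assumes a modern seq2seq API, and the checkpoint name is hypothetical.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "nl2bash-seq2seq" is a hypothetical checkpoint name used for illustration.
tokenizer = AutoTokenizer.from_pretrained("nl2bash-seq2seq")
model = AutoModelForSeq2SeqLM.from_pretrained("nl2bash-seq2seq")

def nl_to_bash(query: str, k: int = 3) -> list[str]:
    """Translate an English request into the top-k candidate Bash commands."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=k, num_return_sequences=k)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(nl_to_bash("find all .txt files modified in the last 7 days"))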
2024
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts. Xi Victoria Lin*, Akshat Shrivastava*, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan*
ArXiv 2024.
PDF
Abstract
Bibtex
alphaXiv
@misc{lin2024momaefficientearlyfusionpretraining,
title={MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts},
author={Xi Victoria Lin and Akshat Shrivastava and Liang Luo and Srinivasan Iyer and Mike Lewis and Gargi Ghosh and Luke Zettlemoyer and Armen Aghajanyan},
year={2024},
eprint={2407.21770},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2407.21770},
}
NEST: Nearest Neighbor Speculative Decoding for LLM Generation and Attribution.
Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Scott Wen-tau Yih, Xi Victoria Lin. NeurIPS 2024.
PDF
Abstract
Bibtex
@misc{li2024nearestneighborspeculativedecoding,
title={Nearest Neighbor Speculative Decoding for LLM Generation and Attribution},
author={Minghan Li and Xilun Chen and Ari Holtzman and Beidi Chen and Jimmy Lin and Wen-tau Yih and Xi Victoria Lin},
year={2024},
eprint={2405.19325},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.19325},
}
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.
Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Shang-Wen Daniel Li, Scott Wen-tau Yih, Jason Weston, Xian Li
COLM 2024.
PDF
Abstract
Bibtex
@misc{sukhbaatar2024branchtrainmixmixingexpertllms,
title={Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM},
author={Sainbayar Sukhbaatar and Olga Golovneva and Vasu Sharma and Hu Xu and Xi Victoria Lin and Baptiste Rozière and Jacob Kahn and Daniel Li and Wen-tau Yih and Jason Weston and Xian Li},
year={2024},
eprint={2403.07816},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2403.07816},
}
Instruction-tuned Language Models are Better Knowledge Learners.
Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Scott Wen-tau Yih, Srinivasan Iyer
ACL 2024.
PDF
Abstract
Bibtex
Code
@inproceedings{jiang-etal-2024-instruction,
title = "Instruction-tuned Language Models are Better Knowledge Learners",
author = "Jiang, Zhengbao and
Sun, Zhiqing and
Shi, Weijia and
Rodriguez, Pedro and
Zhou, Chunting and
Neubig, Graham and
Lin, Xi and
Yih, Wen-tau and
Iyer, Srini",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.296",
pages = "5421--5434",
abstract = "In order for large language model (LLM)-based assistants to effectively adapt to evolving information needs, it must be possible to update their factual knowledge through continued training on new data. The standard recipe for doing so involves continued pre-training on new documents followed by instruction-tuning on question-answer (QA) pairs. However, we find that LLMs trained with this recipe struggle to answer questions, even though the perplexity of documents is minimized. We found that QA pairs are generally straightforward, while documents are more complex, weaving many factual statements together in an intricate manner. Therefore, we hypothesize that it is beneficial to expose LLMs to QA pairs before continued pre-training on documents so that the process of encoding knowledge from complex documents takes into account how this knowledge is accessed through questions. Based on this, we propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents. This contrasts with standard instruction-tuning, which learns how to extract knowledge after training on documents. Extensive experiments and ablation studies demonstrate that pre-instruction-tuning significantly enhances the ability of LLMs to absorb knowledge from new documents, outperforming standard instruction-tuning by 17.8{\%}.",
}
RA-DIT: Retrieval-Augmented Dual Instruction Tuning. Xi Victoria Lin*, Xilun Chen*, Mingda Chen*, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Scott Wen-tau Yih. ICLR 2024.
PDF
Abstract
Bibtex
Talks
@inproceedings{DBLP:conf/iclr/Lin0CSL00KSLZY24,
author = {Xi Victoria Lin and
Xilun Chen and
Mingda Chen and
Weijia Shi and
Maria Lomeli and
Richard James and
Pedro Rodriguez and
Jacob Kahn and
Gergely Szilvasy and
Mike Lewis and
Luke Zettlemoyer and
Wen{-}tau Yih},
title = {{RA-DIT:} Retrieval-Augmented Dual Instruction Tuning},
booktitle = {The Twelfth International Conference on Learning Representations,
{ICLR} 2024, Vienna, Austria, May 7-11, 2024},
publisher = {OpenReview.net},
year = {2024},
url = {https://openreview.net/forum?id=22OTbutug9},
timestamp = {Wed, 07 Aug 2024 17:11:53 +0200},
biburl = {https://dblp.org/rec/conf/iclr/Lin0CSL00KSLZY24.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
In-Context Pretraining: Language Modeling Beyond Document Boundaries.
Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Wen-tau Yih, Mike Lewis. ICLR 2024.
PDF
Abstract
Bibtex
@misc{shi2023incontext,
title={In-Context Pretraining: Language Modeling Beyond Document Boundaries},
author={Weijia Shi and Sewon Min and Maria Lomeli and Chunting Zhou and Margaret Li and Rich James and Xi Victoria Lin and Noah A. Smith and Luke Zettlemoyer and Scott Yih and Mike Lewis},
year={2023},
eprint={2310.10638},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
2023
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model. Leo Z. Liu, Tim Dettmers, Xi Victoria Lin, Veselin Stoyanov, Xian Li. EMNLP 2023.
PDF
Abstract
Bibtex
@misc{liu2023unified,
title={Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model},
author={Leo Z. Liu and Tim Dettmers and Xi Victoria Lin and Veselin Stoyanov and Xian Li},
year={2023},
eprint={2305.13999},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
LEVER: Learning to Verify Language-to-Code Generation with Execution. Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Scott Wen-tau Yih, Sida I. Wang*, Xi Victoria Lin*. ICML 2023.
PDF
Abstract
Bibtex
Dataset & Code
@inproceedings{DBLP:conf/icml/Ni0RSYWL23,
author = {Ansong Ni and
Srini Iyer and
Dragomir Radev and
Veselin Stoyanov and
Wen{-}Tau Yih and
Sida I. Wang and
Xi Victoria Lin},
editor = {Andreas Krause and
Emma Brunskill and
Kyunghyun Cho and
Barbara Engelhardt and
Sivan Sabato and
Jonathan Scarlett},
title = {{LEVER:} Learning to Verify Language-to-Code Generation with Execution},
booktitle = {International Conference on Machine Learning, {ICML} 2023, 23-29 July
2023, Honolulu, Hawaii, {USA}},
series = {Proceedings of Machine Learning Research},
volume = {202},
pages = {26106--26128},
publisher = {{PMLR}},
year = {2023},
url = {https://proceedings.mlr.press/v202/ni23b.html},
timestamp = {Mon, 28 Aug 2023 17:23:08 +0200},
biburl = {https://dblp.org/rec/conf/icml/Ni0RSYWL23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Training Trajectories of Language Models Across Scales. Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov. ACL 2023.
PDF
Abstract
Bibtex
Code
@inproceedings{DBLP:conf/acl/XiaAZLPCZS23,
author = {Mengzhou Xia and
Mikel Artetxe and
Chunting Zhou and
Xi Victoria Lin and
Ramakanth Pasunuru and
Danqi Chen and
Luke Zettlemoyer and
Veselin Stoyanov},
editor = {Anna Rogers and
Jordan L. Boyd{-}Graber and
Naoaki Okazaki},
title = {Training Trajectories of Language Models Across Scales},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), {ACL} 2023, Toronto, Canada,
July 9-14, 2023},
pages = {13711--13738},
publisher = {Association for Computational Linguistics},
year = {2023},
url = {https://doi.org/10.18653/v1/2023.acl-long.767},
doi = {10.18653/v1/2023.acl-long.767},
timestamp = {Thu, 10 Aug 2023 12:36:04 +0200},
biburl = {https://dblp.org/rec/conf/acl/XiaAZLPCZS23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Reimagining Retrieval Augmented Language Models for Answering Queries. Wang-Chiew Tan, Yuliang Li, Pedro Rodriguez, Richard James, Xi Victoria Lin, Alon Halevy, Scott Wen-tau Yih. ACL 2023 Findings.
PDF
Abstract
Bibtex
@inproceedings{DBLP:conf/acl/Tan0RJLHY23,
author = {Wang{-}Chiew Tan and
Yuliang Li and
Pedro Rodriguez and
Richard James and
Xi Victoria Lin and
Alon Y. Halevy and
Wen{-}tau Yih},
editor = {Anna Rogers and
Jordan L. Boyd{-}Graber and
Naoaki Okazaki},
title = {Reimagining Retrieval Augmented Language Models for Answering Queries},
booktitle = {Findings of the Association for Computational Linguistics: {ACL} 2023,
Toronto, Canada, July 9-14, 2023},
pages = {6131--6146},
publisher = {Association for Computational Linguistics},
year = {2023},
url = {https://doi.org/10.18653/v1/2023.findings-acl.382},
doi = {10.18653/v1/2023.findings-acl.382},
timestamp = {Thu, 17 Aug 2023 12:47:06 +0200},
biburl = {https://dblp.org/rec/conf/acl/Tan0RJLHY23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
2022
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. Srinivasan Iyer*, Xi Victoria Lin*, Ramakanth Pasunuru*, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Ves Stoyanov. Technical Report 2022.
PDF
Abstract
Bibtex
Checkpoints & Code
HuggingFace
@article{DBLP:journals/corr/abs-2212-12017,
author = {Srinivasan Iyer and
Xi Victoria Lin and
Ramakanth Pasunuru and
Todor Mihaylov and
Daniel Simig and
Ping Yu and
Kurt Shuster and
Tianlu Wang and
Qing Liu and
Punit Singh Koura and
Xian Li and
Brian O'Horo and
Gabriel Pereyra and
Jeff Wang and
Christopher Dewan and
Asli Celikyilmaz and
Luke Zettlemoyer and
Ves Stoyanov},
title = {{OPT-IML:} Scaling Language Model Instruction Meta Learning through
the Lens of Generalization},
journal = {CoRR},
volume = {abs/2212.12017},
year = {2022},
url = {https://doi.org/10.48550/arXiv.2212.12017},
doi = {10.48550/arXiv.2212.12017},
eprinttype = {arXiv},
eprint = {2212.12017},
timestamp = {Wed, 04 Jan 2023 16:01:37 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2212-12017.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
OPT: Open Pre-trained Transformer Language Models. Susan Zhang*, Stephen Roller*, Naman Goyal*, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer. Technical Report 2022.
PDF
Abstract
Bibtex
Blog
Checkpoints & Code
HuggingFace
@article{DBLP:journals/corr/abs-2205-01068,
author = {Susan Zhang and
Stephen Roller and
Naman Goyal and
Mikel Artetxe and
Moya Chen and
Shuohui Chen and
Christopher Dewan and
Mona T. Diab and
Xian Li and
Xi Victoria Lin and
Todor Mihaylov and
Myle Ott and
Sam Shleifer and
Kurt Shuster and
Daniel Simig and
Punit Singh Koura and
Anjali Sridhar and
Tianlu Wang and
Luke Zettlemoyer},
title = {{OPT:} Open Pre-trained Transformer Language Models},
journal = {CoRR},
volume = {abs/2205.01068},
year = {2022},
url = {https://doi.org/10.48550/arXiv.2205.01068},
doi = {10.48550/arXiv.2205.01068},
eprinttype = {arXiv},
eprint = {2205.01068},
timestamp = {Thu, 22 Sep 2022 19:27:06 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2205-01068.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Few-shot Learning with Multilingual Language Models. Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li*. EMNLP 2022.
PDF
Abstract
Bibtex
Checkpoints & Code
HuggingFace
@article{DBLP:journals/corr/abs-2112-10668,
author = {Xi Victoria Lin and
Todor Mihaylov and
Mikel Artetxe and
Tianlu Wang and
Shuohui Chen and
Daniel Simig and
Myle Ott and
Naman Goyal and
Shruti Bhosale and
Jingfei Du and
Ramakanth Pasunuru and
Sam Shleifer and
Punit Singh Koura and
Vishrav Chaudhary and
Brian O'Horo and
Jeff Wang and
Luke Zettlemoyer and
Zornitsa Kozareva and
Mona T. Diab and
Veselin Stoyanov and
Xian Li},
title = {Few-shot Learning with Multilingual Language Models},
journal = {CoRR},
volume = {abs/2112.10668},
year = {2021},
url = {https://arxiv.org/abs/2112.10668},
eprinttype = {arXiv},
eprint = {2112.10668},
timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Efficient Large Scale Language Modeling with Mixtures of Experts. Mikel Artetxe*, Shruti Bhosale*, Naman Goyal*, Todor Mihaylov*, Myle Ott*, Sam Shleifer*, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov. EMNLP 2022.
PDF
Abstract
Bibtex
Checkpoints & Code
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ∼4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
@article{DBLP:journals/corr/abs-2112-10684,
author = {Mikel Artetxe and
Shruti Bhosale and
Naman Goyal and
Todor Mihaylov and
Myle Ott and
Sam Shleifer and
Xi Victoria Lin and
Jingfei Du and
Srinivasan Iyer and
Ramakanth Pasunuru and
Giri Anantharaman and
Xian Li and
Shuohui Chen and
Halil Akin and
Mandeep Baines and
Louis Martin and
Xing Zhou and
Punit Singh Koura and
Brian O'Horo and
Jeff Wang and
Luke Zettlemoyer and
Mona T. Diab and
Zornitsa Kozareva and
Ves Stoyanov},
title = {Efficient Large Scale Language Modeling with Mixtures of Experts},
journal = {CoRR},
volume = {abs/2112.10684},
year = {2021},
url = {https://arxiv.org/abs/2112.10684},
eprinttype = {arXiv},
eprint = {2112.10684},
timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2112-10684.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
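A minimal sketch of the mechanism this paper studies, assuming a standard top-2 gated mixture-of-experts feed-forward layer; the dimensions, expert count, and routing details below are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                             # x: (n_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)     # routing distribution
        topv, topi = gates.topk(self.k, dim=-1)       # top-k experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize the k gates
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # dispatch tokens, combine outputs
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])

Each token activates only k of the n_experts expert networks, which is what makes parameter count grow without a proportional increase in per-token compute.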
Lifting the Curse of Multilinguality by Pre-training Modular Transformers. Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. NAACL 2022.
PDF
Abstract
Bibtex
Code
Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X-Mod) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.
@inproceedings{DBLP:conf/naacl/PfeifferGLLC0A22,
author = {Jonas Pfeiffer and
Naman Goyal and
Xi Victoria Lin and
Xian Li and
James Cross and
Sebastian Riedel and
Mikel Artetxe},
editor = {Marine Carpuat and
Marie{-}Catherine de Marneffe and
Iv{\'{a}}n Vladimir Meza Ru{\'{\i}}z},
title = {Lifting the Curse of Multilinguality by Pre-training Modular Transformers},
booktitle = {Proceedings of the 2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies,
{NAACL} 2022, Seattle, WA, United States, July 10-15, 2022},
pages = {3479--3495},
publisher = {Association for Computational Linguistics},
year = {2022},
url = {https://doi.org/10.18653/v1/2022.naacl-main.255},
doi = {10.18653/v1/2022.naacl-main.255},
timestamp = {Mon, 01 Aug 2022 16:28:01 +0200},
biburl = {https://dblp.org/rec/conf/naacl/PfeifferGLLC0A22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
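A sketch of the modular idea under the simplest possible assumptions: a shared transformer layer followed by a language-specific bottleneck module selected by language ID, so total capacity grows with the number of languages while per-language trainable parameters stay constant. Sizes and module shapes are illustrative, not the X-Mod architecture verbatim.

import torch
import torch.nn as nn

class ModularBlock(nn.Module):
    def __init__(self, d_model=768, languages=("en", "sw", "tr")):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.lang_modules = nn.ModuleDict({
            lang: nn.Sequential(nn.Linear(d_model, d_model // 2), nn.GELU(),
                                nn.Linear(d_model // 2, d_model))
            for lang in languages
        })

    def forward(self, x, lang: str):
        h = self.shared(x)                     # parameters shared by all languages
        return h + self.lang_modules[lang](h)  # language-specific residual module

block = ModularBlock()
print(block(torch.randn(2, 16, 768), lang="sw").shape)  # torch.Size([2, 16, 768])

Adding a language post-hoc then amounts to registering one new entry in lang_modules and training only its parameters.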
On Continual Model Refinement in Out-of-Distribution Data Streams. Bill Yuchen Lin, Sida Wang, Xi Victoria Lin, Robin Jia, Lin Xiao, Xiang Ren, Scott Wen-tau Yih. ACL 2022.
PDF
Abstract
Bibtex
Dataset & Code
Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams while overcoming catastrophic forgetting. However, existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario. In response to this, we propose a new CL problem formulation dubbed continual model refinement (CMR). Compared to prior CL settings, CMR is more practical and introduces unique challenges (boundary-agnostic and non-stationary distribution shift, diverse mixtures of multiple OOD data clusters, error-centric streams, etc.). We extend several existing CL approaches to the CMR setting and evaluate them extensively. For benchmarking and analysis, we propose a general sampling algorithm to obtain dynamic OOD data streams with controllable non-stationarity, as well as a suite of metrics measuring various aspects of online performance. Our experiments and detailed analysis reveal the promise and challenges of the CMR problem, supporting that studying CMR in dynamic OOD streams can benefit the longevity of deployed NLP models in production.
@inproceedings{DBLP:conf/acl/LinWLJXRY22,
author = {Bill Yuchen Lin and
Sida Wang and
Xi Victoria Lin and
Robin Jia and
Lin Xiao and
Xiang Ren and
Scott Yih},
editor = {Smaranda Muresan and
Preslav Nakov and
Aline Villavicencio},
title = {On Continual Model Refinement in Out-of-Distribution Data Streams},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), {ACL} 2022, Dublin, Ireland,
May 22-27, 2022},
pages = {3128--3139},
publisher = {Association for Computational Linguistics},
year = {2022},
url = {https://doi.org/10.18653/v1/2022.acl-long.223},
doi = {10.18653/v1/2022.acl-long.223},
timestamp = {Mon, 01 Aug 2022 16:27:42 +0200},
biburl = {https://dblp.org/rec/conf/acl/LinWLJXRY22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card Messages. Jiao Sun, Tongshuang Wu, Yue Jiang, Ronil Awalegaonkar, Xi Victoria Lin, Diyi Yang. CHI 2022.
PDF
Abstract
Bibtex
People write personalized greeting cards on various occasions. While prior work has studied gender roles in greeting card messages, systematic analysis at scale and tools for raising the awareness of gender stereotyping remain under-investigated. To this end, we collect a large greeting card message corpus covering three different occasions (birthday, Valentine's Day and wedding) from three sources (exemplars from greeting message websites, real-life greetings from social media and language model generated ones). We uncover a wide range of gender stereotypes in this corpus via topic modeling, odds ratio and Word Embedding Association Test (WEAT). We further conduct a survey to understand people's perception of gender roles in messages from this corpus and if gender stereotyping is a concern. The results show that people want to be aware of gender roles in the messages, but remain unconcerned unless the perceived gender roles conflict with the recipient's true personality. In response, we developed GreetA, an interactive visualization and writing assistant tool to visualize fine-grained topics in greeting card messages drafted by the users and the associated gender perception scores, but without suggesting text changes as an intervention.
@inproceedings{DBLP:conf/chi/SunWJALY22,
author = {Jiao Sun and
Tongshuang Wu and
Yue Jiang and
Ronil Awalegaonkar and
Xi Victoria Lin and
Diyi Yang},
editor = {Simone D. J. Barbosa and
Cliff Lampe and
Caroline Appert and
David A. Shamma and
Steven Mark Drucker and
Julie R. Williamson and
Koji Yatani},
title = {Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card
Messages},
booktitle = {{CHI} '22: {CHI} Conference on Human Factors in Computing Systems,
New Orleans, LA, USA, 29 April 2022 - 5 May 2022},
pages = {398:1--398:15},
publisher = {{ACM}},
year = {2022},
url = {https://doi.org/10.1145/3491102.3502114},
doi = {10.1145/3491102.3502114},
timestamp = {Fri, 29 Apr 2022 17:07:24 +0200},
biburl = {https://dblp.org/rec/conf/chi/SunWJALY22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
FeTaQA: Free-form Table Question Answering. Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Hailey Schoelkopf, Riley Kong, Xiangru Tang, Mutethia Mutuma, Ben Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, Dragomir Radev. TACL 2022.
PDF
Abstract
Bibtex
Code
Existing table question answering datasets contain abundant factual questions that primarily evaluate the query and schema comprehension capability of a system, but they fail to include questions that require complex reasoning and integration of information due to the constraint of the associated short-form answers. To address these issues and to demonstrate the full challenge of table question answering, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source. Unlike datasets of generative QA over text in which answers are prevalent with copies of short text spans from the source, answers in our dataset are human-generated explanations involving entities and their high-level relations. We provide two benchmark methods for the proposed task: a pipeline method based on semantic-parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.
@article{DBLP:journals/tacl/NanHMLVZKSKTMRT22,
author = {Linyong Nan and
Chiachun Hsieh and
Ziming Mao and
Xi Victoria Lin and
Neha Verma and
Rui Zhang and
Wojciech Kryscinski and
Hailey Schoelkopf and
Riley Kong and
Xiangru Tang and
Mutethia Mutuma and
Ben Rosand and
Isabel Trindade and
Renusree Bandaru and
Jacob Cunningham and
Caiming Xiong and
Dragomir R. Radev},
title = {FeTaQA: Free-form Table Question Answering},
journal = {Trans. Assoc. Comput. Linguistics},
volume = {10},
pages = {35--49},
year = {2022},
url = {https://doi.org/10.1162/tacl\_a\_00446},
doi = {10.1162/tacl\_a\_00446},
timestamp = {Thu, 22 Sep 2022 17:53:14 +0200},
biburl = {https://dblp.org/rec/journals/tacl/NanHMLVZKSKTMRT22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
2021
Testing Cross-Database Semantic Parsers With Canonical Utterances. Heather Lent, Semih Yavuz, Tao Yu, Tong Niu, Yingbo Zhou, Dragomir Radev, Xi Victoria Lin. EMNLP 2021 Workshop: Evaluation & Comparison of NLP Systems.
PDF
Abstract
Bibtex
Code
The benchmark performance of cross-database semantic parsing has climbed steadily in recent years, catalyzed by the wide adoption of pre-trained language models. Yet existing work has shown that state-of-the-art cross-database semantic parsers struggle to generalize to novel user utterances, databases and query structures. To obtain transparent details on the strengths and limitations of these models, we propose a diagnostic testing approach based on controlled synthesis of canonical natural language and SQL pairs. Inspired by CheckList, we characterize a set of essential capabilities for cross-database semantic parsing models, and detail the method for synthesizing the corresponding test data. We evaluated a variety of high-performing models using the proposed approach, and identified several non-obvious weaknesses across models (e.g. unable to correctly select many columns). Our dataset and code are released as a test suite at http://github.com/hclent/BehaviorCheckingSemPar.
@inproceedings{lent-etal-2021-testing,
title = "Testing Cross-Database Semantic Parsers With Canonical Utterances",
author = "Lent, Heather and
Yavuz, Semih and
Yu, Tao and
Niu, Tong and
Zhou, Yingbo and
Radev, Dragomir and
Lin, Xi Victoria",
booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.eval4nlp-1.8",
doi = "10.18653/v1/2021.eval4nlp-1.8",
pages = "73--83",
abstract = "The benchmark performance of cross-database semantic parsing has climbed steadily in recent years, catalyzed by the wide adoption of pre-trained language models. Yet existing work have shown that state-of-the-art cross-database semantic parsers struggle to generalize to novel user utterances, databases and query structures. To obtain transparent details on the strengths and limitation of these models, we propose a diagnostic testing approach based on controlled synthesis of canonical natural language and SQL pairs. Inspired by the CheckList, we characterize a set of essential capabilities for cross-database semantic parsing models, and detailed the method for synthesizing the corresponding test data. We evaluated a variety of high performing models using the proposed approach, and identified several non-obvious weaknesses across models (e.g. unable to correctly select many columns). Our dataset and code are released as a test suite at http://github.com/hclent/BehaviorCheckingSemPar.",
}
Learning to Synthesize Data for Semantic Parsing. Bailin Wang, Wenpeng Yin, Xi Victoria Lin and Caiming Xiong. NAACL 2021 short.
PDF
Abstract
Bibtex
Code
Synthesizing data for semantic parsing has gained increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, hindering the exploration of diverse unseen data. In this work, we propose a generative model which features a (non-neural) PCFG that models the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. Moreover, explicitly modeling compositions using PCFG leads to better exploration of unseen programs, thus generate more diverse data. We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated from our model can substantially help a semantic parser achieve better compositional and domain generalization.
@inproceedings{wang-etal-2021-learning-synthesize,
title = "Learning to Synthesize Data for Semantic Parsing",
author = "Wang, Bailin and
Yin, Wenpeng and
Lin, Xi Victoria and
Xiong, Caiming",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.naacl-main.220",
pages = "2760--2766",
abstract = "Synthesizing data for semantic parsing has gained increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, hindering the exploration of diverse unseen data. In this work, we propose a generative model which features a (non-neural) PCFG that models the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. Moreover, explicitly modeling compositions using PCFG leads to better exploration of unseen programs, thus generate more diverse data. We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated from our model can substantially help a semantic parser achieve better compositional and domain generalization.",
}
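A toy sketch of the two-stage generative model described above, under simplified assumptions: a tiny hand-written PCFG samples SQL programs, and a stub stands in for the paper's fine-tuned BART program-to-utterance translator. The grammar, probabilities, and names are invented for illustration.

import random

PCFG = {  # nonterminal -> [(expansion, probability)]
    "SQL": [("SELECT {COL} FROM {TAB}", 0.7),
            ("SELECT COUNT(*) FROM {TAB}", 0.3)],
    "COL": [("name", 0.5), ("age", 0.5)],
    "TAB": [("students", 0.5), ("teachers", 0.5)],
}

def sample(nt="SQL"):
    """Recursively expand a nonterminal according to the rule probabilities."""
    expansions, probs = zip(*PCFG[nt])
    out = random.choices(expansions, weights=probs)[0]
    for child in ("COL", "TAB"):
        if "{" + child + "}" in out:
            out = out.replace("{" + child + "}", sample(child))
    return out

def program_to_utterance(sql):  # stub for the learned translation model
    return f"a question answerable by: {sql}"

sql = sample()
print(sql, "->", program_to_utterance(sql))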
DART: Open-Domain Structured Data Record to Text Generation. Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher and Nazneen Fatema Rajani. NAACL 2021.
PDF
Abstract
Bibtex
Code
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.
@inproceedings{nan-etal-2021-dart,
title = "{DART}: Open-Domain Structured Data Record to Text Generation",
author = "Nan, Linyong and
Radev, Dragomir and
Zhang, Rui and
Rau, Amrit and
Sivaprasad, Abhinand and
Hsieh, Chiachun and
Tang, Xiangru and
Vyas, Aadit and
Verma, Neha and
Krishna, Pranav and
Liu, Yangxiaokang and
Irwanto, Nadia and
Pan, Jessica and
Rahman, Faiaz and
Zaidi, Ahmad and
Mutuma, Mutethia and
Tarabar, Yasin and
Gupta, Ankit and
Yu, Tao and
Tan, Yi Chern and
Lin, Xi Victoria and
Xiong, Caiming and
Socher, Richard and
Rajani, Nazneen Fatema",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.naacl-main.37",
pages = "432--447",
abstract = "We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.",
}
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing. Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, Caiming Xiong. ICLR 2021.
PDF
Abstract
Bibtex
HuggingFace
We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG) induced from existing text-to-SQL datasets. We pre-train our model on the synthetic data using a novel text-schema linking objective that predicts the syntactic role of a table field in the SQL for each question-SQL pair. To maintain the model's ability to represent real-world data, we also include masked language modeling (MLM) over several existing table-and-language datasets to regularize the pre-training process. On four popular fully supervised and weakly supervised table semantic parsing benchmarks, GraPPa significantly outperforms RoBERTa-large as the feature representation layers and establishes new state-of-the-art results on all of them.
@article{DBLP:journals/corr/abs-2009-13845,
author = {Tao Yu and
Chien{-}Sheng Wu and
Xi Victoria Lin and
Bailin Wang and
Yi Chern Tan and
Xinyi Yang and
Dragomir R. Radev and
Richard Socher and
Caiming Xiong},
title = {GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing},
journal = {CoRR},
volume = {abs/2009.13845},
year = {2020},
url = {https://arxiv.org/abs/2009.13845},
archivePrefix = {arXiv},
eprint = {2009.13845},
timestamp = {Wed, 12 May 2021 16:44:19 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2009-13845.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
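A toy sketch of the synchronous expansion that distinguishes an SCFG from the plain PCFG above: each nonterminal rewrites into an aligned (question, SQL) pair, so a single sampled derivation yields a synthetic question-SQL example of the kind used for GraPPa pre-training. The grammar below is invented for illustration.

import random

RULES = {  # nonterminal -> aligned (natural language, SQL) rewrites
    "S": [("show the {COL} of all {TABLE}", "SELECT {COL} FROM {TABLE}")],
    "COL": [("names", "name"), ("ages", "age")],
    "TABLE": [("students", "students"), ("teachers", "teachers")],
}

def sample():
    nl, sql = random.choice(RULES["S"])
    for nt in ("COL", "TABLE"):
        # One rule choice rewrites both sides, keeping them aligned.
        nl_term, sql_term = random.choice(RULES[nt])
        nl = nl.replace("{" + nt + "}", nl_term)
        sql = sql.replace("{" + nt + "}", sql_term)
    return nl, sql

print(sample())  # e.g. ('show the names of all students', 'SELECT name FROM students')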
NeurIPS 2020 NLC2CMD Competition: Translating Natural Language to Bash Commands. Mayank Agarwal, Tathagata Chakraborti, Quchen Fu, David Gros, Xi Victoria Lin, Jaron Maene, Kartik Talamadupula, Zhongwei Teng, Jules White. PMLR post-proceedings volume associated with the Competition Track @ NeurIPS 2020.
PDF
Abstract
Bibtex
Leaderboard
The NLC2CMD Competition hosted at NeurIPS 2020 aimed to bring the power of natural language processing to the command line. Participants were tasked with building models that can transform descriptions of command line tasks in English to their Bash syntax. This is a report on the competition with details of the task, metrics, data, attempted solutions, and lessons learned.
@article{DBLP:journals/corr/abs-2103-02523,
author = {Mayank Agarwal and
Tathagata Chakraborti and
Quchen Fu and
David Gros and
Xi Victoria Lin and
Jaron Maene and
Kartik Talamadupula and
Zhongwei Teng and
Jules White},
title = {NeurIPS 2020 {NLC2CMD} Competition: Translating Natural Language to
Bash Commands},
journal = {CoRR},
volume = {abs/2103.02523},
year = {2021},
url = {https://arxiv.org/abs/2103.02523},
archivePrefix = {arXiv},
eprint = {2103.02523},
timestamp = {Thu, 04 Mar 2021 17:00:40 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2103-02523.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
2020
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing. Xi Victoria Lin, Richard Socher, Caiming Xiong. EMNLP 2020 Findings.
PDF
Abstract
Bibtex
Slides
We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the text-DB contextualization is realized via the fine-tuned deep attention in BERT. Combined with a pointer-generator decoder with schema-consistency driven search space pruning, BRIDGE attained state-of-the-art performance on the well-studied Spider benchmark (65.5% dev, 59.2% test), despite being much simpler than most recently proposed models for this task. Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks. Our implementation is available at https://github.com/salesforce/TabularSemanticParsing.
@inproceedings{DBLP:conf/emnlp/LinSX20,
author = {Xi Victoria Lin and
Richard Socher and
Caiming Xiong},
editor = {Trevor Cohn and
Yulan He and
Yang Liu},
title = {Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic
Parsing},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: Findings, {EMNLP} 2020, Online Event, 16-20 November
2020},
pages = {4870--4888},
publisher = {Association for Computational Linguistics},
year = {2020},
url = {https://www.aclweb.org/anthology/2020.findings-emnlp.438/},
timestamp = {Thu, 12 Nov 2020 17:18:16 +0100},
biburl = {https://dblp.org/rec/conf/emnlp/LinSX20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
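A sketch of the hybrid serialization, assuming simplified tag tokens: the question and schema are flattened into one tagged sequence, and fields whose cell values string-match the question are augmented with those values. The tags, schema, and matching input below are illustrative, not the exact BRIDGE format.

def serialize(question, schema, cell_matches):
    """schema: {table: [columns]}; cell_matches: {(table, column): value}
    maps fields to cell values mentioned in the question ("anchor texts")."""
    parts = [question]
    for table, columns in schema.items():
        parts.append(f"[T] {table}")
        for col in columns:
            parts.append(f"[C] {col}")
            value = cell_matches.get((table, col))
            if value is not None:
                parts.append(f"[V] {value}")  # augment the field with the value
    return " ".join(parts)

print(serialize(
    "How many singers are from France?",
    {"singer": ["name", "country"]},
    {("singer", "country"): "France"},
))
# How many singers are from France? [T] singer [C] name [C] country [V] France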
ColloQL: Robust Cross-Domain Text-to-SQL Over Search Queries. Karthik Radhakrishnan, Arvind Srikantan, Xi Victoria Lin. EMNLP 2020 Workshop: Interactive and Executable Semantic Parsing.
PDF
Abstract
Bibtex
Code
Translating natural language utterances to executable queries is a helpful technique in making the vast amount of data stored in relational databases accessible to a wider range of non-tech-savvy end users. Prior work in this area has largely focused on textual input that is linguistically correct and semantically unambiguous. However, real-world user queries are often succinct, colloquial, and noisy, resembling the input of a search engine. In this work, we introduce data augmentation techniques and a sampling-based content-aware BERT model (ColloQL) to achieve robust text-to-SQL modeling over natural language search (NLS) questions. Due to the lack of evaluation data, we curate a new dataset of NLS questions and demonstrate the efficacy of our approach. ColloQL's superior performance extends to well-formed text, achieving 84.9% (logical) and 90.7% (execution) accuracy on the WikiSQL dataset, making it, to the best of our knowledge, the highest performing model that does not use execution guided decoding.
@article{DBLP:journals/corr/abs-2010-09927,
author = {Karthik Radhakrishnan and
Arvind Srikantan and
Xi Victoria Lin},
title = {ColloQL: Robust Cross-Domain Text-to-SQL Over Search Queries},
journal = {CoRR},
volume = {abs/2010.09927},
year = {2020},
url = {https://arxiv.org/abs/2010.09927},
eprinttype = {arXiv},
eprint = {2010.09927},
timestamp = {Mon, 26 Oct 2020 15:39:44 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2010-09927.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Photon: A Robust Cross-Domain Text-to-SQL System. Jichuan Zeng*, Xi Victoria Lin*, Caiming Xiong, Richard Socher, Michael R. Lyu, Irwin King, Steven C.H. Hoi. ACL 2020 System Demonstration.
PDF
Abstract
Bibtex
Blog
Press
Natural language interfaces to databases (NLIDB) democratize end user access to relational data. Due to fundamental differences between natural language communication and programming, it is common for end users to issue questions that are ambiguous to the system or fall outside the semantic scope of its underlying query language. We present Photon, a robust, modular, cross-domain NLIDB that can flag natural language input to which a SQL mapping cannot be immediately determined. Photon consists of a strong neural semantic parser (63.2% structure accuracy on the Spider dev benchmark), a human-in-the-loop question corrector, a SQL executor and a response generator. The question corrector is a discriminative neural sequence editor which detects confusion span(s) in the input question and suggests rephrasing until a translatable input is given by the user or a maximum number of iterations are conducted. Experiments on simulated data show that the proposed method effectively improves the robustness of text-to-SQL system against untranslatable user input. The live demo of our system is available at http://www.naturalsql.com.
@inproceedings{zeng-etal-2020-photon,
title = "{P}hoton: A Robust Cross-Domain Text-to-{SQL} System",
author = "Zeng, Jichuan and
Lin, Xi Victoria and
Xiong, Caiming and
Socher, Richard and
Lyu, Michael and
King, Irwin and
Hoi, Steven C.H.",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-demos.24",
pages = "204--214"
}
Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation. Tianlu Wang, Xi Victoria Lin, Nazneen Fatema Rajani, Bryan McCann, Vicente Ordonez and Caiming Xiong. ACL 2020.
PDF
Abstract
Bibtex
Blog
Press
Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.
@InProceedings{Wang2020:double_hard_debias,
author = {Tianlu Wang and Xi Victoria Lin and Nazneen Fatema Rajani and Bryan McCann and Vicente Ordonez and Caiming Xiong},
title = {Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation},
booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year = {2020},
address = {Seattle, Washington, USA},
publisher = {Association for Computational Linguistics}
}
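For reference, a minimal sketch of the projection step shared by Hard Debias-style methods: remove each embedding's component along an inferred gender direction. The Double-Hard variant additionally purifies frequency-related components first, which is omitted here; the vectors are toy data.

import numpy as np

def debias(embeddings, gender_direction):
    """Project embeddings onto the subspace orthogonal to the bias direction."""
    g = gender_direction / np.linalg.norm(gender_direction)
    return embeddings - np.outer(embeddings @ g, g)

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 50))   # toy word embeddings
g = rng.normal(size=50)        # toy inferred gender direction
E_debiased = debias(E, g)
print(np.allclose(E_debiased @ (g / np.linalg.norm(g)), 0))  # True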
2019
CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases. Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter Lasecki and Dragomir Radev. EMNLP 2019.
PDF
Abstract
Bibtex
Leaderboard
We present CoSQL, a corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing of unanswerable questions. When user questions are answerable by SQL, the expert describes the SQL and execution results to the user, hence maintaining a natural interaction flow. CoSQL introduces new challenges compared to existing task-oriented dialogue datasets: (1) the dialogue states are grounded in SQL, a domain-independent executable representation, instead of domain-specific slot-value pairs, and (2) because testing is done on unseen databases, success requires generalizing to new domains. CoSQL includes three tasks: SQL-grounded dialogue state tracking, response generation from query results, and user dialogue act prediction. We evaluate a set of strong baselines for each task and show that CoSQL presents significant challenges for future research.
@inproceedings{Yu2019:cosql,
author = {Tao Yu and Rui Zhang and Heyang Er and Suyi Li and Eric Xue and Bo Pang and Xi Victoria Lin and Yi Chern Tan and Tianze Shi and Zihan Li and Youxuan Jiang and Michihiro Yasunaga and Sungrok Shim and Tao Chen and Alexander Fabbri and Zifan Li and Luyao Chen and Yuwen Zhang and Shreya Dixit and Vincent Zhang and Caiming Xiong and Richard Socher and Walter Lasecki and Dragomir Radev},
title = {CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2019, Hong Kong, November 3-November 7, 2019},
year = {2019}
}
Editing-based SQL Query Generation for Cross-Domain Context-Dependent Questions. Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher and Dragomir Radev. EMNLP 2019.
PDF
Abstract
Bibtex
Code
We focus on the cross-domain context-dependent text-to-SQL generation task. Based on the observation that adjacent natural language questions are often linguistically dependent and their corresponding SQL queries tend to overlap, we utilize the interaction history by editing the previous predicted query to improve the generation quality. Our editing mechanism views SQL as sequences and reuses generation results at the token level in a simple manner. It is flexible to change individual tokens and robust to error propagation. Furthermore, to deal with complex table structures in different domains, we employ an utterance-table encoder and a table-aware decoder to incorporate the context of the user utterance and the table schema. We evaluate our approach on the SParC dataset and demonstrate the benefit of editing compared with the state-of-the-art baselines which generate SQL from scratch.
@inproceedings{Zhang2019:Editing,
author = {Rui Zhang and Tao Yu and Heyang Er and Sungrok Shim and Eric Xue and Xi Victoria Lin and Tianze Shi and Caiming Xiong and Richard Socher and Dragomir Radev},
title = {Editing-based SQL Query Generation for Cross-Domain Context-Dependent Questions},
booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2019, Hong Kong, November 3-November 7, 2019},
year = {2019}
}
SParC: Cross-Domain Semantic Parsing in Context. Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, Dragomir Radev. ACL 2019.
PDF
Abstract
Bibtex
Leaderboard
We present SParC, a dataset for cross-domain Semantic Parsing in Context. It consists of 4,298 coherent question sequences (12k+ individual questions annotated with SQL queries), obtained from controlled user interactions with 200 complex databases over 138 domains. We provide an in-depth analysis of SParC and show that it introduces new challenges compared to existing datasets. SParC (1) demonstrates complex contextual dependencies, (2) has greater semantic diversity, and (3) requires generalization to new domains due to its cross-domain nature and the unseen databases at test time. We experiment with two state-of-the-art text-to-SQL models adapted to the context-dependent, cross-domain setup. The best model obtains an exact match accuracy of 20.2% over all questions and less than 10% over all interaction sequences, indicating that the cross-domain setting and the contextual phenomena of the dataset present significant challenges for future research.
@InProceedings{Yu2019:sparc,
author = {Tao Yu and Rui Zhang and Michihiro Yasunaga and Yi Chern Tan and Xi Victoria Lin and Suyi Li and Heyang Er and Irene Li and Bo Pang and Tao Chen and Emily Ji and Shreya Dixit and David Proctor and Sungrok Shim and Jonathan Kraft and Vincent Zhang and Caiming Xiong and Richard Socher and Dragomir Radev},
title = {SParC: Cross-Domain Semantic Parsing in Context},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
year = {2019},
address = {Florence, Italy},
publisher = {Association for Computational Linguistics}
}
2018 and Before
Multi-Hop Knowledge Graph Reasoning with Reward Shaping. Xi Victoria Lin, Richard Socher and Caiming Xiong. EMNLP 2018.
PDF
Abstract
Bibtex
Talk
Slides
Multi-hop reasoning is an effective approach for query answering (QA) over incomplete knowledge graphs (KGs). The problem can be formulated in a reinforcement learning (RL) setup, where a policy-based agent sequentially extends its inference path until it reaches a target. However, in an incomplete KG environment, the agent receives low-quality rewards corrupted by false negatives in the training data, which harms generalization at test time. Furthermore, since no golden action sequence is used for training, the agent can be misled by spurious search trajectories that incidentally lead to the correct answer. We propose two modeling advances to address both issues: (1) we reduce the impact of false negative supervision by adopting a pretrained one-hop embedding model to estimate the reward of unobserved facts; (2) we counter the sensitivity to spurious paths of on-policy RL by forcing the agent to explore a diverse set of paths using randomly generated edge masks. Our approach significantly improves over existing path-based KGQA models on several benchmark datasets and is comparable or better than embedding-based models.
@inproceedings{LinRX2018:MultiHopKG,
author = {Xi Victoria Lin and Richard Socher and Caiming Xiong},
title = {Multi-Hop Knowledge Graph Reasoning with Reward Shaping},
booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2018, Brussels, Belgium, October
31-November 4, 2018},
year = {2018}
}
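A sketch of the first modeling advance, assuming a probability-like scoring function supplied by a pretrained one-hop embedding model; the scoring function and the toy facts below are hypothetical stand-ins.

def shaped_reward(query, predicted_entity, known_answers, embedding_score):
    """embedding_score(query, entity) -> score in [0, 1] from a pretrained
    one-hop embedding model (hypothetical here)."""
    if predicted_entity in known_answers:
        return 1.0                                   # observed fact: full reward
    return embedding_score(query, predicted_entity)  # soften false negatives

score = lambda query, entity: 0.42                   # stand-in scoring function
print(shaped_reward(("Einstein", "born_in"), "Ulm", {"Ulm"}, score))   # 1.0
print(shaped_reward(("Einstein", "born_in"), "Bern", {"Ulm"}, score))  # 0.42

Entities missing from the training KG thus receive a soft, estimated reward instead of a hard zero, which is what reduces the impact of false-negative supervision.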
NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer and Michael D. Ernst. LREC 2018.
PDF
Abstract
Bibtex
Slides
We present new data and semantic parsing methods for the problem of mapping English sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to easily solve otherwise repetitive tasks (such as file manipulation, search, and application-specific scripting) by simply stating their intents in English. We take a first step in this domain, by providing a large new dataset of challenging but commonly used commands paired with their English descriptions, along with the baseline methods to establish performance levels on this task.
@inproceedings{LinWZE2018:NL2Bash,
author = {Xi Victoria Lin and Chenglong Wang and Luke Zettlemoyer and Michael D. Ernst},
title = {NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System},
booktitle = {Proceedings of the Eleventh International Conference on Language Resources
and Evaluation {LREC} 2018, Miyazaki (Japan), 7-12 May, 2018.},
year = {2018}
}
Program Synthesis from Natural Language Using Recurrent Neural Networks. Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, Michael D. Ernst. University of Washington Department of Computer Science and Engineering Technical Report 2017.
PDF
Abstract
Bibtex
Tellina Tool
Even if a competent programmer knows what she wants to do and can describe it in English, it can still be difficult to write code to achieve the goal. Existing resources, such as question-and-answer websites, tabulate specific operations that someone has wanted to perform in the past, but they are not effective in generalizing to new tasks, to compound tasks that require combining previous questions, or sometimes even to variations of listed tasks.
Our goal is to make programming easier and more productive by letting programmers use their own words and concepts to express the intended operation, rather than forcing them to accommodate the machine by memorizing its grammar. We have built a system that lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language for review and approval by the programmer. Our system, Tellina, does the translation using recurrent neural networks (RNNs), a state-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements.
We evaluated Tellina in the context of shell scripting. We trained Tellina's RNNs on textual descriptions of file system operations and bash one-liners, scraped from the web. Although recovering completely correct commands is challenging, Tellina achieves top-3 accuracy of 80% for producing the correct command structure. In a controlled study, programmers who had access to Tellina outperformed those who did not, even when Tellina's predictions were not completely correct, to a statistically significant degree.
@techreport{LinWPVZE2017:TR,
author = {Xi Victoria Lin and Chenglong Wang and Deric Pang and Kevin Vu and Luke Zettlemoyer and Michael D. Ernst},
title = {Program synthesis from natural language using recurrent neural networks},
institution = {University of Washington Department of Computer Science and Engineering},
number = {UW-CSE-17-03-01},
address = {Seattle, WA, USA},
month = mar,
year = {2017}
}
Compositional Learning of Embeddings for Relation Paths in Knowledge Base and Text. Kristina Toutanova, Xi Victoria Lin, Scott Wen-tau Yih, Hoifung Poon and Chris Quirk. ACL 2016.
PDF
Abstract
Bibtex
Modeling relation paths has offered significant gains in embedding models for knowledge base (KB) completion. However, enumerating paths between two entities is very expensive, and existing approaches typically resort to approximation with a sampled subset. This problem is particularly acute when text is jointly modeled with KB relations and used to provide direct evidence for facts mentioned in it. In this paper, we propose the first exact dynamic programming algorithm which enables efficient incorporation of all relation paths of bounded length, while modeling both relation types and intermediate nodes in the compositional path representations. We conduct a theoretical analysis of the efficiency gain from the approach. Experiments on two datasets show that it addresses representational limitations in prior approaches and improves accuracy in KB completion.
@inproceedings{DBLP:conf/acl/ToutanovaLYPQ16,
author = {Kristina Toutanova and
Victoria Lin and
Wen{-}tau Yih and
Hoifung Poon and
Chris Quirk},
title = {Compositional Learning of Embeddings for Relation Paths in Knowledge
Base and Text},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2016, August 7-12, 2016, Berlin, Germany, Volume
1: Long Papers},
year = {2016},
crossref = {DBLP:conf/acl/2016-1},
url = {http://aclweb.org/anthology/P/P16/P16-1136.pdf},
timestamp = {Mon, 15 Aug 2016 20:10:51 +0200},
biburl = {http://dblp.org/rec/bib/conf/acl/ToutanovaLYPQ16},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@proceedings{DBLP:conf/acl/2016-1,
title = {Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, {ACL} 2016, August 7-12, 2016, Berlin, Germany, Volume
1: Long Papers},
publisher = {The Association for Computer Linguistics},
year = {2016},
url = {http://aclanthology.info/volumes/proceedings-of-the-54th-annual-meeting-of-the-association-for-computational-linguistics-volume-1-long-papers},
isbn = {978-1-945626-00-5},
timestamp = {Mon, 15 Aug 2016 15:53:28 +0200},
biburl = {http://dblp.org/rec/bib/conf/acl/2016-1},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
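A toy sketch of the dynamic-programming idea, assuming elementwise-product composition of relation vectors (the paper learns richer compositions that also model intermediate nodes): representations of all bounded-length paths are summed by propagating per-node aggregates along edges, instead of enumerating paths explicitly.

import numpy as np

def sum_path_representations(edges, rel_vecs, source, n_nodes, max_len, dim):
    """edges: list of (u, v, relation). Returns, per target node, the sum of
    representations of all source->node paths of length 1..max_len."""
    agg = np.zeros((n_nodes, dim))
    frontier = np.zeros((n_nodes, dim))
    frontier[source] = 1.0                       # empty path = identity element
    for _ in range(max_len):
        nxt = np.zeros((n_nodes, dim))
        for u, v, r in edges:
            nxt[v] += frontier[u] * rel_vecs[r]  # extend every path ending at u
        agg += nxt
        frontier = nxt
    return agg

rels = {"r1": np.full(4, 0.5), "r2": np.full(4, 2.0)}
print(sum_path_representations([(0, 1, "r1"), (1, 2, "r2")], rels, 0, 3, 2, 4))

Each sweep costs time proportional to the number of edges, so all paths of bounded length are incorporated without the exponential blow-up of enumeration.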
Multi-label Learning with Posterior Regularization. Xi Victoria Lin, Sameer Singh, Luheng He, Ben Taskar, and Luke Zettlemoyer. NeurIPS 2014 Workshop: Modern Machine Learning and NLP.
PDF
Abstract
Bibtex
In many multi-label learning problems, especially as the number of labels grow, it is challenging to gather completely annotated data. This work presents a new approach for multi-label learning from incomplete annotations. The main assumption is that because of label correlation, the true label matrix as well as the soft predictions of classifiers shall be approximately low rank. We introduce a posterior regularization technique which enforces soft constraints on the classifiers, regularizing them to prefer sparse and low-rank predictions. Avoiding strict low-rank constraints results in classifiers which better fit the real data. The model can be trained efficiently using EM and stochastic gradient descent. Experiments in both the image and text domains demonstrate the contributions of each modeling assumption and show that the proposed approach achieves state-of-the-art performance on a number of challenging datasets.
@InProceedings{lin14_prlr,
author = {Xi Victoria Lin and Sameer Singh and Luheng He and Ben Taskar and Luke Zettlemoyer},
title = {Multi-label Learning with Posterior Regularization},
booktitle = {NeurIPS Workshop on Modern Machine Learning and Natural Language Processing},
year = 2014,
month = dec,
address={Montreal, Quebec, CA},
url={http://homes.cs.washington.edu/~xilin/pubs/mlnlp2014.pdf}
}
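A sketch of the low-rank preference, assuming the standard proximal step for a nuclear-norm penalty (singular-value soft-thresholding) applied to the matrix of soft predictions; the EM loop that alternates this step with classifier updates is omitted.

import numpy as np

def soft_threshold_rank(scores, tau=1.0):
    """Proximal step for a nuclear-norm penalty: shrink singular values,
    projecting the score matrix toward a nearby lower-rank matrix."""
    U, s, Vt = np.linalg.svd(scores, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(1)
Y = rng.random((20, 10))                 # toy instance-by-label soft predictions
Y_low = soft_threshold_rank(Y, tau=0.8)
print(np.linalg.matrix_rank(Y), "->", np.linalg.matrix_rank(Y_low))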
I was a PhD student of the late Ben Taskar. The Taskar Center for Accessible Technology (TCAT) was launched by Anat Caspi in November 2014. I am excited about its mission. Anat's expertise and unique perspective will lead to accessible technologies that could change the lives of many.
I'm fascinated by different kinds of puzzles. At some point I tried to make a few: Sea Virus, Chocolate Crush.