Victoria Lin 林曦

Research Scientist
Meta

1 Hacker Way, Menlo Park, California

victorialin@meta.com

About Me

I am a research scientist at Fundamental AI Research (FAIR), Meta. I am passionate about building general intelligent systems that process information at scale and assist humans in various knowledge-intensive tasks.

Previously I was a senior research scientist at Salesforce Research. I obtained my PhD from the Paul G. Allen School of Computer Science & Engineering, University of Washington, advised by Luke Zettlemoyer. I was co-advised by Michael D. Ernst on code generation with neural networks.

Please refer to my CV for a comprehensive overview of my experience.

Project Highlights

Large-scale causal language models have demonstrated impressive few-shot learning capabilities. These models have been primarily built for English and a few other high-resource languages. Given that there are over 7,000 languages in the world, developing language models for each of them is expensive and neglects the positive transfer between related languages. We address this problem by training multilingual language models (XGLMs) on a mixture of diverse languages, where significant presence of the lower-resourced languages is achieved via up-sampling. Our largest model, with 7.5 billion parameters, enables few-shot learning in 20+ languages on text completion and language inference tasks. It also demonstrates strong cross-lingual transfer and sets a new state of the art in few-shot machine translation in the lower-resourced regime. [ArXiv'21]
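The up-sampling mentioned above can be illustrated with a minimal sketch. It assumes exponent-smoothed sampling, a common scheme in which a language with n tokens is drawn with probability proportional to n raised to a power alpha between 0 and 1. The function name, the value alpha = 0.3, and the token counts below are hypothetical choices for illustration, not figures taken from the XGLM work.

```python
def upsampled_probs(token_counts, alpha=0.3):
    """Return per-language sampling probabilities proportional to count**alpha.

    alpha = 1.0 reproduces the natural data distribution; smaller values
    shift probability mass toward lower-resourced languages.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    weights = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical token counts for one high-resource and one low-resource language.
counts = {"en": 1_000_000, "sw": 1_000}

natural = upsampled_probs(counts, alpha=1.0)   # proportional to raw counts
smoothed = upsampled_probs(counts, alpha=0.3)  # low-resource language up-sampled
```

With these made-up counts, the low-resource language receives a noticeably larger share of the training mixture under the smoothed distribution than under the natural one, which is the effect up-sampling is meant to achieve.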

Tellina is an end-user scripting assistant that can be queried via natural language. It translates a natural language sentence typed by the user into a short, executable script. The underlying models are neural encoder-decoders trained on NL-script pairs collected by programming experts from online tutorials and question-answering forums. We instantiate the prototype in Bash.
This work poses several challenges, including scalable data collection, never-ending learning, and personalization, most of which are central to all practical semantic parsing systems. [LREC'18, UW-CSE-TR'17]

Publications
* Equal Contribution

Please refer to my Google Scholar page for a complete list of publications.

Preprints

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts.
Xi Victoria Lin*, Akshat Shrivastava*, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan*
ArXiv 2024.

@misc{lin2024momaefficientearlyfusionpretraining,
      title={MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts}, 
      author={Xi Victoria Lin and Akshat Shrivastava and Liang Luo and Srinivasan Iyer and Mike Lewis and Gargi Ghosh and Luke Zettlemoyer and Armen Aghajanyan},
      year={2024},
      eprint={2407.21770},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.21770}, 
}

Chameleon: Mixed-Modal Early-Fusion Foundation Models.
Chameleon Team
ArXiv 2024.

@misc{chameleonteam2024chameleonmixedmodalearlyfusionfoundation,
      title={Chameleon: Mixed-Modal Early-Fusion Foundation Models}, 
      author={Chameleon Team},
      year={2024},
      eprint={2405.09818},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.09818}, 
}

2024

NEST: Nearest Neighbor Speculative Decoding for LLM Generation and Attribution.
Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Scott Wen-tau Yih, Xi Victoria Lin
NeurIPS 2024.

@misc{li2024nearestneighborspeculativedecoding,
      title={Nearest Neighbor Speculative Decoding for LLM Generation and Attribution},
      author={Minghan Li and Xilun Chen and Ari Holtzman and Beidi Chen and Jimmy Lin and Wen-tau Yih and Xi Victoria Lin},
      year={2024},
      eprint={2405.19325},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.19325},
}

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.
Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Shang-Wen Daniel Li, Scott Wen-tau Yih, Jason Weston, Xian Li
COLM 2024.

@misc{sukhbaatar2024branchtrainmixmixingexpertllms,
      title={Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM}, 
      author={Sainbayar Sukhbaatar and Olga Golovneva and Vasu Sharma and Hu Xu and Xi Victoria Lin and Baptiste Rozière and Jacob Kahn and Daniel Li and Wen-tau Yih and Jason Weston and Xian Li},
      year={2024},
      eprint={2403.07816},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2403.07816}, 
}

Instruction-tuned Language Models are Better Knowledge Learners.
Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Scott Wen-tau Yih, Srinivasan Iyer
ACL 2024.

@inproceedings{jiang-etal-2024-instruction,
    title = "Instruction-tuned Language Models are Better Knowledge Learners",
    author = "Jiang, Zhengbao  and
      Sun, Zhiqing  and
      Shi, Weijia  and
      Rodriguez, Pedro  and
      Zhou, Chunting  and
      Neubig, Graham  and
      Lin, Xi  and
      Yih, Wen-tau  and
      Iyer, Srini",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.296",
    pages = "5421--5434",
    abstract = "In order for large language model (LLM)-based assistants to effectively adapt to evolving information needs, it must be possible to update their factual knowledge through continued training on new data. The standard recipe for doing so involves continued pre-training on new documents followed by instruction-tuning on question-answer (QA) pairs. However, we find that LLMs trained with this recipe struggle to answer questions, even though the perplexity of documents is minimized. We found that QA pairs are generally straightforward, while documents are more complex, weaving many factual statements together in an intricate manner. Therefore, we hypothesize that it is beneficial to expose LLMs to QA pairs before continued pre-training on documents so that the process of encoding knowledge from complex documents takes into account how this knowledge is accessed through questions. Based on this, we propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents. This contrasts with standard instruction-tuning, which learns how to extract knowledge after training on documents. Extensive experiments and ablation studies demonstrate that pre-instruction-tuning significantly enhances the ability of LLMs to absorb knowledge from new documents, outperforming standard instruction-tuning by 17.8{\%}.",
}

RA-DIT: Retrieval-Augmented Dual Instruction Tuning.
Xi Victoria Lin*, Xilun Chen*, Mingda Chen*, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Scott Wen-tau Yih
ICLR 2024.

@inproceedings{DBLP:conf/iclr/Lin0CSL00KSLZY24,
  author       = {Xi Victoria Lin and
                  Xilun Chen and
                  Mingda Chen and
                  Weijia Shi and
                  Maria Lomeli and
                  Richard James and
                  Pedro Rodriguez and
                  Jacob Kahn and
                  Gergely Szilvasy and
                  Mike Lewis and
                  Luke Zettlemoyer and
                  Wen{-}tau Yih},
  title        = {{RA-DIT:} Retrieval-Augmented Dual Instruction Tuning},
  booktitle    = {The Twelfth International Conference on Learning Representations,
                  {ICLR} 2024, Vienna, Austria, May 7-11, 2024},
  publisher    = {OpenReview.net},
  year         = {2024},
  url          = {https://openreview.net/forum?id=22OTbutug9},
  timestamp    = {Wed, 07 Aug 2024 17:11:53 +0200},
  biburl       = {https://dblp.org/rec/conf/iclr/Lin0CSL00KSLZY24.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

In-Context Pretraining: Language Modeling Beyond Document Boundaries.
Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Wen-tau Yih, Mike Lewis
ICLR 2024.

@misc{shi2023incontext,
      title={In-Context Pretraining: Language Modeling Beyond Document Boundaries},
      author={Weijia Shi and Sewon Min and Maria Lomeli and Chunting Zhou and Margaret Li and Rich James and Xi Victoria Lin and Noah A. Smith and Luke Zettlemoyer and Scott Yih and Mike Lewis},
      year={2023},
      eprint={2310.10638},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

2023

Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model.
Leo Z. Liu, Tim Dettmers, Xi Victoria Lin, Veselin Stoyanov, Xian Li.
EMNLP 2023.

@misc{liu2023unified,
      title={Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model},
      author={Leo Z. Liu and Tim Dettmers and Xi Victoria Lin and Veselin Stoyanov and Xian Li},
      year={2023},
      eprint={2305.13999},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

LEVER: Learning to Verify Language-to-Code Generation with Execution.
Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Scott Wen-tau Yih, Sida I. Wang*, Xi Victoria Lin*.
ICML 2023.

@inproceedings{DBLP:conf/icml/Ni0RSYWL23,
  author       = {Ansong Ni and
                  Srini Iyer and
                  Dragomir Radev and
                  Veselin Stoyanov and
                  Wen{-}Tau Yih and
                  Sida I. Wang and
                  Xi Victoria Lin},
  editor       = {Andreas Krause and
                  Emma Brunskill and
                  Kyunghyun Cho and
                  Barbara Engelhardt and
                  Sivan Sabato and
                  Jonathan Scarlett},
  title        = {{LEVER:} Learning to Verify Language-to-Code Generation with Execution},
  booktitle    = {International Conference on Machine Learning, {ICML} 2023, 23-29 July
                  2023, Honolulu, Hawaii, {USA}},
  series       = {Proceedings of Machine Learning Research},
  volume       = {202},
  pages        = {26106--26128},
  publisher    = {{PMLR}},
  year         = {2023},
  url          = {https://proceedings.mlr.press/v202/ni23b.html},
  timestamp    = {Mon, 28 Aug 2023 17:23:08 +0200},
  biburl       = {https://dblp.org/rec/conf/icml/Ni0RSYWL23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Training Trajectories of Language Models Across Scales.
Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov.
ACL 2023.

@inproceedings{DBLP:conf/acl/XiaAZLPCZS23,
  author       = {Mengzhou Xia and
                  Mikel Artetxe and
                  Chunting Zhou and
                  Xi Victoria Lin and
                  Ramakanth Pasunuru and
                  Danqi Chen and
                  Luke Zettlemoyer and
                  Veselin Stoyanov},
  editor       = {Anna Rogers and
                  Jordan L. Boyd{-}Graber and
                  Naoaki Okazaki},
  title        = {Training Trajectories of Language Models Across Scales},
  booktitle    = {Proceedings of the 61st Annual Meeting of the Association for Computational
                  Linguistics (Volume 1: Long Papers), {ACL} 2023, Toronto, Canada,
                  July 9-14, 2023},
  pages        = {13711--13738},
  publisher    = {Association for Computational Linguistics},
  year         = {2023},
  url          = {https://doi.org/10.18653/v1/2023.acl-long.767},
  doi          = {10.18653/v1/2023.acl-long.767},
  timestamp    = {Thu, 10 Aug 2023 12:36:04 +0200},
  biburl       = {https://dblp.org/rec/conf/acl/XiaAZLPCZS23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Reimagining Retrieval Augmented Language Models for Answering Queries.
Wang-Chiew Tan, Yuliang Li, Pedro Rodriguez, Richard James, Xi Victoria Lin, Alon Halevy, Scott Wen-tau Yih.
ACL 2023 Findings.

@inproceedings{DBLP:conf/acl/Tan0RJLHY23,
  author       = {Wang{-}Chiew Tan and
                  Yuliang Li and
                  Pedro Rodriguez and
                  Richard James and
                  Xi Victoria Lin and
                  Alon Y. Halevy and
                  Wen{-}tau Yih},
  editor       = {Anna Rogers and
                  Jordan L. Boyd{-}Graber and
                  Naoaki Okazaki},
  title        = {Reimagining Retrieval Augmented Language Models for Answering Queries},
  booktitle    = {Findings of the Association for Computational Linguistics: {ACL} 2023,
                  Toronto, Canada, July 9-14, 2023},
  pages        = {6131--6146},
  publisher    = {Association for Computational Linguistics},
  year         = {2023},
  url          = {https://doi.org/10.18653/v1/2023.findings-acl.382},
  doi          = {10.18653/v1/2023.findings-acl.382},
  timestamp    = {Thu, 17 Aug 2023 12:47:06 +0200},
  biburl       = {https://dblp.org/rec/conf/acl/Tan0RJLHY23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

2022

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization.
Srinivasan Iyer*, Xi Victoria Lin*, Ramakanth Pasunuru*, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Ves Stoyanov
Technical Report 2022.

@article{DBLP:journals/corr/abs-2212-12017,
  author    = {Srinivasan Iyer and
               Xi Victoria Lin and
               Ramakanth Pasunuru and
               Todor Mihaylov and
               Daniel Simig and
               Ping Yu and
               Kurt Shuster and
               Tianlu Wang and
               Qing Liu and
               Punit Singh Koura and
               Xian Li and
               Brian O'Horo and
               Gabriel Pereyra and
               Jeff Wang and
               Christopher Dewan and
               Asli Celikyilmaz and
               Luke Zettlemoyer and
               Ves Stoyanov},
  title     = {{OPT-IML:} Scaling Language Model Instruction Meta Learning through
               the Lens of Generalization},
  journal   = {CoRR},
  volume    = {abs/2212.12017},
  year      = {2022},
  url       = {https://doi.org/10.48550/arXiv.2212.12017},
  doi       = {10.48550/arXiv.2212.12017},
  eprinttype = {arXiv},
  eprint    = {2212.12017},
  timestamp = {Wed, 04 Jan 2023 16:01:37 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2212-12017.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

OPT: Open Pre-trained Transformer Language Models.
Susan Zhang*, Stephen Roller*, Naman Goyal*, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer.
Technical Report 2022.

@article{DBLP:journals/corr/abs-2205-01068,
  author    = {Susan Zhang and
               Stephen Roller and
               Naman Goyal and
               Mikel Artetxe and
               Moya Chen and
               Shuohui Chen and
               Christopher Dewan and
               Mona T. Diab and
               Xian Li and
               Xi Victoria Lin and
               Todor Mihaylov and
               Myle Ott and
               Sam Shleifer and
               Kurt Shuster and
               Daniel Simig and
               Punit Singh Koura and
               Anjali Sridhar and
               Tianlu Wang and
               Luke Zettlemoyer},
  title     = {{OPT:} Open Pre-trained Transformer Language Models},
  journal   = {CoRR},
  volume    = {abs/2205.01068},
  year      = {2022},
  url       = {https://doi.org/10.48550/arXiv.2205.01068},
  doi       = {10.48550/arXiv.2205.01068},
  eprinttype = {arXiv},
  eprint    = {2205.01068},
  timestamp = {Thu, 22 Sep 2022 19:27:06 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2205-01068.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Few-shot Learning with Multilingual Language Models.
Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li*.
EMNLP 2022.

@article{DBLP:journals/corr/abs-2112-10668,
  author    = {Xi Victoria Lin and
               Todor Mihaylov and
               Mikel Artetxe and
               Tianlu Wang and
               Shuohui Chen and
               Daniel Simig and
               Myle Ott and
               Naman Goyal and
               Shruti Bhosale and
               Jingfei Du and
               Ramakanth Pasunuru and
               Sam Shleifer and
               Punit Singh Koura and
               Vishrav Chaudhary and
               Brian O'Horo and
               Jeff Wang and
               Luke Zettlemoyer and
               Zornitsa Kozareva and
               Mona T. Diab and
               Veselin Stoyanov and
               Xian Li},
  title     = {Few-shot Learning with Multilingual Language Models},
  journal   = {CoRR},
  volume    = {abs/2112.10668},
  year      = {2021},
  url       = {https://arxiv.org/abs/2112.10668},
  eprinttype = {arXiv},
  eprint    = {2112.10668},
  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Efficient Large Scale Language Modeling with Mixtures of Experts.
Mikel Artetxe*, Shruti Bhosale*, Naman Goyal*, Todor Mihaylov*, Myle Ott*, Sam Shleifer*, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov.
EMNLP 2022.

Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ∼4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.

@article{DBLP:journals/corr/abs-2112-10684,
  author    = {Mikel Artetxe and
               Shruti Bhosale and
               Naman Goyal and
               Todor Mihaylov and
               Myle Ott and
               Sam Shleifer and
               Xi Victoria Lin and
               Jingfei Du and
               Srinivasan Iyer and
               Ramakanth Pasunuru and
               Giri Anantharaman and
               Xian Li and
               Shuohui Chen and
               Halil Akin and
               Mandeep Baines and
               Louis Martin and
               Xing Zhou and
               Punit Singh Koura and
               Brian O'Horo and
               Jeff Wang and
               Luke Zettlemoyer and
               Mona T. Diab and
               Zornitsa Kozareva and
               Ves Stoyanov},
  title     = {Efficient Large Scale Language Modeling with Mixtures of Experts},
  journal   = {CoRR},
  volume    = {abs/2112.10684},
  year      = {2021},
  url       = {https://arxiv.org/abs/2112.10684},
  eprinttype = {arXiv},
  eprint    = {2112.10684},
  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-10684.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Lifting the Curse of Multilinguality by Pre-training Modular Transformers.
Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe.
NAACL 2022.

Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X-Mod) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.

@inproceedings{DBLP:conf/naacl/PfeifferGLLC0A22,
  author    = {Jonas Pfeiffer and
               Naman Goyal and
               Xi Victoria Lin and
               Xian Li and
               James Cross and
               Sebastian Riedel and
               Mikel Artetxe},
  editor    = {Marine Carpuat and
               Marie{-}Catherine de Marneffe and
               Iv{\'{a}}n Vladimir Meza Ru{\'{\i}}z},
  title     = {Lifting the Curse of Multilinguality by Pre-training Modular Transformers},
  booktitle = {Proceedings of the 2022 Conference of the North American Chapter of
               the Association for Computational Linguistics: Human Language Technologies,
               {NAACL} 2022, Seattle, WA, United States, July 10-15, 2022},
  pages     = {3479--3495},
  publisher = {Association for Computational Linguistics},
  year      = {2022},
  url       = {https://doi.org/10.18653/v1/2022.naacl-main.255},
  doi       = {10.18653/v1/2022.naacl-main.255},
  timestamp = {Mon, 01 Aug 2022 16:28:01 +0200},
  biburl    = {https://dblp.org/rec/conf/naacl/PfeifferGLLC0A22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

On Continual Model Refinement in Out-of-Distribution Data Streams.
Bill Yuchen Lin, Sida Wang, Xi Victoria Lin, Robin Jia, Lin Xiao, Xiang Ren, Scott Wen-tau Yih.
ACL 2022.

Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams while overcoming catastrophic forgetting. However, existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario. In response to this, we propose a new CL problem formulation dubbed continual model refinement (CMR). Compared to prior CL settings, CMR is more practical and introduces unique challenges (boundary-agnostic and non-stationary distribution shift, diverse mixtures of multiple OOD data clusters, error-centric streams, etc.). We extend several existing CL approaches to the CMR setting and evaluate them extensively. For benchmarking and analysis, we propose a general sampling algorithm to obtain dynamic OOD data streams with controllable non-stationarity, as well as a suite of metrics measuring various aspects of online performance. Our experiments and detailed analysis reveal the promise and challenges of the CMR problem, supporting that studying CMR in dynamic OOD streams can benefit the longevity of deployed NLP models in production.

@inproceedings{DBLP:conf/acl/LinWLJXRY22,
  author    = {Bill Yuchen Lin and
               Sida Wang and
               Xi Victoria Lin and
               Robin Jia and
               Lin Xiao and
               Xiang Ren and
               Scott Yih},
  editor    = {Smaranda Muresan and
               Preslav Nakov and
               Aline Villavicencio},
  title     = {On Continual Model Refinement in Out-of-Distribution Data Streams},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational
               Linguistics (Volume 1: Long Papers), {ACL} 2022, Dublin, Ireland,
               May 22-27, 2022},
  pages     = {3128--3139},
  publisher = {Association for Computational Linguistics},
  year      = {2022},
  url       = {https://doi.org/10.18653/v1/2022.acl-long.223},
  doi       = {10.18653/v1/2022.acl-long.223},
  timestamp = {Mon, 01 Aug 2022 16:27:42 +0200},
  biburl    = {https://dblp.org/rec/conf/acl/LinWLJXRY22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card Messages.
Jiao Sun, Tongshuang Wu, Yue Jiang, Ronil Awalegaonkar, Xi Victoria Lin, Diyi Yang.
CHI 2022.

People write personalized greeting cards on various occasions. While prior work has studied gender roles in greeting card messages, systematic analysis at scale and tools for raising the awareness of gender stereotyping remain under-investigated. To this end, we collect a large greeting card message corpus covering three different occasions (birthday, Valentine's Day and wedding) from three sources (exemplars from greeting message websites, real-life greetings from social media and language model generated ones). We uncover a wide range of gender stereotypes in this corpus via topic modeling, odds ratio and Word Embedding Association Test (WEAT). We further conduct a survey to understand people's perception of gender roles in messages from this corpus and if gender stereotyping is a concern. The results show that people want to be aware of gender roles in the messages, but remain unconcerned unless the perceived gender roles conflict with the recipient's true personality. In response, we developed GreetA, an interactive visualization and writing assistant tool to visualize fine-grained topics in greeting card messages drafted by the users and the associated gender perception scores, but without suggesting text changes as an intervention.

@inproceedings{DBLP:conf/chi/SunWJALY22,
  author    = {Jiao Sun and
               Tongshuang Wu and
               Yue Jiang and
               Ronil Awalegaonkar and
               Xi Victoria Lin and
               Diyi Yang},
  editor    = {Simone D. J. Barbosa and
               Cliff Lampe and
               Caroline Appert and
               David A. Shamma and
               Steven Mark Drucker and
               Julie R. Williamson and
               Koji Yatani},
  title     = {Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card
               Messages},
  booktitle = {{CHI} '22: {CHI} Conference on Human Factors in Computing Systems,
               New Orleans, LA, USA, 29 April 2022 - 5 May 2022},
  pages     = {398:1--398:15},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3491102.3502114},
  doi       = {10.1145/3491102.3502114},
  timestamp = {Fri, 29 Apr 2022 17:07:24 +0200},
  biburl    = {https://dblp.org/rec/conf/chi/SunWJALY22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

FeTaQA: Free-form Table Question Answering.
Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński, Nick Schoelkopf, Riley Kong, Xiangru Tang, Murori Mutuma, Ben Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, Dragomir Radev.
TACL 2022.

Existing table question answering datasets contain abundant factual questions that primarily evaluate the query and schema comprehension capability of a system, but they fail to include questions that require complex reasoning and integration of information due to the constraint of the associated short-form answers. To address these issues and to demonstrate the full challenge of table question answering, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source. Unlike datasets of generative QA over text in which answers are prevalent with copies of short text spans from the source, answers in our dataset are human-generated explanations involving entities and their high-level relations. We provide two benchmark methods for the proposed task: a pipeline method based on semantic-parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.

@article{DBLP:journals/tacl/NanHMLVZKSKTMRT22,
  author    = {Linyong Nan and
               Chiachun Hsieh and
               Ziming Mao and
               Xi Victoria Lin and
               Neha Verma and
               Rui Zhang and
               Wojciech Kryscinski and
               Hailey Schoelkopf and
               Riley Kong and
               Xiangru Tang and
               Mutethia Mutuma and
               Ben Rosand and
               Isabel Trindade and
               Renusree Bandaru and
               Jacob Cunningham and
               Caiming Xiong and
               Dragomir R. Radev},
  title     = {FeTaQA: Free-form Table Question Answering},
  journal   = {Trans. Assoc. Comput. Linguistics},
  volume    = {10},
  pages     = {35--49},
  year      = {2022},
  url       = {https://doi.org/10.1162/tacl\_a\_00446},
  doi       = {10.1162/tacl\_a\_00446},
  timestamp = {Thu, 22 Sep 2022 17:53:14 +0200},
  biburl    = {https://dblp.org/rec/journals/tacl/NanHMLVZKSKTMRT22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

2021

Testing Cross-Database Semantic Parsers With Canonical Utterances.
Heather Lent, Semih Yavuz, Tao Yu, Tong Niu, Yingbo Zhou, Dragomir Radev, Xi Victoria Lin.
EMNLP 2021 Workshop: Evaluation & Comparison of NLP Systems.

The benchmark performance of cross-database semantic parsing has climbed steadily in recent years, catalyzed by the wide adoption of pre-trained language models. Yet existing work have shown that state-of-the-art cross-database semantic parsers struggle to generalize to novel user utterances, databases and query structures. To obtain transparent details on the strengths and limitation of these models, we propose a diagnostic testing approach based on controlled synthesis of canonical natural language and SQL pairs. Inspired by the CheckList, we characterize a set of essential capabilities for cross-database semantic parsing models, and detailed the method for synthesizing the corresponding test data. We evaluated a variety of high performing models using the proposed approach, and identified several non-obvious weaknesses across models (e.g. unable to correctly select many columns). Our dataset and code are released as a test suite at http://github.com/hclent/BehaviorCheckingSemPar.

@inproceedings{lent-etal-2021-testing,
    title = "Testing Cross-Database Semantic Parsers With Canonical Utterances",
    author = "Lent, Heather  and
      Yavuz, Semih  and
      Yu, Tao  and
      Niu, Tong  and
      Zhou, Yingbo  and
      Radev, Dragomir  and
      Lin, Xi Victoria",
    booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eval4nlp-1.8",
    doi = "10.18653/v1/2021.eval4nlp-1.8",
    pages = "73--83",
    abstract = "The benchmark performance of cross-database semantic parsing has climbed steadily in recent years, catalyzed by the wide adoption of pre-trained language models. Yet existing work have shown that state-of-the-art cross-database semantic parsers struggle to generalize to novel user utterances, databases and query structures. To obtain transparent details on the strengths and limitation of these models, we propose a diagnostic testing approach based on controlled synthesis of canonical natural language and SQL pairs. Inspired by the CheckList, we characterize a set of essential capabilities for cross-database semantic parsing models, and detailed the method for synthesizing the corresponding test data. We evaluated a variety of high performing models using the proposed approach, and identified several non-obvious weaknesses across models (e.g. unable to correctly select many columns). Our dataset and code are released as a test suite at http://github.com/hclent/BehaviorCheckingSemPar.",
}

Learning to Synthesize Data for Semantic Parsing.
Bailin Wang, Wenpeng Yin, Xi Victoria Lin and Caiming Xiong.
NAACL 2021 short.
PDF Abstract Bibtex

Synthesizing data for semantic parsing has gained increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, hindering the exploration of diverse unseen data. In this work, we propose a generative model which features a (non-neural) PCFG that models the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. Moreover, explicitly modeling compositions using PCFG leads to better exploration of unseen programs, thus generating more diverse data. We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated from our model can substantially help a semantic parser achieve better compositional and domain generalization.
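
The PCFG half of this idea can be illustrated with a toy sampler. This is a minimal sketch, not the model from the paper: the grammar, its rule probabilities, and the names `GRAMMAR` and `sample` are all made up for illustration (in the paper the PCFG is learned from existing data, and a BART-based model then maps each sampled program to an utterance, which is omitted here).

```python
import random

# Toy PCFG over a tiny SQL-like language (hypothetical rules and weights).
# Each nonterminal maps to a list of (probability, expansion) pairs.
GRAMMAR = {
    "QUERY": [
        (0.6, ["SELECT", "COL", "FROM", "TABLE"]),
        (0.4, ["SELECT", "COL", "FROM", "TABLE", "WHERE", "COND"]),
    ],
    "COL": [(0.5, ["name"]), (0.5, ["population"])],
    "TABLE": [(1.0, ["city"])],
    "COND": [(0.5, ["population", ">", "VALUE"]), (0.5, ["name", "=", "VALUE"])],
    "VALUE": [(1.0, ["?"])],
}


def sample(symbol, rng):
    """Expand a symbol top-down, choosing one rule per nonterminal by its probability."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal token
    probs, expansions = zip(*GRAMMAR[symbol])
    chosen = rng.choices(expansions, weights=probs, k=1)[0]
    tokens = []
    for child in chosen:
        tokens.extend(sample(child, rng))
    return tokens


print(" ".join(sample("QUERY", random.Random(0))))
```

Because rules compose freely, the sampler can produce programs that never appear in the data used to estimate the rule probabilities, which is the exploration effect the abstract refers to.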

@inproceedings{wang-etal-2021-learning-synthesize,
    title = "Learning to Synthesize Data for Semantic Parsing",
    author = "Wang, Bailin  and
      Yin, Wenpeng  and
      Lin, Xi Victoria  and
      Xiong, Caiming",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.220",
    pages = "2760--2766",
    abstract = "Synthesizing data for semantic parsing has gained increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, hindering the exploration of diverse unseen data. In this work, we propose a generative model which features a (non-neural) PCFG that models the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. Moreover, explicitly modeling compositions using PCFG leads to better exploration of unseen programs, thus generate more diverse data. We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated from our model can substantially help a semantic parser achieve better compositional and domain generalization.",
}

DART: Open-Domain Structured Data Record to Text Generation.
Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher and Nazneen Fatema Rajani.
NAACL 2021.
PDF Abstract Bibtex

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.

@inproceedings{nan-etal-2021-dart,
    title = "{DART}: Open-Domain Structured Data Record to Text Generation",
    author = "Nan, Linyong  and
      Radev, Dragomir  and
      Zhang, Rui  and
      Rau, Amrit  and
      Sivaprasad, Abhinand  and
      Hsieh, Chiachun  and
      Tang, Xiangru  and
      Vyas, Aadit  and
      Verma, Neha  and
      Krishna, Pranav  and
      Liu, Yangxiaokang  and
      Irwanto, Nadia  and
      Pan, Jessica  and
      Rahman, Faiaz  and
      Zaidi, Ahmad  and
      Mutuma, Mutethia  and
      Tarabar, Yasin  and
      Gupta, Ankit  and
      Yu, Tao  and
      Tan, Yi Chern  and
      Lin, Xi Victoria  and
      Xiong, Caiming  and
      Socher, Richard  and
      Rajani, Nazneen Fatema",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.37",
    pages = "432--447",
    abstract = "We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.",
}

GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing.
Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, Caiming Xiong.
ICLR 2021.
PDF Abstract Bibtex HuggingFace

We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG) induced from existing text-to-SQL datasets. We pre-train our model on the synthetic data using a novel text-schema linking objective that predicts the syntactic role of a table field in the SQL for each question-SQL pair. To maintain the model's ability to represent real-world data, we also include masked language modeling (MLM) over several existing table-and-language datasets to regularize the pre-training process. On four popular fully supervised and weakly supervised table semantic parsing benchmarks, GraPPa significantly outperforms RoBERTa-large as the feature representation layers and establishes new state-of-the-art results on all of them.

@article{DBLP:journals/corr/abs-2009-13845,
  author    = {Tao Yu and
               Chien{-}Sheng Wu and
               Xi Victoria Lin and
               Bailin Wang and
               Yi Chern Tan and
               Xinyi Yang and
               Dragomir R. Radev and
               Richard Socher and
               Caiming Xiong},
  title     = {GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing},
  journal   = {CoRR},
  volume    = {abs/2009.13845},
  year      = {2020},
  url       = {https://arxiv.org/abs/2009.13845},
  archivePrefix = {arXiv},
  eprint    = {2009.13845},
  timestamp = {Wed, 12 May 2021 16:44:19 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2009-13845.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

NeurIPS 2020 NLC2CMD Competition: Translating Natural Language to Bash Commands.
Mayank Agarwal, Tathagata Chakraborti, Quchen Fu, David Gros, Xi Victoria Lin, Jaron Maene, Kartik Talamadupula, Zhongwei Teng, Jules White.
PMLR post proceedings volume associated to the Competition Track @ NeurIPS2020.
PDF Abstract Bibtex Leaderboard

The NLC2CMD Competition hosted at NeurIPS 2020 aimed to bring the power of natural language processing to the command line. Participants were tasked with building models that can transform descriptions of command line tasks in English to their Bash syntax. This is a report on the competition with details of the task, metrics, data, attempted solutions, and lessons learned.

@article{DBLP:journals/corr/abs-2103-02523,
  author    = {Mayank Agarwal and
               Tathagata Chakraborti and
               Quchen Fu and
               David Gros and
               Xi Victoria Lin and
               Jaron Maene and
               Kartik Talamadupula and
               Zhongwei Teng and
               Jules White},
  title     = {NeurIPS 2020 {NLC2CMD} Competition: Translating Natural Language to
               Bash Commands},
  journal   = {CoRR},
  volume    = {abs/2103.02523},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.02523},
  archivePrefix = {arXiv},
  eprint    = {2103.02523},
  timestamp = {Thu, 04 Mar 2021 17:00:40 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-02523.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

2020

Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing.
Xi Victoria Lin, Richard Socher, Caiming Xiong.
EMNLP 2020 Findings.
PDF Abstract Bibtex Code

We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the text-DB contextualization is realized via the fine-tuned deep attention in BERT. Combined with a pointer-generator decoder with schema-consistency driven search space pruning, BRIDGE attained state-of-the-art performance on the well-studied Spider benchmark (65.5% dev, 59.2% test), despite being much simpler than most recently proposed models for this task. Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks. Our implementation is available at https://github.com/salesforce/TabularSemanticParsing.
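
The tagged-sequence idea can be sketched roughly as follows. This is an illustration under assumptions, not the released implementation: the tag strings `[T]`, `[C]`, `[V]`, the function names, and the case-insensitive substring matcher are hypothetical choices made for this example.

```python
from typing import Dict, List


def matched_values(question: str, values: List[str]) -> List[str]:
    """Keep cell values that literally occur in the question (assumed matcher)."""
    lowered = question.lower()
    return [v for v in values if v.lower() in lowered]


def serialize(question: str,
              schema: Dict[str, List[str]],
              cells: Dict[str, List[str]]) -> str:
    """Flatten a question and a DB schema into one tagged sequence.

    [T] precedes a table name, [C] a column name, and [V] a cell value of
    that column that was mentioned in the question.
    """
    parts = ["[CLS]", question, "[SEP]"]
    for table, columns in schema.items():
        parts += ["[T]", table]
        for column in columns:
            parts += ["[C]", column]
            # Only fields whose values appear in the question are augmented.
            for value in matched_values(question, cells.get(f"{table}.{column}", [])):
                parts += ["[V]", value]
    parts.append("[SEP]")
    return " ".join(parts)


question = "Show names of properties that are houses"
schema = {"properties": ["name", "type"]}
cells = {"properties.type": ["house", "apartment"]}
print(serialize(question, schema, cells))
```

In this toy input only the `type` column receives a `[V]` tag, since "house" is the one stored value that occurs in the question.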

@inproceedings{DBLP:conf/emnlp/LinSX20,
  author    = {Xi Victoria Lin and
               Richard Socher and
               Caiming Xiong},
  editor    = {Trevor Cohn and
               Yulan He and
               Yang Liu},
  title     = {Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic
               Parsing},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural
               Language Processing: Findings, {EMNLP} 2020, Online Event, 16-20 November
               2020},
  pages     = {4870--4888},
  publisher = {Association for Computational Linguistics},
  year      = {2020},
  url       = {https://www.aclweb.org/anthology/2020.findings-emnlp.438/},
  timestamp = {Thu, 12 Nov 2020 17:18:16 +0100},
  biburl    = {https://dblp.org/rec/conf/emnlp/LinSX20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

ColloQL: Robust Cross-Domain Text-to-SQL Over Search Queries.
Karthik Radhakrishnan, Arvind Srikantan, Xi Victoria Lin.
EMNLP 2020 Workshop: Interactive and Executable Semantic Parsing.
PDF Abstract Bibtex

Translating natural language utterances to executable queries is a helpful technique in making the vast amount of data stored in relational databases accessible to a wider range of non-tech-savvy end users. Prior work in this area has largely focused on textual input that is linguistically correct and semantically unambiguous. However, real-world user queries are often succinct, colloquial, and noisy, resembling the input of a search engine. In this work, we introduce data augmentation techniques and a sampling-based content-aware BERT model (ColloQL) to achieve robust text-to-SQL modeling over natural language search (NLS) questions. Due to the lack of evaluation data, we curate a new dataset of NLS questions and demonstrate the efficacy of our approach. ColloQL's superior performance extends to well-formed text, achieving 84.9% (logical) and 90.7% (execution) accuracy on the WikiSQL dataset, making it, to the best of our knowledge, the highest performing model that does not use execution-guided decoding.

@article{DBLP:journals/corr/abs-2010-09927,
  author    = {Karthik Radhakrishnan and
               Arvind Srikantan and
               Xi Victoria Lin},
  title     = {ColloQL: Robust Cross-Domain Text-to-SQL Over Search Queries},
  journal   = {CoRR},
  volume    = {abs/2010.09927},
  year      = {2020},
  url       = {https://arxiv.org/abs/2010.09927},
  eprinttype = {arXiv},
  eprint    = {2010.09927},
  timestamp = {Mon, 26 Oct 2020 15:39:44 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2010-09927.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Photon: A Robust Cross-Domain Text-to-SQL System.
Jichuan Zeng*, Xi Victoria Lin*, Caiming Xiong, Richard Socher, Michael R. Lyu, Irwin King, Steven C.H. Hoi.
ACL 2020 System Demonstration.
PDF Abstract Bibtex Blog Live Demo

Natural language interfaces to databases (NLIDB) democratize end user access to relational data. Due to fundamental differences between natural language communication and programming, it is common for end users to issue questions that are ambiguous to the system or fall outside the semantic scope of its underlying query language. We present Photon, a robust, modular, cross-domain NLIDB that can flag natural language input to which a SQL mapping cannot be immediately determined. Photon consists of a strong neural semantic parser (63.2% structure accuracy on the Spider dev benchmark), a human-in-the-loop question corrector, a SQL executor and a response generator. The question corrector is a discriminative neural sequence editor which detects confusion span(s) in the input question and suggests rephrasing until a translatable input is given by the user or a maximum number of iterations is reached. Experiments on simulated data show that the proposed method effectively improves the robustness of a text-to-SQL system against untranslatable user input. The live demo of our system is available at http://www.naturalsql.com.

@inproceedings{zeng-etal-2020-photon,
    title = "{P}hoton: A Robust Cross-Domain Text-to-{SQL} System",
    author = "Zeng, Jichuan  and
      Lin, Xi Victoria  and
      Xiong, Caiming  and
      Socher, Richard  and
      Lyu, Michael  and
      King, Irwin and
      Hoi, Steven C.H.",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.24",
    pages = "204--214"
}

Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation.
Tianlu Wang, Xi Victoria Lin, Nazneen Fatema Rajani, Bryan McCann, Vicente Ordonez and Caiming Xiong.
ACL 2020.
PDF Abstract Bibtex Blog

Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double-Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.
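
The projection step can be sketched in a few lines. This is a simplified illustration, not the published algorithm: it assumes a single frequency-related direction and a single gender direction are already given (how they are inferred is left out), and the function names are hypothetical.

```python
from typing import List


def remove_direction(vector: List[float], direction: List[float]) -> List[float]:
    """Subtract the component of `vector` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vector, direction)]


def double_hard_debias(vector: List[float],
                       frequency_direction: List[float],
                       gender_direction: List[float]) -> List[float]:
    """First purify against the frequency direction, then remove the gender direction."""
    purified = remove_direction(vector, frequency_direction)
    return remove_direction(purified, gender_direction)


# Toy 3-d embedding; axis 2 stands in for frequency, axis 0 for gender.
print(double_hard_debias([1.0, 2.0, 3.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]))
```

After the second projection the result has zero component along the gender direction; the first projection is what distinguishes this from plain Hard Debias.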

@InProceedings{Wang2020:double_hard_debias,
  author    = {Tianlu Wang and Xi Victoria Lin and Nazneen Fatema Rajani and Bryan McCann and Vicente Ordonez and Caiming Xiong},
  title     = {Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year      = {2020},
  address   = {Seattle, Washington, USA},
  publisher = {Association for Computational Linguistics}
}

2019

CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases.
Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter Lasecki and Dragomir Radev.
EMNLP 2019.
PDF Abstract Bibtex Leaderboard

We present CoSQL, a corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing of unanswerable questions. When user questions are answerable by SQL, the expert describes the SQL and execution results to the user, hence maintaining a natural interaction flow. CoSQL introduces new challenges compared to existing task-oriented dialogue datasets: (1) the dialogue states are grounded in SQL, a domain-independent executable representation, instead of domain-specific slot-value pairs, and (2) because testing is done on unseen databases, success requires generalizing to new domains. CoSQL includes three tasks: SQL-grounded dialogue state tracking, response generation from query results, and user dialogue act prediction. We evaluate a set of strong baselines for each task and show that CoSQL presents significant challenges for future research.

@inproceedings{Yu2019:cosql,
  author = {Tao Yu and Rui Zhang and Heyang Er and Suyi Li and Eric Xue and Bo Pang and Xi Victoria Lin and Yi Chern Tan and Tianze Shi and Zihan Li and Youxuan Jiang and Michihiro Yasunaga and Sungrok Shim and Tao Chen and Alexander Fabbri and Zifan Li and Luyao Chen and Yuwen Zhang and Shreya Dixit and Vincent Zhang and Caiming Xiong and Richard Socher and Walter Lasecki and Dragomir Radev},
  title = {CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural
               Language Processing, {EMNLP} 2019, Hong Kong, November 3-November 7, 2019},
  year = {2019}
}

Editing-based SQL Query Generation for Cross-Domain Context-Dependent Questions.
Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher and Dragomir Radev.
EMNLP 2019.
PDF Abstract Bibtex

We focus on the cross-domain context-dependent text-to-SQL generation task. Based on the observation that adjacent natural language questions are often linguistically dependent and their corresponding SQL queries tend to overlap, we utilize the interaction history by editing the previously predicted query to improve the generation quality. Our editing mechanism views SQL as sequences and reuses generation results at the token level in a simple manner. It is flexible to change individual tokens and robust to error propagation. Furthermore, to deal with complex table structures in different domains, we employ an utterance-table encoder and a table-aware decoder to incorporate the context of the user utterance and the table schema. We evaluate our approach on the SParC dataset and demonstrate the benefit of editing compared with the state-of-the-art baselines which generate SQL from scratch.

@inproceedings{Zhang2019:Editing,
  author = {Rui Zhang and Tao Yu and Heyang Er and Sungrok Shim and Eric Xue and Xi Victoria Lin and Tianze Shi and Caiming Xiong and Richard Socher and Dragomir Radev},
  title = {Editing-based SQL Query Generation for Cross-Domain Context-Dependent Questions},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural
               Language Processing, {EMNLP} 2019, Hong Kong, November 3-November 7, 2019},
  year = {2019}
}

SParC: Cross-Domain Semantic Parsing in Context.
Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, Dragomir Radev.
ACL 2019.
PDF Abstract Bibtex Leaderboard

We present SParC, a dataset for cross-domain Semantic Parsing in Context. It consists of 4,298 coherent question sequences (12k+ individual questions annotated with SQL queries), obtained from controlled user interactions with 200 complex databases over 138 domains. We provide an in-depth analysis of SParC and show that it introduces new challenges compared to existing datasets. SParC (1) demonstrates complex contextual dependencies, (2) has greater semantic diversity, and (3) requires generalization to new domains due to its cross-domain nature and the unseen databases at test time. We experiment with two state-of-the-art text-to-SQL models adapted to the context-dependent, cross-domain setup. The best model obtains an exact match accuracy of 20.2% over all questions and less than 10% over all interaction sequences, indicating that the cross-domain setting and the contextual phenomena of the dataset present significant challenges for future research.

@InProceedings{Yu2019:sparc,
  author    = {Tao Yu and Rui Zhang and Michihiro Yasunaga and Yi Chern Tan and Xi Victoria Lin and Suyi Li and Heyang Er and Irene Li and Bo Pang and Tao Chen and Emily Ji and Shreya Dixit and David Proctor and Sungrok Shim and Jonathan Kraft and Vincent Zhang and Caiming Xiong and Richard Socher and Dragomir Radev},
  title     = {SParC: Cross-Domain Semantic Parsing in Context},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  year      = {2019},
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics}
}

2018 and Before

Multi-Hop Knowledge Graph Reasoning with Reward Shaping.
Xi Victoria Lin, Richard Socher and Caiming Xiong.
EMNLP 2018.
PDF Abstract Bibtex Talk Code

Multi-hop reasoning is an effective approach for query answering (QA) over incomplete knowledge graphs (KGs). The problem can be formulated in a reinforcement learning (RL) setup, where a policy-based agent sequentially extends its inference path until it reaches a target. However, in an incomplete KG environment, the agent receives low-quality rewards corrupted by false negatives in the training data, which harms generalization at test time. Furthermore, since no golden action sequence is used for training, the agent can be misled by spurious search trajectories that incidentally lead to the correct answer. We propose two modeling advances to address both issues: (1) we reduce the impact of false negative supervision by adopting a pretrained one-hop embedding model to estimate the reward of unobserved facts; (2) we counter the sensitivity to spurious paths of on-policy RL by forcing the agent to explore a diverse set of paths using randomly generated edge masks. Our approach significantly improves over existing path-based KGQA models on several benchmark datasets and is comparable to or better than embedding-based models.
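
The two ideas can be sketched as small helper functions. This is a minimal illustration under assumptions, not the released code: `score_fn` stands in for the pretrained one-hop embedding model and is assumed to return a value in [0, 1], and the function names and the keep-at-least-one-edge rule are choices made for this example.

```python
import random
from typing import Callable, List, Set, Tuple

Edge = Tuple[str, str]  # (relation, target entity)


def shaped_reward(source: str, relation: str, reached: str,
                  known_answers: Set[str],
                  score_fn: Callable[[str, str, str], float]) -> float:
    """Return 1.0 for an observed answer, otherwise a soft score for the unobserved fact."""
    if reached in known_answers:
        return 1.0
    return score_fn(source, relation, reached)


def mask_edges(edges: List[Edge], drop_rate: float, rng: random.Random) -> List[Edge]:
    """Randomly hide outgoing edges so the agent is pushed onto more diverse paths."""
    kept = [edge for edge in edges if rng.random() >= drop_rate]
    # Always leave at least one edge so the agent can still move.
    return kept if kept else [rng.choice(edges)]


edges = [("born_in", "Paris"), ("lives_in", "Lyon"), ("works_in", "Nice")]
print(mask_edges(edges, 0.5, random.Random(0)))
print(shaped_reward("alice", "nationality", "France", {"France"}, lambda s, r, t: 0.25))
```

The soft score replaces the zero reward that a possibly-true but unobserved fact would otherwise receive, while the mask changes which actions are available at each sampling step.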

@inproceedings{LinRX2018:MultiHopKG,
  author = {Xi Victoria Lin and Richard Socher and Caiming Xiong},
  title = {Multi-Hop Knowledge Graph Reasoning with Reward Shaping},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural
               Language Processing, {EMNLP} 2018, Brussels, Belgium, October
               31-November 4, 2018},
  year = {2018}
}

NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System.
Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer and Michael D. Ernst.
LREC 2018.
PDF Abstract Bibtex Dataset & Code

We present new data and semantic parsing methods for the problem of mapping English sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to easily solve otherwise repetitive tasks (such as file manipulation, search, and application-specific scripting) by simply stating their intents in English. We take a first step in this domain, by providing a large new dataset of challenging but commonly used commands paired with their English descriptions, along with the baseline methods to establish performance levels on this task.

@inproceedings{LinWZE2018:NL2Bash,
      author = {Xi Victoria Lin and Chenglong Wang and Luke Zettlemoyer and Michael D. Ernst},
      title = {NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System},
      booktitle = {Proceedings of the Eleventh International Conference on Language Resources
                   and Evaluation {LREC} 2018, Miyazaki (Japan), 7-12 May, 2018.},
      year = {2018}
    }

Program Synthesis from Natural Language Using Recurrent Neural Networks.
Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, Michael D. Ernst.
University of Washington Department of Computer Science and Engineering Technical Report 2017.
PDF Abstract Bibtex Tellina Tool

Even if a competent programmer knows what she wants to do and can describe it in English, it can still be difficult to write code to achieve the goal. Existing resources, such as question-and-answer websites, tabulate specific operations that someone has wanted to perform in the past, but they are not effective in generalizing to new tasks, to compound tasks that require combining previous questions, or sometimes even to variations of listed tasks.

Our goal is to make programming easier and more productive by letting programmers use their own words and concepts to express the intended operation, rather than forcing them to accommodate the machine by memorizing its grammar. We have built a system that lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language for review and approval by the programmer. Our system, Tellina, does the translation using recurrent neural networks (RNNs), a state-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements.

We evaluated Tellina in the context of shell scripting. We trained Tellina's RNNs on textual descriptions of file system operations and bash one-liners, scraped from the web. Although recovering completely correct commands is challenging, Tellina achieves top-3 accuracy of 80% for producing the correct command structure. In a controlled study, programmers who had access to Tellina outperformed those who did not, even when Tellina's predictions were not completely correct, to a statistically significant degree.

@techreport{LinWPVZE2017:TR,
      author = {Xi Victoria Lin and Chenglong Wang and Deric Pang and Kevin Vu and Luke Zettlemoyer and Michael D. Ernst},
      title = {Program synthesis from natural language using recurrent neural networks},
      institution = {University of Washington Department of Computer Science and Engineering},
      number = {UW-CSE-17-03-01},
      address = {Seattle, WA, USA},
      month = mar,
      year = {2017}
    }

Compositional Learning of Embeddings for Relation Paths in Knowledge Base and Text.
Kristina Toutanova, Xi Victoria Lin, Scott Wen-tau Yih, Hoifung Poon and Chris Quirk.
ACL 2016.
PDF Abstract Bibtex

Modeling relation paths has offered significant gains in embedding models for knowledge base (KB) completion. However, enumerating paths between two entities is very expensive, and existing approaches typically resort to approximation with a sampled subset. This problem is particularly acute when text is jointly modeled with KB relations and used to provide direct evidence for facts mentioned in it. In this paper, we propose the first exact dynamic programming algorithm which enables efficient incorporation of all relation paths of bounded length, while modeling both relation types and intermediate nodes in the compositional path representations. We conduct a theoretical analysis of the efficiency gain from the approach. Experiments on two datasets show that it addresses representational limitations in prior approaches and improves accuracy in KB completion.

@inproceedings{DBLP:conf/acl/ToutanovaLYPQ16,
      author    = {Kristina Toutanova and
                   Victoria Lin and
                   Wen{-}tau Yih and
                   Hoifung Poon and
                   Chris Quirk},
      title     = {Compositional Learning of Embeddings for Relation Paths in Knowledge
                   Base and Text},
      booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational
                   Linguistics, {ACL} 2016, August 7-12, 2016, Berlin, Germany, Volume
                   1: Long Papers},
      year      = {2016},
      crossref  = {DBLP:conf/acl/2016-1},
      url       = {http://aclweb.org/anthology/P/P16/P16-1136.pdf},
      timestamp = {Mon, 15 Aug 2016 20:10:51 +0200},
      biburl    = {http://dblp.org/rec/bib/conf/acl/ToutanovaLYPQ16},
      bibsource = {dblp computer science bibliography, http://dblp.org}
    }

@proceedings{DBLP:conf/acl/2016-1,
      title     = {Proceedings of the 54th Annual Meeting of the Association for Computational
                   Linguistics, {ACL} 2016, August 7-12, 2016, Berlin, Germany, Volume
                   1: Long Papers},
      publisher = {Association for Computational Linguistics},
      year      = {2016},
      url       = {http://aclanthology.info/volumes/proceedings-of-the-54th-annual-meeting-of-the-association-for-computational-linguistics-volume-1-long-papers},
      isbn      = {978-1-945626-00-5},
      timestamp = {Mon, 15 Aug 2016 15:53:28 +0200},
      biburl    = {http://dblp.org/rec/bib/conf/acl/2016-1},
      bibsource = {dblp computer science bibliography, http://dblp.org}
    }

Multi-label Learning with Posterior Regularization.
Xi Victoria Lin, Sameer Singh, Luheng He, Ben Taskar, and Luke Zettlemoyer.
NeurIPS 2014 Workshop: Modern Machine Learning and NLP.

In many multi-label learning problems, especially as the number of labels grows, it is challenging to gather completely annotated data. This work presents a new approach for multi-label learning from incomplete annotations. The main assumption is that, because of label correlation, the true label matrix as well as the soft predictions of the classifiers should be approximately low-rank. We introduce a posterior regularization technique that enforces soft constraints on the classifiers, regularizing them to prefer sparse and low-rank predictions. Avoiding strict low-rank constraints results in classifiers that better fit the real data. The model can be trained efficiently using EM and stochastic gradient descent. Experiments in both the image and text domains demonstrate the contributions of each modeling assumption and show that the proposed approach achieves state-of-the-art performance on a number of challenging datasets.

@InProceedings{lin14_prlr,
      author =   {Xi Victoria Lin and Sameer Singh and Luheng He and Ben Taskar and Luke Zettlemoyer},
      title =    {Multi-label Learning with Posterior Regularization},
      booktitle =  {NeurIPS Workshop on Modern Machine Learning and Natural Language Processing},
      year =     2014,
      month =    dec,
      address =  {Montreal, Quebec, Canada},
      url =      {http://homes.cs.washington.edu/~xilin/pubs/mlnlp2014.pdf}
    }

Service

NAACL 2021 System Demonstration Track Co-chair

Organizing Committee

INT‑EX: 2020
NLC2CMD: 2020

Program Committee

ICML: 2019
ACL: 2020, 2019, 2018, 2017
EMNLP: 2020, 2018, 2017, 2016, 2015
NAACL: 2019
AACL: 2020
COLING: 2018
CoNLL: 2018
NLI: 2020

Miscellaneous

I was a PhD student of the late Ben Taskar.

The Taskar Center for Accessible Technology (TCAT) was launched by Anat Caspi in November 2014. I am excited about its mission. Anat's expertise and unique perspective will lead to accessible technologies that could change the lives of many.

I'm fascinated by different kinds of puzzles. At some point I tried to make a few: Sea Virus, Chocolate Crush.
