| Task | Previous SOTA | Our Baseline | ELMo + Baseline | Increase (Absolute/Relative) |
|---|---|---|---|---|
| SQuAD | Liu et al. (2017) | | | |
| SNLI | Chen et al. (2017) | | | |
| SRL | He et al. (2017) | | | |
| Coref | Lee et al. (2017) | | | |
| NER | Peters et al. (2017) | | | |
| SST-5 | McCann et al. (2017) | | | |
Table 1: Test set comparison of ELMo enhanced neural models with
state-of-the-art single model baselines across six benchmark NLP tasks. The
performance metric varies across tasks – accuracy for SNLI and SST-5;
F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small
test set sizes for NER and SST-5, we report the mean and standard
deviation across five runs with different random seeds. The increase
column lists both the absolute and relative improvements over our baseline.
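On the increase column: paired absolute/relative figures of this kind are conventionally read as an absolute score gain together with a relative error reduction, i.e. the share of the baseline's remaining error that the enhanced model removes. The snippet below illustrates the arithmetic with made-up scores; it is a gloss on the caption, not values from the table.

```python
# Hypothetical illustration of absolute gain vs. relative error reduction.
# The scores below are made up; they are not values from Table 1.
def relative_error_reduction(baseline: float, enhanced: float) -> float:
    """Fraction of the baseline's remaining error eliminated by the enhanced model."""
    return (enhanced - baseline) / (100.0 - baseline)

baseline, enhanced = 80.0, 84.0   # hypothetical scores on a 0-100 metric
absolute = enhanced - baseline    # +4.0 points absolute
relative = relative_error_reduction(baseline, enhanced)
print(f"absolute: +{absolute:.1f}, relative error reduction: {relative:.1%}")
# absolute: +4.0, relative error reduction: 20.0%
```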
- Question answering. The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains 100K+ crowdsourced question-answer pairs where the answer is a span in a given Wikipedia paragraph. Our baseline model (Clark and Gardner, 2017) is an improved version of the Bidirectional Attention Flow model (BiDAF; Seo et al., 2017); a sketch of its core attention mechanism appears after this list.
- Textual entailment. Textual entailment is the task of determining whether a “hypothesis” is true, given a “premise”. The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) provides approximately 550K hypothesis/premise pairs. Our baseline, the ESIM sequence model from Chen et al. (2017), uses a biLSTM to encode the premise and hypothesis, followed by a matrix attention layer, a local inference layer, another biLSTM inference composition layer, and finally a pooling operation before the output layer; a schematic sketch of this pipeline follows the list.
- Semantic role labeling. A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”. Our baseline follows He et al. (2017), who modeled SRL as a BIO tagging problem with a deep biLSTM; a minimal tagger in this spirit appears below.
- Coreference resolution. Coreference resolution is the task of clustering mentions in text that refer to the same underlying real-world entities. Our baseline model is the end-to-end span-based neural model of Lee et al. (2017); its span-pair scoring is illustrated below.
- Named entity extraction. The CoNLL 2003 NER task (Sang and Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC). Following recent state-of-the-art systems (Lample et al., 2016; Peters et al., 2017), the baseline model uses pre-trained word embeddings, a character-based CNN representation, two biLSTM layers and a conditional random field (CRF) loss (Lafferty et al., 2001), similar to Collobert et al. (2011). Test-time Viterbi decoding under the CRF is shown in a sketch below.
- Sentiment analysis. The fine-grained sentiment classification task in the Stanford Sentiment Treebank (SST-5; Socher et al., 2013) involves selecting one of five labels (from very negative to very positive) to describe a sentence from a movie review.
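Below is a minimal PyTorch sketch of the bidirectional attention that gives BiDAF (Seo et al., 2017) its name, referenced in the question answering bullet. It is illustrative only, not the Clark and Gardner (2017) baseline: the trilinear weight `w`, the absence of batching, and all dimensions are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

# Bidirectional attention sketch: context H (T, d) attends to question U (J, d)
# and vice versa, producing a query-aware context representation.
def bidaf_attention(H: torch.Tensor, U: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    T, d = H.shape
    J, _ = U.shape
    # Trilinear similarity S[t, j] = w . [h_t ; u_j ; h_t * u_j]
    h = H.unsqueeze(1).expand(T, J, d)
    u = U.unsqueeze(0).expand(T, J, d)
    S = torch.cat([h, u, h * u], dim=-1) @ w             # (T, J)
    # Context-to-question: each context word attends over question words.
    c2q = F.softmax(S, dim=1) @ U                        # (T, d)
    # Question-to-context: attend over context via the max-similarity rows.
    q2c = F.softmax(S.max(dim=1).values, dim=0) @ H      # (d,)
    q2c = q2c.unsqueeze(0).expand(T, d)                  # tiled over positions
    # Query-aware representation fed to the downstream modeling layers.
    return torch.cat([H, c2q, H * c2q, H * q2c], dim=-1) # (T, 4d)

G = bidaf_attention(torch.randn(30, 64), torch.randn(8, 64), torch.randn(3 * 64))
print(G.shape)  # torch.Size([30, 256])
```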
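The ESIM pipeline from the textual entailment bullet, condensed into a runnable PyTorch sketch: biLSTM encoding, matrix attention, local inference enhancement, a composition biLSTM, and pooling. Layer sizes, the shared encoder, and the single-linear classifier head are simplifying assumptions, not details from Chen et al. (2017).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ESIM-style sketch: encode, align, enhance, compose, pool, classify.
class ESIMSketch(nn.Module):
    def __init__(self, dim: int = 64, n_classes: int = 3):
        super().__init__()
        self.encode = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.compose = nn.LSTM(8 * dim, dim, bidirectional=True, batch_first=True)
        self.classify = nn.Linear(8 * dim, n_classes)

    @staticmethod
    def _enhance(x, aligned):
        # Local inference features: [x ; aligned ; x - aligned ; x * aligned]
        return torch.cat([x, aligned, x - aligned, x * aligned], dim=-1)

    def forward(self, premise, hypothesis):          # (B, Tp, dim), (B, Th, dim)
        a, _ = self.encode(premise)                  # (B, Tp, 2*dim)
        b, _ = self.encode(hypothesis)               # (B, Th, 2*dim)
        e = a @ b.transpose(1, 2)                    # matrix attention (B, Tp, Th)
        a_hat = F.softmax(e, dim=2) @ b              # premise aligned to hypothesis
        b_hat = F.softmax(e, dim=1).transpose(1, 2) @ a
        va, _ = self.compose(self._enhance(a, a_hat))
        vb, _ = self.compose(self._enhance(b, b_hat))
        # Pool each sentence with mean and max, then classify the pair.
        v = torch.cat([va.mean(1), va.max(1).values,
                       vb.mean(1), vb.max(1).values], dim=-1)
        return self.classify(v)

logits = ESIMSketch()(torch.randn(2, 9, 64), torch.randn(2, 7, 64))
print(logits.shape)  # torch.Size([2, 3])
```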
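A toy version of SRL as BIO tagging: a stacked biLSTM reads word vectors plus a predicate-indicator feature and scores a BIO label per token. The label set, the depth, and the use of a standard (non-interleaved) bidirectional LSTM are illustrative simplifications, not the He et al. (2017) architecture.

```python
import torch
import torch.nn as nn

# Illustrative BIO label set; a real SRL inventory is much larger.
LABELS = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1", "B-V"]

class BIOTagger(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # +1 input feature marks which token is the predicate.
        self.lstm = nn.LSTM(dim + 1, dim, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * dim, len(LABELS))

    def forward(self, words, predicate_mask):        # (B, T, dim), (B, T)
        x = torch.cat([words, predicate_mask.unsqueeze(-1).float()], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                           # (B, T, |labels|) tag scores

scores = BIOTagger()(torch.randn(1, 5, 64), torch.tensor([[0, 1, 0, 0, 0]]))
print(scores.argmax(-1))  # one BIO label index per token
```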
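A sketch of the span-pair scoring in end-to-end coreference models in the style of Lee et al. (2017): each candidate span receives a mention score, each ordered pair an antecedent score, and every span may instead select a dummy antecedent with fixed score zero. The single-linear scorers and precomputed span embeddings are brevity assumptions; the actual model uses deeper feed-forward scorers and aggressive span pruning.

```python
import torch
import torch.nn as nn

# Span-pair scoring sketch: s(i, j) = s_m(i) + s_m(j) + s_a(i, j).
class CorefScorer(nn.Module):
    def __init__(self, span_dim: int = 64):
        super().__init__()
        self.mention = nn.Linear(span_dim, 1)
        self.antecedent = nn.Linear(3 * span_dim, 1)

    def forward(self, spans):                        # (N, span_dim) span embeddings
        sm = self.mention(spans).squeeze(-1)         # (N,) mention scores
        N, d = spans.shape
        gi = spans.unsqueeze(1).expand(N, N, d)
        gj = spans.unsqueeze(0).expand(N, N, d)
        sa = self.antecedent(torch.cat([gi, gj, gi * gj], -1)).squeeze(-1)
        pair = sm[:, None] + sm[None, :] + sa        # (N, N) pairwise scores
        # Only earlier spans (j < i) are valid antecedents.
        valid = torch.ones(N, N).tril(-1).bool()
        pair = pair.masked_fill(~valid, float("-inf"))
        dummy = torch.zeros(N, 1)                    # the "no antecedent" option
        return torch.cat([dummy, pair], dim=1)       # (N, 1 + N) antecedent scores

scores = CorefScorer()(torch.randn(6, 64))
print(scores.argmax(-1))  # 0 means "no antecedent" for that span
```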
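Finally, a sketch of how the NER tagger's CRF layer is decoded at test time: Viterbi search combines the biLSTM's per-token emission scores with a learned tag-transition matrix to pick the highest-scoring tag sequence. The tag set and the random score tensors are placeholders.

```python
import torch

# CoNLL 2003 entity types in a BIO scheme, for illustration.
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]

def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list[int]:
    """emissions: (T, K) per-token tag scores; transitions: (K, K) from->to scores."""
    T, K = emissions.shape
    score = emissions[0]                 # best score ending in each tag at step 0
    backpointers = []
    for t in range(1, T):
        # Ending at tag k: best over previous tags of (prev + transition + emission).
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)  # (K, K)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Follow backpointers from the best final tag.
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]

tags = viterbi_decode(torch.randn(6, len(TAGS)), torch.randn(len(TAGS), len(TAGS)))
print([TAGS[t] for t in tags])
```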