| Task | Previous SOTA | Our Baseline | ELMo + Baseline | Increase (Absolute/Relative) |
|---|---|---|---|---|
| SQuAD | Liu et al. (2017) | | | |
| SNLI | Chen et al. (2017) | | | |
| SRL | He et al. (2017) | | | |
| Coref | Lee et al. (2017) | | | |
| NER | Peters et al. (2017) | | | |
| SST-5 | McCann et al. (2017) | | | |
Table 1: Test set comparison of ELMo enhanced neural models with
state-of-the-art single model baselines across six benchmark NLP tasks. The
performance metric varies across tasks – accuracy for SNLI and SST-5;
F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small
test set sizes for NER and SST-5, we report the mean and standard
deviation across five runs with different random seeds. The increase
column lists both the absolute and relative improvements over our baseline.
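On the increase column: paired absolute/relative figures of this kind are conventionally read as an absolute score gain together with a relative error reduction, i.e. the share of the baseline's remaining error that the enhanced model removes. The snippet below illustrates the arithmetic with made-up scores; it is a gloss on the caption, not values from the table.

```python
# Hypothetical illustration of absolute gain vs. relative error reduction.
# The scores below are made up; they are not values from Table 1.
def relative_error_reduction(baseline: float, enhanced: float) -> float:
    """Fraction of the baseline's remaining error eliminated by the enhanced model."""
    return (enhanced - baseline) / (100.0 - baseline)

baseline, enhanced = 80.0, 84.0   # hypothetical scores on a 0-100 metric
absolute = enhanced - baseline    # +4.0 points absolute
relative = relative_error_reduction(baseline, enhanced)
print(f"absolute: +{absolute:.1f}, relative error reduction: {relative:.1%}")
# absolute: +4.0, relative error reduction: 20.0%
```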
- Question answering. The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains 100K+ crowdsourced question-answer pairs where the answer is a span in a given Wikipedia paragraph. Our baseline model (Clark and Gardner, 2017) is an improved version of the Bidirectional Attention Flow model (BiDAF; Seo et al., 2017); a sketch of its core attention mechanism appears after this list.
- Textual entailment. Textual entailment is the task of determining whether a “hypothesis” is true, given a “premise”. The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) provides approximately 550K hypothesis/premise pairs. Our baseline, the ESIM sequence model from Chen et al. (2017), uses a biLSTM to encode the premise and hypothesis, followed by a matrix attention layer, a local inference layer, another biLSTM inference composition layer, and finally a pooling operation before the output layer; a schematic sketch of this pipeline follows the list.
- Semantic role labeling. A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”. Our baseline follows He et al. (2017), who modeled SRL as a BIO tagging problem with a deep biLSTM; a minimal tagger in this spirit appears below.
- Coreference resolution. Coreference resolution is the task of clustering mentions in text that refer to the same underlying real-world entities. Our baseline model is the end-to-end span-based neural model of Lee et al. (2017); its span-pair scoring is illustrated below.
- Named entity extraction. The CoNLL 2003 NER task (Sang and Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC). Following recent state-of-the-art systems (Lample et al., 2016; Peters et al., 2017), the baseline model uses pre-trained word embeddings, a character-based CNN representation, two biLSTM layers and a conditional random field (CRF) loss (Lafferty et al., 2001), similar to Collobert et al. (2011). Test-time Viterbi decoding under the CRF is shown in a sketch below.
- Sentiment analysis. The fine-grained sentiment classification task in the Stanford Sentiment Treebank (SST-5; Socher et al., 2013) involves selecting one of five labels (from very negative to very positive) to describe a sentence from a movie review.
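Below is a minimal PyTorch sketch of the bidirectional attention that gives BiDAF (Seo et al., 2017) its name, referenced in the question answering bullet. It is illustrative only, not the Clark and Gardner (2017) baseline: the trilinear weight `w`, the absence of batching, and all dimensions are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

# Bidirectional attention sketch: context H (T, d) attends to question U (J, d)
# and vice versa, producing a query-aware context representation.
def bidaf_attention(H: torch.Tensor, U: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    T, d = H.shape
    J, _ = U.shape
    # Trilinear similarity S[t, j] = w . [h_t ; u_j ; h_t * u_j]
    h = H.unsqueeze(1).expand(T, J, d)
    u = U.unsqueeze(0).expand(T, J, d)
    S = torch.cat([h, u, h * u], dim=-1) @ w             # (T, J)
    # Context-to-question: each context word attends over question words.
    c2q = F.softmax(S, dim=1) @ U                        # (T, d)
    # Question-to-context: attend over context via the max-similarity rows.
    q2c = F.softmax(S.max(dim=1).values, dim=0) @ H      # (d,)
    q2c = q2c.unsqueeze(0).expand(T, d)                  # tiled over positions
    # Query-aware representation fed to the downstream modeling layers.
    return torch.cat([H, c2q, H * c2q, H * q2c], dim=-1) # (T, 4d)

G = bidaf_attention(torch.randn(30, 64), torch.randn(8, 64), torch.randn(3 * 64))
print(G.shape)  # torch.Size([30, 256])
```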
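The ESIM pipeline from the textual entailment bullet, condensed into a runnable PyTorch sketch: biLSTM encoding, matrix attention, local inference enhancement, a composition biLSTM, and pooling. Layer sizes, the shared encoder, and the single-linear classifier head are simplifying assumptions, not details from Chen et al. (2017).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ESIM-style sketch: encode, align, enhance, compose, pool, classify.
class ESIMSketch(nn.Module):
    def __init__(self, dim: int = 64, n_classes: int = 3):
        super().__init__()
        self.encode = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.compose = nn.LSTM(8 * dim, dim, bidirectional=True, batch_first=True)
        self.classify = nn.Linear(8 * dim, n_classes)

    @staticmethod
    def _enhance(x, aligned):
        # Local inference features: [x ; aligned ; x - aligned ; x * aligned]
        return torch.cat([x, aligned, x - aligned, x * aligned], dim=-1)

    def forward(self, premise, hypothesis):          # (B, Tp, dim), (B, Th, dim)
        a, _ = self.encode(premise)                  # (B, Tp, 2*dim)
        b, _ = self.encode(hypothesis)               # (B, Th, 2*dim)
        e = a @ b.transpose(1, 2)                    # matrix attention (B, Tp, Th)
        a_hat = F.softmax(e, dim=2) @ b              # premise aligned to hypothesis
        b_hat = F.softmax(e, dim=1).transpose(1, 2) @ a
        va, _ = self.compose(self._enhance(a, a_hat))
        vb, _ = self.compose(self._enhance(b, b_hat))
        # Pool each sentence with mean and max, then classify the pair.
        v = torch.cat([va.mean(1), va.max(1).values,
                       vb.mean(1), vb.max(1).values], dim=-1)
        return self.classify(v)

logits = ESIMSketch()(torch.randn(2, 9, 64), torch.randn(2, 7, 64))
print(logits.shape)  # torch.Size([2, 3])
```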
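A toy version of SRL as BIO tagging: a stacked biLSTM reads word vectors plus a predicate-indicator feature and scores a BIO label per token. The label set, the depth, and the use of a standard (non-interleaved) bidirectional LSTM are illustrative simplifications, not the He et al. (2017) architecture.

```python
import torch
import torch.nn as nn

# Illustrative BIO label set; a real SRL inventory is much larger.
LABELS = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1", "B-V"]

class BIOTagger(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # +1 input feature marks which token is the predicate.
        self.lstm = nn.LSTM(dim + 1, dim, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * dim, len(LABELS))

    def forward(self, words, predicate_mask):        # (B, T, dim), (B, T)
        x = torch.cat([words, predicate_mask.unsqueeze(-1).float()], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                           # (B, T, |labels|) tag scores

scores = BIOTagger()(torch.randn(1, 5, 64), torch.tensor([[0, 1, 0, 0, 0]]))
print(scores.argmax(-1))  # one BIO label index per token
```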
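A sketch of the span-pair scoring in end-to-end coreference models in the style of Lee et al. (2017): each candidate span receives a mention score, each ordered pair an antecedent score, and every span may instead select a dummy antecedent with fixed score zero. The single-linear scorers and precomputed span embeddings are brevity assumptions; the actual model uses deeper feed-forward scorers and aggressive span pruning.

```python
import torch
import torch.nn as nn

# Span-pair scoring sketch: s(i, j) = s_m(i) + s_m(j) + s_a(i, j).
class CorefScorer(nn.Module):
    def __init__(self, span_dim: int = 64):
        super().__init__()
        self.mention = nn.Linear(span_dim, 1)
        self.antecedent = nn.Linear(3 * span_dim, 1)

    def forward(self, spans):                        # (N, span_dim) span embeddings
        sm = self.mention(spans).squeeze(-1)         # (N,) mention scores
        N, d = spans.shape
        gi = spans.unsqueeze(1).expand(N, N, d)
        gj = spans.unsqueeze(0).expand(N, N, d)
        sa = self.antecedent(torch.cat([gi, gj, gi * gj], -1)).squeeze(-1)
        pair = sm[:, None] + sm[None, :] + sa        # (N, N) pairwise scores
        # Only earlier spans (j < i) are valid antecedents.
        valid = torch.ones(N, N).tril(-1).bool()
        pair = pair.masked_fill(~valid, float("-inf"))
        dummy = torch.zeros(N, 1)                    # the "no antecedent" option
        return torch.cat([dummy, pair], dim=1)       # (N, 1 + N) antecedent scores

scores = CorefScorer()(torch.randn(6, 64))
print(scores.argmax(-1))  # 0 means "no antecedent" for that span
```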
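Finally, a sketch of how the NER tagger's CRF layer is decoded at test time: Viterbi search combines the biLSTM's per-token emission scores with a learned tag-transition matrix to pick the highest-scoring tag sequence. The tag set and the random score tensors are placeholders.

```python
import torch

# CoNLL 2003 entity types in a BIO scheme, for illustration.
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]

def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list[int]:
    """emissions: (T, K) per-token tag scores; transitions: (K, K) from->to scores."""
    T, K = emissions.shape
    score = emissions[0]                 # best score ending in each tag at step 0
    backpointers = []
    for t in range(1, T):
        # Ending at tag k: best over previous tags of (prev + transition + emission).
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)  # (K, K)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Follow backpointers from the best final tag.
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]

tags = viterbi_decode(torch.randn(6, len(TAGS)), torch.randn(len(TAGS), len(TAGS)))
print([TAGS[t] for t in tags])
```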