| Yale YouTube Video Text | 1 | 2014-10-20 | 434.77MB | 7,521 | 8+ | 0 |
| Enwiki Word2vec model 1000 Dimensions | 1 | 2015-04-09 | 8.63GB | 3,473 | 9 | 0 |
| Structured Web Data Extraction Dataset (SWDE) | 1 | 2015-11-29 | 207.31MB | 2,758 | 7 | 0 |
| Online News Popularity Data Set | 1 | 2016-02-11 | 7.48MB | 3,070 | 7+ | 0 |
| Sentiment Labelled Sentences Data Set | 1 | 2016-08-26 | 512.21kB | 508 | 4+ | 0 |
| MovieLens 20M Dataset | 1 | 2016-12-16 | 198.70MB | 2,041 | 9+ | 0 |
| Microsoft Academic Graph - 2016/02/05 | 1 | 2016-12-25 | 28.94GB | 249 | 2+ | 0 |
| IMDb Large Movie Review Dataset | 1 | 2018-10-16 | 26.40MB | 889 | 3+ | 0 |
| Wikitext-103 | 1 | 2018-10-16 | 190.20MB | 480 | 4+ | 0 |
| Wikitext-2 | 1 | 2018-10-16 | 4.07MB | 248 | 2+ | 0 |
| WMT 2015 French/English parallel texts | 1 | 2018-10-16 | 2.60GB | 1,661 | 3+ | 0 |
| AG News | 1 | 2018-10-16 | 11.78MB | 220 | 2+ | 0 |
| Amazon reviews - Full | 1 | 2018-10-16 | 643.70MB | 1,109 | 4+ | 0 |
| Amazon reviews - Polarity | 1 | 2018-10-16 | 688.34MB | 1,094 | 1+ | 0 |
| DBPedia ontology | 1 | 2018-10-16 | 68.34MB | 126 | 2+ | 0 |
| Sogou news | 1 | 2018-10-16 | 384.27MB | 256 | 2+ | 0 |
| Yelp reviews - Full | 1 | 2018-10-16 | 196.15MB | 383 | 2+ | 0 |
| Yelp reviews - Polarity | 1 | 2018-10-16 | 166.37MB | 441 | 2+ | 0 |
| Indiana University - Chest X-Rays (XML Reports) | 1 | 2018-11-22 | 1.11MB | 39,746 | 25+ | 0 |
| Europarl v7 - training-parallel-europarl-v7.tgz (CS-EN, DE-EN, ES-EN, FR-EN) | 1 | 2019-02-04 | 657.63MB | 48 | 2+ | 0 |
| UN corpus - training-parallel-un.tgz (ES-EN, FR-EN) | 1 | 2019-02-04 | 2.37GB | 57 | 2+ | 0 |
| Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN) | 1 | 2019-02-04 | 918.31MB | 113 | 2+ | 0 |
| r/WritingPrompts, Text (2018) | 1 | 2019-06-19 | 87.47MB | 400 | 4 | 0 |
| Reading Text in the Wild with Convolutional Neural Networks | 1 | 2021-11-12 | 10.68GB | 39,004 | 30 | 0 |
| SMS Spam Collection Data Set | 2 | 2015-11-28 | 695.38kB | 762 | 4+ | 0 |
| 30M Factoid Question-Answer Corpus (30MQA) | 2 | 2018-11-29 | 529.34MB | 3,942 | 5+ | 0 |
| Flickr8k Dataset | 2 | 2019-03-09 | 1.12GB | 11,710 | 24+ | 0 |
| Lerman Twitter 2010 Dataset | 3 | 2014-08-15 | 292.17MB | 3,432 | 12+ | 0 |
| Synthetic Data for Text Localisation in Natural Images | 15 | 2021-11-15 | 73.50GB | 3,693 | 14 | 2 |
| PMC Open Access Subset | 16 | 2020-05-24 | 84.14GB | 237 | 4+ | 0 |
| OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized | 395 | 2019-06-01 | 16.02GB | 207 | 3 | 0 |
| Phishing corpus | 4555 | 2019-01-02 | 37.48MB | 967 | 2+ | 0 |