Text curated by joecohen

Name	Files	Added	Size	DLs	Seeders	Leechers
Yale YouTube Video Text	1	2014-10-20	434.77MB	8,268	8+	0
Enwiki Word2vec model 1000 Dimensions	1	2015-04-09	8.63GB	3,482	10	0
Structured Web Data Extraction Dataset (SWDE)	1	2015-11-29	207.31MB	2,810	5	0
Online News Popularity Data Set	1	2016-02-11	7.48MB	3,090	4+	0
Sentiment Labelled Sentences Data Set	1	2016-08-26	512.21kB	530	6+	0
MovieLens 20M Dataset	1	2016-12-16	198.70MB	2,088	9+	0
Microsoft Academic Graph - 2016/02/05	1	2016-12-25	28.94GB	265	3+	1
IMDb Large Movie Review Dataset	1	2018-10-16	26.40MB	990	6+	0
Wikitext-103	1	2018-10-16	190.20MB	931	9+	0
Wikitext-2	1	2018-10-16	4.07MB	249	2+	0
WMT 2015 French/English parallel texts	1	2018-10-16	2.60GB	2,264	2+	0
AG News	1	2018-10-16	11.78MB	222	2+	0
Amazon reviews - Full	1	2018-10-16	643.70MB	1,154	7+	0
Amazon reviews - Polarity	1	2018-10-16	688.34MB	1,110	2+	0
DBPedia ontology	1	2018-10-16	68.34MB	145	3+	0
Sogou news	1	2018-10-16	384.27MB	264	2+	0
Yelp reviews - Full	1	2018-10-16	196.15MB	402	3+	0
Yelp reviews - Polarity	1	2018-10-16	166.37MB	447	2+	0
Indiana University - Chest X-Rays (XML Reports)	1	2018-11-22	1.11MB	48,139	20+	0
Europarl v7 - training-parallel-europarl-v7.tgz (CS-EN, DE-EN, ES-EN, FR-EN)	1	2019-02-04	657.63MB	51	2+	0
UN corpus - training-parallel-un.tgz (ES-EN, FR-EN)	1	2019-02-04	2.37GB	62	2+	0
Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN)	1	2019-02-04	918.31MB	121	3+	0
r/WritingPrompts, Text (2018)	1	2019-06-19	87.47MB	412	5	0
Reading Text in the Wild with Convolutional Neural Networks	1	2021-11-12	10.68GB	43,188	26	0
SMS Spam Collection Data Set	2	2015-11-28	695.38kB	823	4+	0
30M Factoid Question-Answer Corpus (30MQA)	2	2018-11-29	529.34MB	4,982	7+	0
Flickr8k Dataset	2	2019-03-09	1.12GB	15,094	12+	1
Lerman Twitter 2010 Dataset	3	2014-08-15	292.17MB	3,494	13+	1
Synthetic Data for Text Localisation in Natural Images	15	2021-11-15	73.50GB	3,853	11	1
PMC Open Access Subset	16	2020-05-24	84.14GB	285	7+	0
OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized	395	2019-06-01	16.02GB	212	4	0
Phishing corpus	4555	2019-01-02	37.48MB	1,010	5+	0
