Name: LAION-400-MILLION OPEN DATASET
Creator: None
License: https://creativecommons.org/licenses/by/4.0/

Info hash	34b94abbcefef5a240358b9acd7920c8b675aacc
Last mirror activity	10:53:11 ago
Size	1.21TB (1,211,103,363,514 bytes)
Added	2021-09-14 16:57:58
Views	1141
Hits	1007
ID	4677
Type	multi
Downloaded	190 time(s)
Uploaded by	joecohen
Folder	laion400m-met-release
Num files	1265 files
Mirrors	4 complete, 2 downloading = 6 mirror(s) total

laion400m-met-release (1265 files)

laion400m-embeddings/images/img_emb_268.npy	1.02GB
laion400m-embeddings/images/img_emb_267.npy	1.02GB
laion400m-embeddings/images/img_emb_266.npy	1.02GB
laion400m-embeddings/images/img_emb_265.npy	1.02GB
laion400m-embeddings/images/img_emb_264.npy	1.02GB
laion400m-embeddings/images/img_emb_263.npy	1.02GB
laion400m-embeddings/images/img_emb_262.npy	1.02GB
laion400m-embeddings/images/img_emb_261.npy	1.02GB
laion400m-embeddings/images/img_emb_260.npy	1.02GB
laion400m-embeddings/images/img_emb_26.npy	1.02GB
laion400m-embeddings/images/img_emb_259.npy	1.02GB
laion400m-embeddings/images/img_emb_258.npy	1.02GB
laion400m-embeddings/images/img_emb_257.npy	1.02GB
laion400m-embeddings/images/img_emb_256.npy	1.02GB
laion400m-embeddings/images/img_emb_255.npy	1.02GB
laion400m-embeddings/images/img_emb_254.npy	1.02GB
laion400m-embeddings/images/img_emb_253.npy	1.02GB
laion400m-embeddings/images/img_emb_252.npy	1.02GB
laion400m-embeddings/images/img_emb_251.npy	1.02GB
laion400m-embeddings/images/img_emb_250.npy	1.02GB
laion400m-embeddings/images/img_emb_25.npy	1.02GB
laion400m-embeddings/images/img_emb_249.npy	1.02GB
laion400m-embeddings/images/img_emb_248.npy	1.02GB
laion400m-embeddings/images/img_emb_247.npy	1.02GB
laion400m-embeddings/images/img_emb_246.npy	1.02GB
laion400m-embeddings/images/img_emb_245.npy	1.02GB
laion400m-embeddings/images/img_emb_244.npy	1.02GB
laion400m-embeddings/images/img_emb_243.npy	1.02GB
laion400m-embeddings/images/img_emb_242.npy	1.02GB
laion400m-embeddings/images/img_emb_241.npy	1.02GB
laion400m-embeddings/images/img_emb_240.npy	1.02GB
laion400m-embeddings/images/img_emb_24.npy	1.02GB
laion400m-embeddings/images/img_emb_239.npy	1.02GB
laion400m-embeddings/images/img_emb_238.npy	1.02GB
laion400m-embeddings/images/img_emb_237.npy	1.02GB
laion400m-embeddings/images/img_emb_236.npy	1.02GB
laion400m-embeddings/images/img_emb_235.npy	1.02GB
laion400m-embeddings/images/img_emb_234.npy	1.02GB
laion400m-embeddings/images/img_emb_233.npy	1.02GB
laion400m-embeddings/images/img_emb_232.npy	1.02GB
laion400m-embeddings/images/img_emb_231.npy	1.02GB
laion400m-embeddings/images/img_emb_230.npy	1.02GB
laion400m-embeddings/images/img_emb_23.npy	1.02GB
laion400m-embeddings/images/img_emb_229.npy	1.02GB
laion400m-embeddings/images/img_emb_228.npy	1.02GB
laion400m-embeddings/images/img_emb_227.npy	1.02GB
laion400m-embeddings/images/img_emb_226.npy	1.02GB
laion400m-embeddings/images/img_emb_225.npy	1.02GB
laion400m-embeddings/images/img_emb_224.npy	1.02GB

Type: Dataset
Tags:

Bibtex:

@article{,
title= {LAION-400-MILLION OPEN DATASET},
journal= {},
author= {},
year= {},
url= {https://laion.ai/laion-400-open-dataset/},
abstract= {LAION-400M 
 
The world’s largest openly available image-text-pair dataset with 400 million samples.

# Concept and Content
The LAION-400M dataset is completely open and freely accessible.

All images and texts in the LAION-400M dataset have been filtered with OpenAI's CLIP by calculating the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching.
The image-text-pairs have been extracted from the Common Crawl web data dump and are from random web pages crawled between 2014 and 2021.
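
The filtering rule described above can be sketched as follows. This is only a minimal illustration, assuming the image and text embeddings are already available as NumPy arrays; the function names and the toy 2-D vectors are hypothetical and are not part of the LAION tooling.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def keep_mask(image_emb, text_emb, threshold=0.3):
    """True for pairs whose similarity is at or above the threshold."""
    return cosine_similarity(image_emb, text_emb) >= threshold

# Toy 2-D stand-ins for real CLIP embeddings (assumption: real ones are much wider)
image_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mask = keep_mask(image_emb, text_emb)
```

Here the second pair has a similarity of 0 and would be dropped, while the first and third pairs pass the 0.3 threshold.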

# Download Information
You can find:

- The CLIP image embeddings (NumPy files)
- The parquet files
- KNN index of image embeddings
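
Each embedding shard is roughly 1 GB, so memory-mapping a file is often more practical than reading it whole. The sketch below is one possible way to open a shard; the helper name `open_shard`, the 512-dimensional float16 layout, and the synthetic stand-in file are assumptions made for illustration, not documented properties of the release.

```python
import os
import tempfile
import numpy as np

def open_shard(path):
    """Memory-map an embedding shard so rows are read from disk on demand."""
    return np.load(path, mmap_mode="r")

# Demo on a small synthetic file; a real shard would be e.g. img_emb_268.npy
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "img_emb_0.npy")
    # Assumed layout: one row per image, 512 float16 values per row
    np.save(path, np.zeros((4, 512), dtype=np.float16))
    shard = open_shard(path)
    shape, dtype = shard.shape, str(shard.dtype)
    del shard  # release the memory map before the directory is removed
```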

# LAION-400M Dataset Statistics
LAION-400M and future, even bigger releases are in fact datasets of datasets. For instance, LAION-400M can be split by image size into smaller datasets like this:
```
Number of unique samples              413M
Number with height or width >= 1024   26M
Number with height and width >= 1024  9.6M
Number with height or width >= 512    112M
Number with height and width >= 512   67M
Number with height or width >= 256    268M
Number with height and width >= 256   211M
```
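
The "or" and "and" rows above differ only in whether one side or both sides must reach the minimum size. A minimal sketch of that counting, assuming each metadata row exposes its dimensions under the hypothetical keys `HEIGHT` and `WIDTH`:

```python
def count_by_size(samples, min_side):
    """Return (count with either side >= min_side, count with both sides >= min_side)."""
    either = sum(1 for s in samples
                 if s["HEIGHT"] >= min_side or s["WIDTH"] >= min_side)
    both = sum(1 for s in samples
               if s["HEIGHT"] >= min_side and s["WIDTH"] >= min_side)
    return either, both

# Toy metadata rows, not real dataset entries
samples = [
    {"HEIGHT": 1200, "WIDTH": 800},
    {"HEIGHT": 1024, "WIDTH": 1024},
    {"HEIGHT": 300, "WIDTH": 600},
    {"HEIGHT": 200, "WIDTH": 100},
]
counts = {side: count_by_size(samples, side) for side in (1024, 512, 256)}
```

The "and" count can never exceed the "or" count for the same size, which matches the pattern in the table.
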
By using the KNN index, specialized datasets can also be extracted by domain of interest. They are (or will be) sufficient in size to train domain-specialized models.
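
Domain extraction amounts to a nearest-neighbour query against the image embeddings. The sketch below uses a brute-force cosine search in NumPy as a stand-in for the released KNN index, whose own API is not described here; the function name and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_neighbors(embeddings, query, k):
    """Indices and scores of the k rows most similar to `query` (cosine similarity)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = emb @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy 2-D embeddings; the query stands for a domain of interest
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
idx, scores = top_k_neighbors(embeddings, query, k=2)
```

A brute-force scan like this does not scale to 400 million vectors, which is why a prebuilt index is provided.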

# Disclaimer & Content Warning
Our filtering protocol only removed NSFW images that were detected as illegal, so the dataset still contains NSFW content, which is marked accordingly in the metadata. Please use the demo links with caution. You can extract a “safe” subset by filtering out samples marked as NSFW or by applying stricter CLIP filtering.
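
Both options for building a “safe” subset can be combined in a single pass. The following is a sketch under assumptions: the keys `NSFW` and `similarity`, the tag values, and the stricter threshold of 0.35 are all hypothetical placeholders for whatever the metadata actually contains.

```python
def safe_subset(samples, min_similarity=0.3, blocked_tags=("NSFW",)):
    """Keep rows whose tag is not blocked and whose similarity meets the threshold."""
    return [
        s for s in samples
        if s["NSFW"] not in blocked_tags and s["similarity"] >= min_similarity
    ]

# Toy metadata rows; tag values are assumed for illustration
samples = [
    {"URL": "a", "NSFW": "UNLIKELY", "similarity": 0.41},
    {"URL": "b", "NSFW": "NSFW", "similarity": 0.45},
    {"URL": "c", "NSFW": "UNLIKELY", "similarity": 0.31},
    {"URL": "d", "NSFW": "UNSURE", "similarity": 0.38},
]
default_urls = [s["URL"] for s in safe_subset(samples)]
strict_urls = [
    s["URL"]
    for s in safe_subset(samples, min_similarity=0.35,
                         blocked_tags=("NSFW", "UNSURE"))
]
```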

 
There is a certain degree of duplication because we used URL+text as the deduplication criterion. The same image with the same caption may sit at different URLs, causing duplicates. The same image with different captions is not, however, considered duplicated.
 
Using KNN clustering should make it easy to further deduplicate by image content.
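
The difference between the two deduplication strategies can be illustrated as follows. The first function mirrors the URL+text criterion; the second is a simple greedy pairwise comparison of embeddings, used here in place of a real KNN clustering. The key names, the 0.95 threshold, and the example URLs are assumptions.

```python
import numpy as np

def dedup_by_url_and_text(samples):
    """Drop rows whose (URL, TEXT) pair has already been seen."""
    seen, kept = set(), []
    for s in samples:
        key = (s["URL"], s["TEXT"])
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

def dedup_by_embedding(embeddings, threshold=0.95):
    """Greedily keep a row only if it is not too similar to any kept row."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in range(len(emb)):
        if all(float(emb[i] @ emb[j]) < threshold for j in kept):
            kept.append(i)
    return kept

samples = [
    {"URL": "http://a.example/cat.jpg", "TEXT": "a cat"},
    {"URL": "http://b.example/cat.jpg", "TEXT": "a cat"},  # same image, other URL
    {"URL": "http://a.example/cat.jpg", "TEXT": "a cat"},  # exact repeat
]
by_key = dedup_by_url_and_text(samples)

# Toy embeddings: rows 0 and 1 are near-identical images, row 2 is different
embeddings = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
by_content = dedup_by_embedding(embeddings)
```

The key-based pass keeps the copy hosted at the second URL, whereas the embedding-based pass removes it. Note that content-based deduplication would also merge the same image carrying different captions, which the URL+text criterion deliberately keeps.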

# License

We are distributing the metadata dataset (the parquet files) under the most open Creative Commons license, CC-BY 4.0. It poses no particular restriction. The images are under their own copyright.},
keywords= {},
terms= {},
license= {https://creativecommons.org/licenses/by/4.0/},
superseded= {}
}