Conceptual 12M (CC12M)

Data rows: 10,259,699 image-caption pairs

Conceptual 12M (CC12M) is a dataset of ~12 million image-text pairs intended for vision-and-language pre-training. It is larger than Conceptual Captions (CC3M), a dataset widely used for pre-training and end-to-end training of image captioning models, and covers a much more diverse set of visual concepts.