Datasets for training machine learning algorithms

There is a collection of training datasets available under /cephyr/NOBACKUP/Datasets. IMPORTANT The data can only be used for non-commercial, scientific or educational purposes.

Please pay close attention to the terms of use and the licensing information included with each dataset. In some cases, citation of the original work is required.

By downloading or accessing the data from Cephyr, you agree to the terms and conditions published by the copyright holder of the data. It is the users' responsibility to check the terms of use of the data and to make sure their use case is permitted by the copyright owner. Where explicit terms of use are missing from the provider's side, the NonCommercial-ShareAlike attribute of the Creative Commons 3.0 license applies.

If a required, publicly available dataset is missing from the following list, please contact us at support. We will do our best to make datasets centrally available.

The following datasets are currently available:

COCO: large-scale object detection, segmentation, and captioning dataset. COCO has several features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labelled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints
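The annotations ship as COCO-format JSON files. As a minimal sketch (standard library only), the following tallies object instances per category from such a file; the exact annotation file locations under /cephyr/NOBACKUP/Datasets should be checked on the system:

```python
import json
from collections import Counter

def count_instances_per_category(annotation_file):
    """Count object instances per category in a COCO-format annotation file.

    Assumes the standard COCO layout: a top-level 'categories' list of
    {'id', 'name'} records and an 'annotations' list whose entries carry
    a 'category_id' field.
    """
    with open(annotation_file) as f:
        coco = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    return dict(counts)
```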


Objects365

Objects365 is designed for object detection research with a focus on diverse objects in the wild:

  • 365 categories
  • 2 million images
  • 30 million bounding boxes

IMPORTANT The developers require you to cite the following paper if you use this dataset:

@article{Objects365,
  title={Objects365: A Large-scale, High-quality Dataset for Object Detection},
  author={Shuai Shao and Zeming Li and Tianyuan Zhang and Chao Peng and Gang Yu and Jing Li and Xiangyu Zhang and Jian Sun}
}

LibriSpeech ASR corpus

LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.


The dataset is licensed under the CC BY 4.0 license.
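The corpus is organised into speaker/chapter directories holding FLAC audio plus a per-chapter *.trans.txt file with one "&lt;utterance-id&gt; &lt;transcript&gt;" pair per line. A minimal parsing sketch under that layout assumption:

```python
def parse_trans_file(lines):
    """Parse LibriSpeech-style *.trans.txt lines into {utterance_id: transcript}.

    Each non-empty line is assumed to be '<utterance-id> <TRANSCRIPT>', e.g.
    '84-121123-0000 GO DO YOU HEAR'.
    """
    transcripts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, _, text = line.partition(" ")
        transcripts[utt_id] = text
    return transcripts
```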


Open Images

Open Images is a dataset of 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localised narratives. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations.


The annotations are licensed by Google LLC under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. Note: we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.
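The box annotations are distributed as CSV files. A minimal reading sketch with the standard library, assuming the documented ImageID/LabelName/XMin/XMax/YMin/YMax columns with coordinates normalised to [0, 1]:

```python
import csv

def read_boxes(csv_lines):
    """Read Open Images-style box annotations from an iterable of CSV lines.

    Assumes columns ImageID, LabelName, XMin, XMax, YMin, YMax; extra
    columns (Source, Confidence, ...) are ignored. Coordinates are kept
    as normalised (xmin, ymin, xmax, ymax) tuples.
    """
    boxes = []
    for row in csv.DictReader(csv_lines):
        boxes.append({
            "image_id": row["ImageID"],
            "label": row["LabelName"],
            "box": (float(row["XMin"]), float(row["YMin"]),
                    float(row["XMax"]), float(row["YMax"])),
        })
    return boxes
```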


IMPORTANT The developers require you to cite the following papers if you use this dataset:

@article{OpenImagesV4,
  author = {Alina Kuznetsova and Hassan Rom and Neil Alldrin and Jasper Uijlings and Ivan Krasin and Jordi Pont-Tuset and Shahab Kamali and Stefan Popov and Matteo Malloci and Alexander Kolesnikov and Tom Duerig and Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale},
  year = {2020},
  journal = {IJCV}
}

@inproceedings{OpenImagesSegmentation,
  author = {Rodrigo Benenson and Stefan Popov and Vittorio Ferrari},
  title = {Large-scale interactive object segmentation with human annotators},
  booktitle = {CVPR},
  year = {2019}
}

@article{LocalizedNarratives,
  author  = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title   = {Connecting Vision and Language with Localized Narratives},
  journal = {arXiv},
  volume  = {1912.03098},
  year    = {2019}
}

@article{OpenImages,
  title={OpenImages: A public dataset for large-scale multi-label and multi-class image classification.},
  author={Krasin, Ivan and Duerig, Tom and Alldrin, Neil and Ferrari, Vittorio and Abu-El-Haija, Sami and Kuznetsova, Alina and Rom, Hassan and Uijlings, Jasper and Popov, Stefan and Kamali, Shahab and Malloci, Matteo and Pont-Tuset, Jordi and Veit, Andreas and Belongie, Serge and Gomes, Victor and Gupta, Abhinav and Sun, Chen and Chechik, Gal and Cai, David and Feng, Zheyun and Narayanan, Dhyanesh and Murphy, Kevin},
  journal={Dataset available from}
}

YouTube-VOS Instance Segmentation

Terms-of-Use: The annotations in this dataset belong to the organisers of the challenge and are licensed under a Creative Commons Attribution 4.0 License.

The data is released for non-commercial research purpose only.

The organisers of the dataset as well as their employers make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the organisers, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted videos that he or she may create from the Database. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions. The organisers reserve the right to terminate Researcher's access to the Database at any time. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorised to enter into this agreement on behalf of such employer.

MPI Sintel Flow Dataset:

  • A data set for the evaluation of optical flow derived from the open source 3D animated short film, Sintel


  • Pattern Recognition and Image Processing

NuScenes (v1.0)

  • Terms of use: Non-commercial use only. See:
  • Overview:
    • Trainval (700+150 scenes) is packaged into 10 different archives that each contain 85 scenes.
    • Test (150 scenes) is used for challenges and does not come with object annotations.
    • Mini (10 scenes) is a subset of trainval used to explore the data without having to download the entire dataset.
    • The meta data is provided separately and includes the annotations, ego vehicle poses, calibration, maps and log information.
  • Citation: please use the following citation when referencing nuScenes:
@article{nuscenes2019,
  title={nuScenes: A multimodal dataset for autonomous driving},
  author={Holger Caesar and Varun Bankiti and Alex H. Lang and Sourabh Vora and
          Venice Erin Liong and Qiang Xu and Anush Krishnan and Yu Pan and
          Giancarlo Baldan and Oscar Beijbom},
  journal={arXiv preprint arXiv:1903.11027},
  year={2019}
}
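The nuScenes metadata is a set of relational JSON tables whose records are cross-referenced by tokens. A minimal join sketch in plain Python, assuming the standard 'token' and 'sample_token' fields (the nuscenes-devkit provides the same functionality with much more convenience):

```python
def index_by_token(table):
    """Index a nuScenes-style metadata table (a list of dict records)
    by its unique 'token' field for O(1) lookup."""
    return {rec["token"]: rec for rec in table}

def annotations_for_sample(sample_token, sample_annotations):
    """Collect the annotation records belonging to one sample (keyframe),
    assuming each annotation carries a 'sample_token' back-reference."""
    return [a for a in sample_annotations if a["sample_token"] == sample_token]
```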

Waymo open dataset (v1.2)

The Waymo Open Dataset comprises high-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions.


Data of Places365-Standard

There are 1.8 million train images from 365 scene categories in the Places365-Standard, which are used to train the Places365 CNNs. There are 50 images per category in the validation set and 900 images per category in the testing set.
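Places365 distributes a category list whose lines map a category path to its class index (for example "/a/airfield 0"). A small parsing sketch, assuming that format:

```python
def load_categories(lines):
    """Parse Places365-style category lines ('/a/airfield 0') into a dict
    mapping class index to a readable category name.

    The leading '/x/' letter-group prefix is stripped; any deeper path
    components (e.g. '/f/field/wild') are kept joined with '/'.
    """
    index_to_name = {}
    for line in lines:
        path, idx = line.split()
        name = "/".join(path.split("/")[2:]) if path.startswith("/") else path
        index_to_name[int(idx)] = name
    return index_to_name
```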


Please cite the following paper if you use Places365 data or CNNs:

Places: A 10 Million Image Database for Scene Recognition. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Terms of use:

By downloading the image data you agree to the following terms:

- You will use the data only for non-commercial research and educational purposes.
- You will NOT distribute the above images.
- Massachusetts Institute of Technology makes no representations or warranties regarding the data, including but not limited to warranties of non-infringement or fitness for a particular purpose.
- You accept full responsibility for your use of the data and shall defend and indemnify Massachusetts Institute of Technology, including its employees, officers and agents, against any and all claims arising from your use of the data, including but not limited to your use of any copies of copyrighted images that you may create from the data.

Lyft Level 5

A comprehensive, large-scale dataset featuring the raw sensor camera and LiDAR inputs as perceived by a fleet of multiple, high-end, autonomous vehicles in a bounded geographic area. This dataset also includes high-quality, human-labelled 3D bounding boxes of traffic agents and an underlying HD spatial semantic map.

For a related tutorial, see


If you use the dataset for scientific work, please cite the following:

@misc{lyft2019,
  title = {Lyft Level 5 AV Dataset 2019},
  author = {Kesten, R. and Usman, M. and Houston, J. and Pandya, T. and Nadhamuni, K. and Ferreira, A. and Yuan, M. and Low, B. and Jain, A. and Ondruska, P. and Omari, S. and Shah, S. and Kulkarni, A. and Kazakova, A. and Tao, C. and Platinsky, L. and Jiang, W. and Shet, V.},
  year = {2019},
  howpublished = {\url{}}
}

Licensing Information

The downloadable Lyft Level 5 AV dataset and included materials are © 2019 Lyft, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

The HD map included with the dataset was developed using data from the OpenStreetMap database which is © OpenStreetMap contributors and available under the ODbL-1.0 license.

The nuScenes devkit is published by nuTonomy under the CC-BY-NC-SA-4.0. Lyft’s forked nuScenes devkit has been modified for use with the Lyft Level 5 AV dataset. Lyft’s modifications are © 2019 Lyft, Inc., and licensed under the same CC-BY-NC-SA-4.0 license governing the original nuScenes devkit.

The CIFAR 10 and 100 datasets

The CIFAR-10 and CIFAR-100 datasets are labelled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

For more information, see


If you use the dataset for scientific work, please cite the following: Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
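The python version of CIFAR-10 stores each batch as a pickled dict with byte-string keys, where b'data' holds the raw uint8 pixel rows (3072 values per image: 1024 red, then green, then blue) and b'labels' the class indices. A minimal loading sketch, assuming that layout:

```python
import pickle

def load_cifar_batch(path):
    """Load one CIFAR-10 python-version batch file (a pickled dict).

    Returns (data, labels): data is the raw per-image pixel rows under
    b'data', labels the class indices under b'labels'. encoding='bytes'
    is needed because the batches were pickled under Python 2.
    """
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    return batch[b"data"], batch[b"labels"]
```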

The LSUN dataset

Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

For more information, see


If you use the dataset for scientific work, please cite the following:

@article{lsun2015,
    Author = {Yu, Fisher and Zhang, Yinda and Song, Shuran and Seff, Ari and Xiao, Jianxiong},
    Title = {LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop},
    Journal = {arXiv preprint arXiv:1506.03365},
    Year = {2015}
}


Below are genetic sequence datasets collected for use with AlphaFold. More information is available at /cephyr/NOBACKUP/Datasets/AlphafoldDatasets/

The datasets have been downloaded to /cephyr/NOBACKUP/Datasets/AlphafoldDatasets using the scripts/ helper script distributed with AlphaFold.



If you find UniProt useful, please consider citing the latest publication:
The UniProt Consortium
UniProt: the universal protein knowledgebase in 2021
Nucleic Acids Res. 49:D1 (2021)
...or choose the publication that best covers the UniProt aspects or components you used in your work:



To cite MGnify, please refer to the following publication:

MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research (2019) doi: 10.1093/nar/gkz1035
Mitchell AL, Almeida A, Beracochea M, Boland M, Burgin J, Cochrane G, Crusoe MR, Kale V, Potter SC, Richardson LJ, Sakharova E, Scheremetjew M, Korobeynikov A, Shlemov A, Kunyavskaya O, Lapidus A and Finn RD.






Refer to