Datasets for training machine learning algorithms

There is a collection of training datasets available under /mimer/NOBACKUP/Datasets.
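
For example, a minimal Python sketch to see which datasets are currently installed (the listing naturally changes over time as datasets are added or removed):

from pathlib import Path

# List the top-level dataset directories that are currently installed.
datasets_root = Path("/mimer/NOBACKUP/Datasets")
for entry in sorted(datasets_root.iterdir()):
    if entry.is_dir():
        print(entry.name)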

Note! The previous path, /cephyr/NOBACKUP/Datasets, will be removed in the future; new datasets are only added to the new location.

IMPORTANT The data can only be used for non-commercial, scientific or educational purposes.

Please pay close attention to the terms of use and the licensing information included for each dataset. In some cases you are required to cite the original work.

By downloading or accessing the data from these storage locations, you agree to the terms and conditions published by the copyright holder of the data. It is the users' responsibility to check the terms of use of the data and to make sure their use case is aligned with what the copyright owner permits. In case explicit terms of use are missing from the provider's side, the NonCommercial-ShareAlike terms of the Creative Commons 3.0 license apply.

If a required, publicly available dataset is missing from the list below, please contact us at support. We will do our best to make datasets centrally available.

The following datasets are currently available:

COCO: large-scale object detection, segmentation, and captioning dataset

COCO has several features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labelled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Find the terms of use at https://cocodataset.org/#termsofuse
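
As an illustration, here is a minimal sketch of loading the COCO annotations with torchvision; the directory layout assumed below (COCO/train2017 and COCO/annotations under the dataset root) is an assumption and should be checked against the actual installation:

from torchvision.datasets import CocoDetection

# Hypothetical paths -- verify the actual layout under /mimer/NOBACKUP/Datasets.
root = "/mimer/NOBACKUP/Datasets/COCO/train2017"
ann_file = "/mimer/NOBACKUP/Datasets/COCO/annotations/instances_train2017.json"

# CocoDetection needs the pycocotools package installed.
dataset = CocoDetection(root=root, annFile=ann_file)
image, targets = dataset[0]  # PIL image and a list of annotation dicts
print(len(dataset), "images;", len(targets), "annotated objects in the first image")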

Objects365

Designed for object detection research with a focus on diverse objects in the wild: http://www.objects365.org/overview.html

- 365 categories
- 2 million images
- 30 million bounding boxes

IMPORTANT The developers require you to cite the following paper if you use this dataset:

@article{Objects365,
  title={Objects365: A Large-scale, High-quality Dataset for Object Detection},
  author={Shuai Shao and Zeming Li and Tianyuan Zhang and Chao Peng and Gang Yu and Jing Li and Xiangyu Zhang and Jian Sun},
  journal={ICCV},
  year={2019}
}

KITTI Vision Benchmark Suite

KITTI is a collection of several datasets related to autonomous driving. You can find more details at http://www.cvlibs.net/datasets/kitti/
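
For the object detection part, torchvision ships a ready-made loader. A minimal sketch, assuming the data has been arranged in the layout torchvision.datasets.Kitti expects (the root path below is an assumption):

from torchvision.datasets import Kitti

# Hypothetical root -- torchvision expects <root>/Kitti/raw/training/{image_2,label_2}/.
dataset = Kitti(root="/mimer/NOBACKUP/Datasets/KITTI", train=True, download=False)
image, target = dataset[0]  # PIL image and a list of per-object dicts (type, bbox, ...)
print(len(dataset), "training samples")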

License

These datasets are published under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License (CC BY-NC-SA). This means that you must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license.

Citation

When using this dataset in your research, we will be happy if you cite us! (or bring us some self-made cake or ice-cream)

For the stereo 2012, flow 2012, odometry, object detection or tracking benchmarks, please cite:

@INPROCEEDINGS{Geiger2012CVPR,
  author = {Andreas Geiger and Philip Lenz and Raquel Urtasun},
  title = {Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2012}
}

For the raw dataset, please cite:

@ARTICLE{Geiger2013IJRR,
  author = {Andreas Geiger and Philip Lenz and Christoph Stiller and Raquel Urtasun},
  title = {Vision meets Robotics: The KITTI Dataset},
  journal = {International Journal of Robotics Research (IJRR)},
  year = {2013}
}

For the road benchmark, please cite:

@INPROCEEDINGS{Fritsch2013ITSC,
  author = {Jannik Fritsch and Tobias Kuehnl and Andreas Geiger},
  title = {A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms},
  booktitle = {International Conference on Intelligent Transportation Systems (ITSC)},
  year = {2013}
}

For the stereo 2015, flow 2015 and scene flow 2015 benchmarks, please cite:

@INPROCEEDINGS{Menze2015CVPR,
  author = {Moritz Menze and Andreas Geiger},
  title = {Object Scene Flow for Autonomous Vehicles},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2015}
}

LibriSpeech ASR corpus

LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned. See https://www.openslr.org/12
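
A minimal sketch of reading the corpus with torchaudio; the root path and the presence of the train-clean-100 split in the expected LibriSpeech/ folder layout are assumptions to verify on the system:

import torchaudio

# Hypothetical root -- torchaudio expects <root>/LibriSpeech/train-clean-100/...
dataset = torchaudio.datasets.LIBRISPEECH(
    root="/mimer/NOBACKUP/Datasets/LibriSpeech",
    url="train-clean-100",
    download=False,
)
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript)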

License

The dataset is licensed under the CC BY 4.0 license.

OpenImages-V6

Dataset of 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localised narratives. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations: https://storage.googleapis.com/openimages/web/factsfigures.html

License

The annotations are licensed by Google LLC under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. Note: we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

Terms-of-Use

IMPORTANT The developers require you to cite the following papers if you use this dataset:

@article{OpenImages,
  author = {Alina Kuznetsova and Hassan Rom and Neil Alldrin and Jasper Uijlings and Ivan Krasin and Jordi Pont-Tuset and Shahab Kamali and Stefan Popov and Matteo Malloci and Alexander Kolesnikov and Tom Duerig and Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale},
  year = {2020},
  journal = {IJCV}
}
@inproceedings{OpenImagesSegmentation,
  author = {Rodrigo Benenson and Stefan Popov and Vittorio Ferrari},
  title = {Large-scale interactive object segmentation with human annotators},
  booktitle = {CVPR},
  year = {2019}
}
@article{OpenImagesLocNarr,
  author  = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title   = {Connecting Vision and Language with Localized Narratives},
  journal = {arXiv},
  volume  = {1912.03098},
  year    = {2019}
}
@article{OpenImages2,
  title={OpenImages: A public dataset for large-scale multi-label and multi-class image classification.},
  author={Krasin, Ivan and Duerig, Tom and Alldrin, Neil and Ferrari, Vittorio and Abu-El-Haija, Sami and Kuznetsova, Alina and Rom, Hassan and Uijlings, Jasper and Popov, Stefan and Kamali, Shahab and Malloci, Matteo and Pont-Tuset, Jordi and Veit, Andreas and Belongie, Serge and Gomes, Victor and Gupta, Abhinav and Sun, Chen and Chechik, Gal and Cai, David and Feng, Zheyun and Narayanan, Dhyanesh and Murphy, Kevin},
  journal={Dataset available from https://storage.googleapis.com/openimages/web/index.html},
  year={2017}
}

YouTube-VOS Instance Segmentation

https://youtube-vos.org/dataset/

Terms-of-Use: The annotations in this dataset belong to the organisers of the challenge and are licensed under a Creative Commons Attribution 4.0 License.

The data is released for non-commercial research purpose only.

The organisers of the dataset as well as their employers make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the organisers, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted videos that he or she may create from the Database. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions. The organisers reserve the right to terminate Researcher's access to the Database at any time. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorised to enter into this agreement on behalf of such employer.

MNIST:

A dataset of 60,000 28x28-pixel grayscale images of handwritten digits. For more information, see https://www.tensorflow.org/datasets/catalog/mnist

We provide the data in formats loadable by torchvision.datasets.MNIST and tensorflow.keras.datasets.mnist, as well as the raw original MNIST data.
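
A minimal sketch of both access paths; the exact paths and the name of the Keras-style archive (mnist.npz) are assumptions to check against the files actually provided:

import numpy as np
import torchvision

# Hypothetical paths -- check the actual MNIST layout under /mimer/NOBACKUP/Datasets.
# torchvision looks for a MNIST/raw/ subfolder below the given root.
train = torchvision.datasets.MNIST(root="/mimer/NOBACKUP/Datasets",
                                   train=True, download=False)

# The Keras-style copy is a plain .npz archive that can also be read directly with numpy.
data = np.load("/mimer/NOBACKUP/Datasets/MNIST/mnist.npz")
x_train, y_train = data["x_train"], data["y_train"]
print(len(train), "training images via torchvision;", x_train.shape, "array via the npz archive")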

Licensing

Yann LeCun and Corinna Cortes hold the copyright of the MNIST dataset, which is a derivative work from the original NIST datasets. The MNIST dataset is made available under the terms of the Creative Commons Attribution-ShareAlike 3.0 license (CC BY-SA 3.0).

Citation

If you use the dataset for scientific work, please cite the following:

@article{lecun2010mnist,
  title={MNIST handwritten digit database},
  author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
  journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
  volume={2},
  year={2010}
}

MPI Sintel Flow Dataset:

  • A data set for the evaluation of optical flow derived from the open source 3D animated short film, Sintel

oVision-Scene-Flow-dataset

  • Pattern Recognition and Image Processing

NuScenes (v1.0)

  • Terms of use: Non-commercial use only. See: https://www.nuscenes.org/terms-of-use
  • Overview: https://www.nuscenes.org/overview
    • Trainval (700+150 scenes) is packaged into 10 different archives that each contain 85 scenes.
    • Test (150 scenes) is used for challenges and does not come with object annotations.
    • Mini (10 scenes) is a subset of trainval used to explore the data without having to download the entire dataset.
    • The meta data is provided separately and includes the annotations, ego vehicle poses, calibration, maps and log information.
  • Citation: please use the following citation when referencing nuScenes:
@article{nuscenes2019,
  title={nuScenes: A multimodal dataset for autonomous driving},
  author={Holger Caesar and Varun Bankiti and Alex H. Lang and Sourabh Vora and
          Venice Erin Liong and Qiang Xu and Anush Krishnan and Yu Pan and
          Giancarlo Baldan and Oscar Beijbom},
  journal={arXiv preprint arXiv:1903.11027},
  year={2019}
}
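
A minimal sketch of opening the Mini split with the nuscenes-devkit (pip install nuscenes-devkit); the dataroot below is an assumption and must point at the directory containing the extracted archives and metadata:

from nuscenes.nuscenes import NuScenes

# Hypothetical dataroot -- the mini split is the cheapest way to explore the data.
nusc = NuScenes(version="v1.0-mini",
                dataroot="/mimer/NOBACKUP/Datasets/NuScenes",
                verbose=True)
scene = nusc.scene[0]
first_sample = nusc.get("sample", scene["first_sample_token"])
print(scene["name"], "-", len(first_sample["data"]), "sensor channels in the first sample")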

MegaDepth (v1) Dataset

  • Overview: https://www.cs.cornell.edu/projects/megadepth/
    • The MegaDepth dataset includes 196 different locations reconstructed from COLMAP SfM/MVS, including the updated images/depth maps at the original resolutions generated from COLMAP MVS.
  • Citation: please use the following citation when referencing MegaDepth:
@inProceedings{MegaDepthLi18,
  title={MegaDepth: Learning Single-View Depth Prediction from Internet Photos},
  author={Zhengqi Li and Noah Snavely},
  booktitle={Computer Vision and Pattern Recognition (CVPR)},
  year={2018}
}

Waymo open dataset (v1.2)

The Waymo Open Dataset is comprised of high-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions.

Places365

Data of Places365-Standard

There are 1.8 million train images from 365 scene categories in the Places365-Standard, which are used to train the Places365 CNNs. There are 50 images per category in the validation set and 900 images per category in the testing set.
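
torchvision provides a loader for this layout. A minimal sketch, where the root path is an assumption and must contain the official file lists and image folders:

from torchvision.datasets import Places365

# Hypothetical root -- expects e.g. places365_train_standard.txt and the image folders.
dataset = Places365(root="/mimer/NOBACKUP/Datasets/Places365",
                    split="train-standard", small=False, download=False)
image, label = dataset[0]
print(len(dataset), "images,", len(dataset.classes), "scene categories")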

Citation

Please cite the following paper if you use Places365 data or CNNs:

B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Terms of use:

By downloading the image data you agree to the following terms:

- You will use the data only for non-commercial research and educational purposes.
- You will NOT distribute the above images.
- Massachusetts Institute of Technology makes no representations or warranties regarding the data, including but not limited to warranties of non-infringement or fitness for a particular purpose.
- You accept full responsibility for your use of the data and shall defend and indemnify Massachusetts Institute of Technology, including its employees, officers and agents, against any and all claims arising from your use of the data, including but not limited to your use of any copies of copyrighted images that you may create from the data.

Lyft Level 5

A comprehensive, large-scale dataset featuring the raw camera and LiDAR sensor inputs as perceived by a fleet of multiple, high-end autonomous vehicles in a bounded geographic area. The dataset also includes high-quality, human-labelled 3D bounding boxes of traffic agents and an underlying HD spatial semantic map. Lyft Level 5 is usefully thought of as two separate subsets: Prediction and Perception.

Prediction

This dataset includes the logs of movement of cars, cyclists, pedestrians, and other traffic agents encountered by our autonomous fleet. These logs come from processing raw lidar, camera, and radar data through the Level 5 team’s perception systems. Read more at https://level-5.global/data/prediction/

Licensing

The downloadable “Level 5 Prediction Dataset” and included semantic map data are ©2021 Woven Planet Holdings, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

The HD map included with the dataset was developed using data from the OpenStreetMap database, which is © OpenStreetMap contributors and is released under the Open Database License (ODbL) v1.0.

The Python software kit developed by Level 5 to read the dataset is available under the Apache license version 2.0.

The geo-tiff files included in the dataset were developed by ©2020 Nearmap Us, Inc. and are available under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

Citation

If you use the dataset for scientific work, please cite the following:

@misc{WovenPlanetHoldings2020,
  title = {One Thousand and One Hours: Self-driving Motion Prediction Dataset},
  author = {Houston, J. and Zuidhof, G. and Bergamini, L. and Ye, Y. and Jain, A. and Omari, S. and Iglovikov, V. and Ondruska, P.},
  year = {2020},
  howpublished = {\url{https://level-5.global/level5/data/}}
}

Perception

A collection of raw camera and lidar sensor data collected by autonomous vehicles, with labels for other cars, pedestrians, traffic lights, and more. This dataset features the raw lidar and camera inputs collected by the Level 5 autonomous fleet within a bounded geographic area. Read more at https://level-5.global/data/perception/

Licensing

The downloadable “Level 5 Perception Dataset” and included materials are ©2021 Woven Planet, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0)

The HD map included with the dataset was developed using data from the OpenStreetMap database, which is © OpenStreetMap contributors and is released under the Open Database License (ODbL) v1.0.

The nuScenes devkit was previously published by nuTonomy under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0), but is currently published under the Apache license version 2.0. Lyft’s forked nuScenes devkit has been modified for use with the Lyft Level 5 AV dataset. Lyft’s modifications are ©2020 Lyft, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

Citation

If you use the dataset for scientific work, please cite the following:

@misc{WovenPlanetHoldings2019,
  title = {Level 5 Perception Dataset 2020},
  author = {Kesten, R. and Usman, M. and Houston, J. and Pandya, T. and Nadhamuni, K. and Ferreira, A. and Yuan, M. and Low, B. and Jain, A. and Ondruska, P. and Omari, S. and Shah, S. and Kulkarni, A. and Kazakova, A. and Tao, C. and Platinsky, L. and Jiang, W. and Shet, V.},
  year = {2019},
  howpublished = {\url{https://level-5.global/level5/data/}}
}

The CIFAR 10 and 100 datasets

The CIFAR-10 and CIFAR-100 datasets are labelled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

For more information, see https://www.cs.toronto.edu/~kriz/cifar.html
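
A minimal sketch of loading both datasets with torchvision; the root path is an assumption and must contain the extracted python-version folders:

from torchvision.datasets import CIFAR10, CIFAR100

# Hypothetical root -- torchvision looks for cifar-10-batches-py/ and cifar-100-python/ here.
root = "/mimer/NOBACKUP/Datasets/CIFAR"
cifar10 = CIFAR10(root=root, train=True, download=False)
cifar100 = CIFAR100(root=root, train=False, download=False)
print(len(cifar10), "CIFAR-10 training images;", len(cifar100), "CIFAR-100 test images")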

Citation

If you use the dataset for scientific work, please cite the following: Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

The LSUN dataset

Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

For more information, see https://www.yf.io/p/lsun
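
torchvision can read the LSUN LMDB directories directly. A minimal sketch, where the root path and the choice of the bedroom training split are assumptions (the lmdb Python package must be installed):

from torchvision.datasets import LSUN

# Hypothetical root -- LSUN is distributed as LMDB folders such as bedroom_train_lmdb/.
dataset = LSUN(root="/mimer/NOBACKUP/Datasets/LSUN", classes=["bedroom_train"])
image, label = dataset[0]
print(len(dataset), "images in the bedroom training split")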

Citation

If you use the dataset for scientific work, please cite the following:

@article{yu15lsun,
    Author = {Yu, Fisher and Zhang, Yinda and Song, Shuran and Seff, Ari and Xiao, Jianxiong},
    Title = {LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop},
    Journal = {arXiv preprint arXiv:1506.03365},
    Year = {2015}
}

AlphafoldDatasets

Below are genetic sequence datasets collected for use with AlphaFold. More information is available at /cephyr/NOBACKUP/Datasets/AlphafoldDatasets/README.md.

The datasets have been downloaded to /cephyr/NOBACKUP/Datasets/AlphafoldDatasets using the scripts/download_all_data.sh helper script available from https://github.com/deepmind/alphafold.

UniRef90

https://www.uniprot.org/help/uniref

Citation

https://www.uniprot.org/help/publications

If you find UniProt useful, please consider citing the latest publication: The UniProt Consortium, UniProt: the universal protein knowledgebase in 2021, Nucleic Acids Res. 49:D1 (2021). Alternatively, choose the publication that best covers the UniProt aspects or components you used in your work.

MGnify

https://www.ebi.ac.uk/metagenomics/

Citation

To cite MGnify, please refer to the following publication:

Mitchell AL, Almeida A, Beracochea M, Boland M, Burgin J, Cochrane G, Crusoe MR, Kale V, Potter SC, Richardson LJ, Sakharova E, Scheremetjew M, Korobeynikov A, Shlemov A, Kunyavskaya O, Lapidus A and Finn RD. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research (2019), doi: 10.1093/nar/gkz1035.

BFD

https://bfd.mmseqs.com/

Uniclust30

https://uniclust.mmseqs.com/

PDB70

http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/

PDB

https://www.rcsb.org/

Citation

Refer to https://www.rcsb.org/pages/policies#References

BigEarthNet-S2

https://bigearth.net/

License: Community Data License Agreement – Permissive, Version 1.0. See https://cdla.dev/permissive-1-0/

Citation

Refer to https://bigearth.net/