Datasets for training machine learning algorithms🔗

There is a collection of training datasets available under /mimer/NOBACKUP/Datasets.

Note! A previous path, /cephyr/NOBACKUP/Datasets, will be removed in the future and new datasets are only added to the new location!

IMPORTANT The data can only be used for non-commercial, scientific or educational purposes.

Please pay close attention to the terms of use and the licensing information included for each dataset. Occasionally, citation to the original work is required.

By downloading or accessing the data from Mimer, you agree to the terms and conditions published by the copyright holder of the data. It is the users' responsibility to check the terms of use of the data and to make sure their use case is aligned with what is permitted by the copyright owner. If explicit terms of use are missing from the provider's side, the NonCommercial-ShareAlike attributes of the Creative Commons 3.0 license apply.

If a required, publicly available dataset is missing from the following list, please contact us at support. We will do our best to make datasets centrally available.

The following datasets are currently available:

Dataloading pipeline🔗

The dataloading pipeline is a crucial component for computational efficiency in supervised training. Some recommendations for Alvis and Mimer are (in order of usefulness):

  1. Reading directly from Mimer is usually fast enough; that way you save the time that would otherwise be spent transferring data to $TMPDIR or /dev/shm.
  2. Many small files are slow to read; prefer to bundle the data into a few larger files (zip, HDF5, TFRecords).
  3. Use compression wisely:
    1. JPEG, PNG, gz, or other files that don't benefit from (further) compression shouldn't be compressed (use zip -0).
    2. Very compressible formats (CSV, mmCIF, txt, ...) can benefit a bit from compression.
  4. When shuffling data, zip is preferred over tar; if you can, shuffle the data points before processing.
  5. Using a profiler is very useful to see what works and which part of the pipeline takes time (see TensorBoard).
  6. Don't spend more time optimising the dataloading pipeline than you'll save from the optimisation.
  7. Batch size, number of preprocessing workers and prefetching strategy can have a significant impact. As a rule of thumb:
    1. If loading data is fast, load it in the current thread (worker 0).
    2. Otherwise, it can be worth some effort to set up prefetching and multiple loading workers.
  8. A100 nodes have the fastest connection to Mimer, but you will only notice this if data loading is actually at the limit of the connection.
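As an illustration of points 2 and 3 above, the sketch below bundles many small samples into a single uncompressed zip archive (the equivalent of `zip -0`) and reads them back through one archive handle; the file names and labels are made up for the example.

```python
import os
import tempfile
import zipfile

# Bundle many small "sample" files into one archive (point 2).
# ZIP_STORED (no compression) suits already-compressed data such as JPEGs
# (point 3.1); ZIP_DEFLATED can help for text-like formats (point 3.2).
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "train.zip")
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    for i in range(100):
        zf.writestr(f"sample_{i:05d}.txt", f"label={i % 10}\n")

# Reading: one open archive handle serves all samples, avoiding
# per-file metadata overhead on the parallel filesystem.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()            # cheap: central directory is read once
    data = zf.read(names[42]).decode()
print(len(names), data.strip())      # 100 label=2
```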

You can find examples of dataloading pipeline implementations on Alvis in our tutorial repository.


BDD100K🔗

BDD100K is a diverse driving dataset for heterogeneous multitask learning.


To cite the dataset in your paper:

    @InProceedings{bdd100k,
    author = {Yu, Fisher and Chen, Haofeng and Wang, Xin and Xian, Wenqi and Chen,
              Yingying and Liu, Fangchen and Madhavan, Vashisht and Darrell, Trevor},
    title = {BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2020}
    }


The code and other resources provided by the BDD100K code repo are under the BSD 3-Clause License.

The downloaded data and labels are under the license below.

Copyright ©2018. The Regents of the University of California (Regents). All Rights Reserved.


Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement; and permission to use, copy, modify and distribute this software for commercial purposes (such rights not subject to transfer) to BDD and BAIR Commons members and their affiliates, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, for commercial licensing opportunities.




License: Community Data License Agreement – Permissive, Version 1.0.



Zenseact Open Dataset🔗

The Zenseact Open Dataset (ZOD) is a large multi-modal autonomous driving dataset developed by a team of researchers at Zenseact. The dataset is split into three categories: Frames, Sequences, and Drives. For more information about the dataset, please refer to our paper (coming soon), or visit our website.


To preserve privacy, the dataset is anonymized. The anonymization is performed by brighterAI, and we provide two separate modes of anonymization: deep fakes (DNAT) and blur. In our paper, we show that the performance of an object detector is not affected by the anonymization method. For more details regarding this experiment, please refer to our paper (coming soon).


If you publish work that uses the Zenseact Open Dataset, please cite (full citation coming soon):

  author = {TODO},
  title = {Zenseact Open Dataset},
  year = {2023},
  publisher = {TODO},
  journal = {TODO},


For questions about the dataset, please Contact Us.

Zenseact is interested in knowing who on Alvis is using the dataset and what use cases they have found. You are encouraged to reach out to them if you are using the dataset.


We welcome contributions to the development kit. If you would like to contribute, please open a pull request.


Dataset: This dataset is the property of Zenseact AB (© 2023 Zenseact AB) and is licensed under CC BY-SA 4.0. Any public use, distribution, or display of this dataset must contain this notice in full:

For this dataset, Zenseact AB has taken all reasonable measures to remove all personally identifiable information, including faces and license plates. If you would like to request the removal of specific images from the dataset, please contact Zenseact.

COCO: large-scale object detection, segmentation, and captioning dataset🔗

COCO has several features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labelled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Find the terms of use on the COCO website.


Objects365🔗

Objects365 is designed for object detection research with a focus on diverse objects in the wild:

- 365 categories
- 2 million images
- 30 million bounding boxes

IMPORTANT The developers require you to cite the following paper if you use this dataset:

  @inproceedings{shao2019objects365,
  title={Objects365: A Large-scale, High-quality Dataset for Object Detection},
  author={Shuai Shao and Zeming Li and Tianyuan Zhang and Chao Peng and Gang Yu and Jing Li and Xiangyu Zhang and Jian Sun},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2019}
  }

KITTI Vision Benchmark Suite🔗

KITTI is a collection of several datasets related to autonomous driving. You can find more details on the KITTI website.
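KITTI object-detection labels are plain-text files with one object per line and 15 whitespace-separated fields: class, truncation, occlusion, alpha, 2D bbox (4 values), 3D dimensions (3), location (3), and rotation_y. A minimal parser might look like the sketch below; the example line is fabricated for illustration.

```python
# Hedged sketch: parse one line of a KITTI object-detection label file.
# Fields (15 per line): type, truncated, occluded, alpha, 2D bbox (4),
# 3D dimensions h/w/l (3), location x/y/z (3), rotation_y.
def parse_kitti_label(line):
    f = line.split()
    return {
        "type": f[0],
        "truncated": float(f[1]),
        "occluded": int(f[2]),
        "alpha": float(f[3]),
        "bbox": [float(v) for v in f[4:8]],
        "dimensions": [float(v) for v in f[8:11]],
        "location": [float(v) for v in f[11:14]],
        "rotation_y": float(f[14]),
    }

# Made-up example line in the KITTI layout.
example = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
obj = parse_kitti_label(example)
print(obj["type"], obj["bbox"][0])   # Car 587.01
```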


These datasets are published under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License (CC BY-NC-SA). This means that you must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license.


When using this dataset in your research, we will be happy if you cite us! (or bring us some self-made cake or ice-cream)

For the stereo 2012, flow 2012, odometry, object detection or tracking benchmarks, please cite:

  @inproceedings{Geiger2012CVPR,
  author = {Andreas Geiger and Philip Lenz and Raquel Urtasun},
  title = {Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2012}
  }

For the raw dataset, please cite:

  @article{Geiger2013IJRR,
  author = {Andreas Geiger and Philip Lenz and Christoph Stiller and Raquel Urtasun},
  title = {Vision meets Robotics: The KITTI Dataset},
  journal = {International Journal of Robotics Research (IJRR)},
  year = {2013}
  }

For the road benchmark, please cite:

  @inproceedings{Fritsch2013ITSC,
  author = {Jannik Fritsch and Tobias Kuehnl and Andreas Geiger},
  title = {A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms},
  booktitle = {International Conference on Intelligent Transportation Systems (ITSC)},
  year = {2013}
  }

For the stereo 2015, flow 2015 and scene flow 2015 benchmarks, please cite:

  @inproceedings{Menze2015CVPR,
  author = {Moritz Menze and Andreas Geiger},
  title = {Object Scene Flow for Autonomous Vehicles},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2015}
  }

LibriSpeech ASR corpus🔗

LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.


The dataset is licensed under the CC BY 4.0 license.

E-GMD: The Expanded Groove MIDI Dataset🔗

Description provided by the dataset authors:

The Expanded Groove MIDI Dataset (E-GMD) is a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. It is also the first human-performed drum transcription dataset with annotations of velocity. It is based on our previously released Groove MIDI Dataset.


This dataset is an expansion of the Groove MIDI Dataset (GMD). GMD is a dataset of human drum performances recorded in MIDI format on a Roland TD-11 electronic drum kit. To make the dataset applicable to ADT, we expanded it by re-recording the GMD sequences on 43 drumkits using a Roland TD-17. The kits range from electronic (e.g., 808, 909) to acoustic sounds. Recording was done at 44.1kHz and 24 bits and aligned within 2ms of the original MIDI files.

We maintained the same train, test and validation splits across sequences that GMD had. Because each kit was recorded for every sequence, we see all 43 kits in the train, test and validation splits.

| Split      | Unique Sequences | Total Sequences | Duration (hours) |
| ---------- | ---------------- | --------------- | ---------------- |
| Train      | 819              | 35217           | 341.4            |
| Test       | 123              | 5289            | 50.9             |
| Validation | 117              | 5031            | 52.2             |
| Total      | 1059             | 45537           | 444.5            |

Given the semi-manual nature of the pipeline, there were some errors in the recording process that resulted in unusable tracks. If your application requires only symbolic drum data, we recommend using the original data from the Groove MIDI Dataset.


The dataset is made available by Google LLC under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

How to Cite🔗


If you use the E-GMD dataset in your work, please cite the paper where it was introduced:

Lee Callender, Curtis Hawthorne, and Jesse Engel. "Improving Perceptual Quality
  of Drum Transcription with the Expanded Groove MIDI Dataset." 2020.

You can also use the following BibTeX entry:

    @misc{callender2020improving,
    title={Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset},
    author={Lee Callender and Curtis Hawthorne and Jesse Engel},
    year={2020}
    }


Synthesized Lakh (Slakh) Dataset🔗

Description from the dataset authors:

The Synthesized Lakh (Slakh) Dataset is a dataset of multi-track audio and aligned MIDI for music source separation and multi-instrument automatic transcription. Individual MIDI tracks are synthesized from the Lakh MIDI Dataset v0.1 using professional-grade sample-based virtual instruments, and the resulting audio is mixed together to make musical mixtures. This release of Slakh, called Slakh2100, contains 2100 automatically mixed tracks and accompanying, aligned MIDI files, synthesized from 187 instrument patches categorized into 34 classes, totaling 145 hours of mixture data.

At a glance🔗

  • The dataset comes as a series of directories named like TrackXXXXX, where XXXXX is a number between 00001 and 02100. This number is the ID of the track. Each Track directory contains exactly 1 mixture, a variable number of audio files for each source that made the mixture, and the MIDI files that were used to synthesize each source. The directory structure is shown here.
  • All audio in Slakh2100 is distributed in the .flac format. Scripts to batch convert are here.
  • All audio is mono and was rendered at 44.1kHz, 16-bit (CD quality) before being converted to .flac.
  • Slakh2100 is a 105 GB download. Unzipped and converted to .wav, Slakh2100 is almost 500 GB. Please plan accordingly.
  • Each mixture has a variable number of sources, with a minimum of 4 sources per mix.
  • Every mix has at least 1 instance of each of the following instrument types: Piano, Guitar, Drums, Bass.
  • metadata.yaml has detailed information about each source. Details about the metadata are here.
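A minimal sketch of walking the layout described above, collecting the TrackXXXXX directory names; the temporary directory here stands in for the real Slakh2100 root on Mimer:

```python
import os
import tempfile

# Hypothetical stand-in for e.g. the Slakh2100 root directory: create a few
# TrackXXXXX directories (IDs run from 00001 to 02100 in the real dataset).
root = tempfile.mkdtemp()
for i in (1, 2, 2100):
    os.makedirs(os.path.join(root, f"Track{i:05d}"))

# Collect track directories by their naming convention.
tracks = sorted(
    d for d in os.listdir(root)
    if d.startswith("Track") and d[5:].isdigit()
)
print(tracks)   # ['Track00001', 'Track00002', 'Track02100']
```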


Creative Commons Attribution 4.0 International.

How to cite🔗

If you use Slakh2100 or generate data using the same method, we ask that you cite it using the following BibTeX entry:

    @inproceedings{manilow2019slakh,
    title={Cutting Music Source Separation Some {Slakh}: A Dataset to Study the Impact of Training Data Quality and Quantity},
    author={Manilow, Ethan and Wichern, Gordon and Seetharaman, Prem and Le Roux, Jonathan},
    booktitle={Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
    year={2019}
    }


Open Images Dataset🔗

Dataset of 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localised narratives. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations.


The annotations are licensed by Google LLC under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. Note: we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.


IMPORTANT The developers require you to cite the following papers if you use this dataset:

  @article{OpenImages,
  author = {Alina Kuznetsova and Hassan Rom and Neil Alldrin and Jasper Uijlings and Ivan Krasin and Jordi Pont-Tuset and Shahab Kamali and Stefan Popov and Matteo Malloci and Alexander Kolesnikov and Tom Duerig and Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale},
  year = {2020},
  journal = {IJCV}
  }

  @inproceedings{OpenImagesSegmentation,
  author = {Rodrigo Benenson and Stefan Popov and Vittorio Ferrari},
  title = {Large-scale interactive object segmentation with human annotators},
  booktitle = {CVPR},
  year = {2019}
  }

  @article{LocalizedNarratives,
  author  = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title   = {Connecting Vision and Language with Localized Narratives},
  journal = {arXiv},
  volume  = {1912.03098},
  year    = {2019}
  }

  @article{OpenImages2,
  title={OpenImages: A public dataset for large-scale multi-label and multi-class image classification.},
  author={Krasin, Ivan and Duerig, Tom and Alldrin, Neil and Ferrari, Vittorio and Abu-El-Haija, Sami and Kuznetsova, Alina and Rom, Hassan and Uijlings, Jasper and Popov, Stefan and Kamali, Shahab and Malloci, Matteo and Pont-Tuset, Jordi and Veit, Andreas and Belongie, Serge and Gomes, Victor and Gupta, Abhinav and Sun, Chen and Chechik, Gal and Cai, David and Feng, Zheyun and Narayanan, Dhyanesh and Murphy, Kevin},
  journal={Dataset available from}
  }

YouTube-VOS Instance Segmentation🔗

Terms-of-Use: The annotations in this dataset belong to the organisers of the challenge and are licensed under a Creative Commons Attribution 4.0 License.

The data is released for non-commercial research purpose only.

The organisers of the dataset as well as their employers make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the organisers, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted videos that he or she may create from the Database. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions. The organisers reserve the right to terminate Researcher's access to the Database at any time. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorised to enter into this agreement on behalf of such employer.


MNIST🔗

A dataset of 60000 28x28-pixel grayscale images of handwritten digits. For more information see the original website.

We provide access to formats loadable by torchvision.datasets.MNIST and tensorflow.keras.datasets.mnist, as well as the raw original MNIST data.
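The raw MNIST image files use the IDX format: a 16-byte big-endian header (magic 0x00000803, image count, rows, columns) followed by one unsigned byte per pixel. A hedged sketch of a parser, exercised here on a synthetic buffer rather than the real files:

```python
import array
import struct

# Minimal parser for the raw MNIST IDX image format: a big-endian header
# (magic 0x00000803, count, rows, cols) followed by unsigned bytes.
def read_idx_images(buf):
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    assert magic == 0x00000803, "not an IDX3 image file"
    pixels = array.array("B", buf[16:])
    return n, rows, cols, pixels

# Synthetic two-image buffer with 2x2 pixels, just to exercise the parser.
fake = struct.pack(">IIII", 0x00000803, 2, 2, 2) + bytes(range(8))
n, rows, cols, pixels = read_idx_images(fake)
print(n, rows, cols, list(pixels))   # 2 2 2 [0, 1, 2, 3, 4, 5, 6, 7]
```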


Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset, which is a derivative work from original NIST datasets. MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license (CC BY-SA 3.0).


If you use the dataset for scientific work, please cite the following:

  @article{lecun2010mnist,
  title={MNIST handwritten digit database},
  author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
  journal={ATT Labs [Online]. Available:}
  }

MPI Sintel Flow Dataset🔗

  • A dataset for the evaluation of optical flow, derived from the open-source 3D animated short film Sintel


  • Pattern Recognition and Image Processing

NuScenes (v1.0)🔗

  • Terms of use: Non-commercial use only. See the nuScenes terms of use.
  • Overview:
    • Trainval (700+150 scenes) is packaged into 10 different archives that each contain 85 scenes.
    • Test (150 scenes) is used for challenges and does not come with object annotations.
    • Mini (10 scenes) is a subset of trainval used to explore the data without having to download the entire dataset.
    • The meta data is provided separately and includes the annotations, ego vehicle poses, calibration, maps and log information.
  • Citation: please use the following citation when referencing nuScenes:
  @article{nuscenes2019,
  title={nuScenes: A multimodal dataset for autonomous driving},
  author={Holger Caesar and Varun Bankiti and Alex H. Lang and Sourabh Vora and
          Venice Erin Liong and Qiang Xu and Anush Krishnan and Yu Pan and
          Giancarlo Baldan and Oscar Beijbom},
  journal={arXiv preprint arXiv:1903.11027},
  year={2019}
  }

MegaDepth (v1) Dataset🔗

  • Overview:
    • The MegaDepth dataset includes 196 different locations reconstructed from COLMAP SfM/MVS. (Update: images/depth maps with original resolutions generated from COLMAP MVS.)
  • Citation: please use the following citation when referencing MegaDepth:
  @inproceedings{li2018megadepth,
  title={MegaDepth: Learning Single-View Depth Prediction from Internet Photos},
  author={Zhengqi Li and Noah Snavely},
  booktitle={Computer Vision and Pattern Recognition (CVPR)},
  year={2018}
  }

Waymo Open Dataset (v1.2)🔗

The Waymo Open Dataset comprises high-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions.


Places365🔗

Data of Places365-Standard:

There are 1.8 million train images from 365 scene categories in the Places365-Standard, which are used to train the Places365 CNNs. There are 50 images per category in the validation set and 900 images per category in the testing set.


Please cite the following paper if you use Places365 data or CNNs:

B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million Image Database for Scene Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Terms of use🔗

By downloading the image data you agree to the following terms:

- You will use the data only for non-commercial research and educational purposes.
- You will NOT distribute the above images.
- Massachusetts Institute of Technology makes no representations or warranties regarding the data, including but not limited to warranties of non-infringement or fitness for a particular purpose.
- You accept full responsibility for your use of the data and shall defend and indemnify Massachusetts Institute of Technology, including its employees, officers and agents, against any and all claims arising from your use of the data, including but not limited to your use of any copies of copyrighted images that you may create from the data.

Lyft Level 5🔗

A comprehensive, large-scale dataset featuring the raw sensor camera and LiDAR inputs as perceived by a fleet of multiple, high-end, autonomous vehicles in a bounded geographic area. This dataset also includes high-quality, human-labelled 3D bounding boxes of traffic agents and an underlying HD spatial semantic map. Lyft is usefully thought of as two separate subsets: Prediction and Perception.


This dataset includes the logs of movement of cars, cyclists, pedestrians, and other traffic agents encountered by our autonomous fleet. These logs come from processing raw lidar, camera, and radar data through the Level 5 team's perception systems. Read more on the Level 5 website.


The downloadable “Level 5 Prediction Dataset” and included semantic map data are ©2021 Woven Planet Holdings, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

The HD map included with the dataset was developed using data from the OpenStreetMap database, which is ©OpenStreetMap contributors and is released under the Open Database License (ODbL) v1.0.

The Python software kit developed by Level 5 to read the dataset is available under the Apache license version 2.0.

The geo-tiff files included in the dataset were developed by ©2020 Nearmap Us, Inc. and are available under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).


If you use the dataset for scientific work, please cite the following:

@misc{woven_planet_2020,
  title = {One Thousand and One Hours: Self-driving Motion Prediction Dataset},
  author = {Houston, J. and Zuidhof, G. and Bergamini, L. and Ye, Y. and Jain, A. and Omari, S. and Iglovikov, V. and Ondruska, P.},
  year = {2020},
  howpublished = {\url{}}
}


A collection of raw sensor camera and lidar data collected by autonomous vehicles on other cars, pedestrians, traffic lights, and more. This dataset features the raw lidar and camera inputs collected by the Level 5 autonomous fleet within a bounded geographic area. Read more on the Level 5 website.


The downloadable “Level 5 Perception Dataset” and included materials are ©2021 Woven Planet, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0)

The HD map included with the dataset was developed using data from the OpenStreetMap database, which is ©OpenStreetMap contributors and is released under the Open Database License (ODbL) v1.0.

The nuScenes devkit was previously published by nuTonomy under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0), but is currently published under the Apache license version 2.0. Lyft’s forked nuScenes devkit has been modified for use with the Lyft Level 5 AV dataset. Lyft’s modifications are ©2020 Lyft, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).


If you use the dataset for scientific work, please cite the following:


@misc{woven_planet_2019,
  title = {Level 5 Perception Dataset 2020},
  author = {Kesten, R. and Usman, M. and Houston, J. and Pandya, T. and Nadhamuni, K. and Ferreira, A. and Yuan, M. and Low, B. and Jain, A. and Ondruska, P. and Omari, S. and Shah, S. and Kulkarni, A. and Kazakova, A. and Tao, C. and Platinsky, L. and Jiang, W. and Shet, V.},
  year = {2019},
  howpublished = {\url{}}
}

The CIFAR 10 and 100 datasets🔗

CIFAR-10 and CIFAR-100 are labelled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

For more information, see the CIFAR website.


If you use the dataset for scientific work, please cite the following: Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

The LSUN dataset🔗

Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

For more information, see the LSUN project website.


If you use the dataset for scientific work, please cite the following:

    @article{yu15lsun,
    Author = {Yu, Fisher and Zhang, Yinda and Song, Shuran and Seff, Ari and Xiao, Jianxiong},
    Title = {LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop},
    Journal = {arXiv preprint arXiv:1506.03365},
    Year = {2015}
    }


AlphaFold datasets🔗

Below are genetic sequence datasets collected for use with AlphaFold. More information is available at /mimer/NOBACKUP/Datasets/AlphafoldDatasets/

The datasets have been downloaded to /mimer/NOBACKUP/Datasets/AlphafoldDatasets using the scripts/ helper scripts from the AlphaFold repository.



If you find UniProt useful, please consider citing our latest publication:

The UniProt Consortium, "UniProt: the universal protein knowledgebase in 2021", Nucleic Acids Res. 49:D1 (2021)

...or choose the publication that best covers the UniProt aspects or components you used in your work.



To cite MGnify, please refer to the following publication:

Mitchell AL, Almeida A, Beracochea M, Boland M, Burgin J, Cochrane G, Crusoe MR, Kale V, Potter SC, Richardson LJ, Sakharova E, Scheremetjew M, Korobeynikov A, Shlemov A, Kunyavskaya O, Lapidus A and Finn RD, "MGnify: the microbiome analysis resource in 2020", Nucleic Acids Research (2019), doi: 10.1093/nar/gkz1035





