Datasets for training machine learning algorithms🔗

There is a collection of training datasets available under /mimer/NOBACKUP/Datasets.

IMPORTANT The data can only be used for non-commercial, scientific or educational purposes.

Please pay close attention to the terms of use and the licensing information included for each dataset. In some cases, citing the original work is required.

By downloading or accessing the data from Mimer, you agree to the terms and conditions published by the copyright holder of the data. The responsibility is on the users to check the terms of use of the data and to make sure their use case is aligned with what is permitted by the copyright owner. In case explicit terms of use are missing from the provider's side, the NonCommercial-ShareAlike attribute of the Creative Commons 3.0 license applies.

If a required, publicly available dataset is missing from the following list, please contact us at support. We will do our best to make datasets centrally available.

The following datasets are currently available:

Dataloading pipeline🔗

The dataloading pipeline is a crucial component for computational efficiency in supervised training. Some recommendations for Alvis and Mimer are (in order of usefulness):

  1. Reading directly from Mimer is usually fast enough; that way you save the time you would otherwise spend transferring data to $TMPDIR or /dev/shm.
  2. Many small files are slow; prefer bundling the data into a few larger files (zip, HDF5, TFRecords).
  3. Use compression wisely:
    1. JPEG, PNG, gz, or other files that don't benefit from (further) compression shouldn't be compressed (use zip -0).
    2. Very compressible formats (csv, mmcif, txt, ...) can benefit a bit from compression.
  4. When shuffling data, zip is preferred over tar; if you can, shuffle before processing the data points.
  5. Using a profiler is very useful for finding out what works and which part of the pipeline takes the most time (see TensorBoard).
  6. Don't spend more time optimising the dataloading pipeline than you'll save from the optimisation.
  7. Batch size, number of preprocessing workers and prefetching strategy can have a significant impact (see the sketch after this list). As a rule of thumb:
    1. If loading data is fast, do the loading in the current thread (0 workers).
    2. Otherwise it can be worth setting up prefetching and multiple loading workers.
  8. A100 nodes have the fastest connection to Mimer, but you will only notice this if data loading is actually your bottleneck.
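
As a minimal sketch of point 7 (not a recommendation for every workload), the snippet below sets up a PyTorch DataLoader with a few preprocessing workers and prefetching; the placeholder dataset and the numbers are assumptions you would replace with your own pipeline and measurements.

# Minimal sketch: tuning batch size, workers and prefetching with PyTorch.
# The dataset below is a placeholder; substitute your own (e.g. one reading
# from a zip archive on Mimer) and tune the numbers with a profiler.
from torch.utils.data import DataLoader, Dataset


class PlaceholderDataset(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return idx  # replace with actual decoding/preprocessing


loader = DataLoader(
    PlaceholderDataset(),
    batch_size=64,       # tune together with the number of workers
    num_workers=4,       # 0 keeps loading in the current thread (point 7.1)
    prefetch_factor=2,   # batches prefetched per worker; only valid when num_workers > 0
    pin_memory=True,     # speeds up host-to-GPU transfers
    shuffle=True,
)

for batch in loader:
    pass  # training step goes here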

You can find examples of dataloading pipeline implementations on Alvis in our tutorial repository.

Argoverse 2🔗

Argoverse 2 is a collection of open-source autonomous driving data and high-definition (HD) maps from six U.S. cities: Austin, Detroit, Miami, Pittsburgh, Palo Alto, and Washington, D.C. This release builds upon the initial launch of Argoverse ("Argoverse 1"), which was among the first data releases of its kind to include HD maps for machine learning and computer vision research.

Argoverse 2 includes four open-source datasets:

  • Argoverse 2 Sensor Dataset: contains 1,000 3D annotated scenarios with lidar, stereo imagery, and ring camera imagery. This dataset improves upon the Argoverse 1 3D Tracking dataset.
  • Argoverse 2 Motion Forecasting Dataset: contains 250,000 scenarios with trajectory data for many object types. This dataset improves upon the Argoverse 1 Motion Forecasting Dataset.
  • Argoverse 2 Lidar Dataset: contains 20,000 unannotated lidar sequences.
  • Argoverse 2 Map Change Dataset: contains 1,000 scenarios, 200 of which depict real-world HD map changes.

Files🔗

The dataset is provided as zip files. To make use of them, use libraries such as zipfile or torchdata to stream data from the packed archives for better performance. See the notes on the dataloading pipeline above.
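
As a rough sketch (the archive path and member names below are placeholders, not the actual Argoverse 2 layout; check the dataset directory for the real file names), you can list and stream members of a zip archive with the built-in zipfile module without unpacking it first:

# Sketch: stream members out of a packed archive without extracting it first.
# The archive path is an assumed placeholder; look under
# /mimer/NOBACKUP/Datasets for the actual Argoverse 2 file names.
from zipfile import ZipFile

with ZipFile("/mimer/NOBACKUP/Datasets/Argoverse2/some_archive.zip") as zf:
    members = [m for m in zf.namelist() if not m.endswith("/")]  # skip directory entries
    print(f"{len(members)} files, first: {members[0]}")

    # Stream a single member as a file-like object instead of
    # extracting the whole archive to disk.
    with zf.open(members[0]) as fh:
        first_bytes = fh.read(1024)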

Terms of use🔗

The dataset is under a "CC BY-NC-SA 4.0" license. See https://www.argoverse.org/about.html#terms-of-use for details.

Citation🔗

Please use the following citations:

@INPROCEEDINGS { Argoverse2,
  author = {Benjamin Wilson and William Qi and Tanmay Agarwal and John Lambert and Jagjeet Singh and Siddhesh Khandelwal and Bowen Pan and Ratnesh Kumar and Andrew Hartnett and Jhony Kaesemodel Pontes and Deva Ramanan and Peter Carr and James Hays},
  title = {Argoverse 2: Next Generation Datasets for Self-driving Perception and Forecasting},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021)},
  year = {2021}
}

@INPROCEEDINGS { TrustButVerify,
  author = {John Lambert and James Hays},
  title = {Trust, but Verify: Cross-Modality Fusion for HD Map Change Detection},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021)},
  year = {2021}
}

BDD100K🔗

BDD100K is a diverse driving dataset for heterogeneous multitask learning.

Citation🔗

To cite the dataset in your paper:

@InProceedings{bdd100k,
    author = {Yu, Fisher and Chen, Haofeng and Wang, Xin and Xian, Wenqi and Chen,
              Yingying and Liu, Fangchen and Madhavan, Vashisht and Darrell, Trevor},
    title = {BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2020}
}

License🔗

The code and other resources provided by the BDD100K code repo are under the BSD 3-Clause License.

The data and labels downloaded from https://bdd-data.berkeley.edu/ are under the License below.

Copyright ©2018. The Regents of the University of California (Regents). All Rights Reserved.

THIS SOFTWARE AND/OR DATA WAS DEPOSITED IN THE BAIR OPEN RESEARCH COMMONS REPOSITORY ON 1/1/2021

Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement; and permission to use, copy, modify and distribute this software for commercial purposes (such rights not subject to transfer) to BDD and BAIR Commons members and their affiliates, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, otl@berkeley.edu, http://ipira.berkeley.edu/industry-info for commercial licensing opportunities.

IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED “AS IS”. REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

BigEarthNet-S2🔗

https://bigearth.net/

License: Community Data License Agreement – Permissive, Version 1.0 (https://cdla.dev/permissive-1-0/)

Citation🔗

Refer to https://bigearth.net/

Zenseact Open Dataset🔗

The Zenseact Open Dataset (ZOD) is a large multi-modal autonomous driving dataset developed by a team of researchers at Zenseact. The dataset is split into three categories: Frames, Sequences, and Drives. For more information about the dataset, please refer to our paper (coming soon), or visit our website.

Anonymization🔗

To preserve privacy, the dataset is anonymized. The anonymization is performed by brighterAI, and we provide two separate modes of anonymization: deep fakes (DNAT) and blur. In our paper, we show that the performance of an object detector is not affected by the anonymization method. For more details regarding this experiment, please refer to our paper (coming soon).

Citation🔗

If you publish work that uses the Zenseact Open Dataset, please cite the following (full citation coming soon):

@misc{zod2021,
  author = {TODO},
  title = {Zenseact Open Dataset},
  year = {2023},
  publisher = {TODO},
  journal = {TODO},
}

Contact🔗

For questions about the dataset, please Contact Us.

Zenseact is interested in knowing who on Alvis is using the dataset and for what use cases. You are encouraged to reach out to them if you are using the dataset.

Contributing🔗

We welcome contributions to the development kit. If you would like to contribute, please open a pull request.

License🔗

Dataset: This dataset is the property of Zenseact AB (© 2023 Zenseact AB) and is licensed under CC BY-SA 4.0. Any public use, distribution, or display of this dataset must contain this notice in full:

For this dataset, Zenseact AB has taken all reasonable measures to remove all personally identifiable information, including faces and license plates. To the extent that you like to request the removal of specific images from the dataset, please contact privacy@zenseact.com.

COCO: large-scale object detection, segmentation, and captioning dataset🔗

COCO has several features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labelled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Find the terms of use at https://cocodataset.org/#termsofuse

Objects365🔗

Designed for object detection research with a focus on diverse objects in the wild: http://www.objects365.org/overview.html

  • 365 categories
  • 2 million images
  • 30 million bounding boxes

IMPORTANT The developers require you to cite the following paper if you use this dataset:

@article{Objects365,
 title={Objects365: A Large-scale, High-quality Dataset for Object Detection},
  author={Shuai Shao and Zeming Li and Tianyuan Zhang and Chao Peng and Gang Yu and Jing Li and Xiangyu Zhang and Jian Sun}, 
  journal={ICCV},
  year={2019}
}

ImageNet🔗

"ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. The data is available for free to researchers for non-commercial use." - https://image-net.org/index.php

Terms of Access🔗

https://image-net.org/download.php

[RESEARCHER_FULLNAME] (the "Researcher") has requested permission to use the ImageNet database (the "Database") at Princeton University and Stanford University. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:

  1. Researcher shall use the Database only for non-commercial research and educational purposes.
  2. Princeton University and Stanford University make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
  3. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the ImageNet team, Princeton University, and Stanford University, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted images that he or she may create from the Database.
  4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
  5. Princeton University and Stanford University reserve the right to terminate Researcher's access to the Database at any time.
  6. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.
  7. The law of the State of New Jersey shall apply to all disputes under this agreement.

To use the data available in hf-cache, you instead have to agree to the following modified terms:

[RESEARCHER_FULLNAME] (the "Researcher") has requested permission to use the ImageNet database (the "Database") at Princeton University and Stanford University. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:

  1. Researcher shall use the Database only for non-commercial research and educational purposes.
  2. Princeton University, Stanford University and Hugging Face make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
  3. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the ImageNet team, Princeton University, Stanford University and Hugging Face, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted images that he or she may create from the Database.
  4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
  5. Princeton University, Stanford University and Hugging Face reserve the right to terminate Researcher's access to the Database at any time.
  6. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.
  7. The law of the State of New Jersey shall apply to all disputes under this agreement.

Citation information🔗

@article{imagenet15russakovsky,
    Author = {Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and Alexander C. Berg and Li Fei-Fei},
    Title = { {ImageNet Large Scale Visual Recognition Challenge} },
    Year = {2015},
    journal   = {International Journal of Computer Vision (IJCV)},
    doi = {10.1007/s11263-015-0816-y},
    volume={115},
    number={3},
    pages={211-252}
}

Getting read access to this dataset on Mimer🔗

Access is granted by joining the "imagenet-license-agreement" group on SUPR. Through this group we can verify that you have accepted the Terms of Access, and you will then be able to read the data under this directory.

N.B.: Group membership is updated on log-in; for the changes to apply on the cluster:

  1. Wait fifteen minutes or so after joining the SUPR group
  2. Log-out and log-in again to the cluster

Using this dataset🔗

This dataset is provided in two forms: as the privacy-aware, face-obfuscated images provided by ImageNet (raw zip files) and through a HuggingFace datasets cache.

Raw files🔗

The raw files for face-blurred ILSVRC are available as zip files. You can read from these archives directly; for example, in Python you can use the built-in zipfile module like this:

import io
import os
import re
from pathlib import Path
from zipfile import ZipFile

from PIL import Image
from torch.utils.data import Dataset


class ImageNetDataset(Dataset):
    def __init__(self, dataroot: str, train: bool = True):
        self.zf = ZipFile(
            os.path.join(
                dataroot,
                f"{'train' if train else 'val'}_blurred.zip",
            )
        )
        self.imglist: list[str] = [
            path for path in self.zf.namelist()
            if path.endswith(".jpg")
        ]

        # Images are structured in directories based on class
        re_classname = re.compile('/(n[0-9]+)/$')
        self.classes: dict[str, int] = {}  # dict of class name and label value
        for name in self.zf.namelist():
            # Match directories and add if not already added
            match = re_classname.search(name)
            if match:
                classname = match.group(1)
                if classname not in self.classes:
                    self.classes[classname] = len(self.classes)

    def get_label(self, path: str) -> int:
        if not path.endswith(".jpg"):
            raise ValueError(f"Expected path to image, got {path}")
        classname: str = os.path.basename(path).split("_")[0]
        return self.classes[classname]

    def __len__(self):
        return len(self.imglist)

    def __getitem__(self, idx: int) -> tuple[Image.Image, int]:
        imgpath = self.imglist[idx]
        img = Image.open(io.BytesIO(self.zf.read(imgpath)))
        label = self.get_label(imgpath)
        return img, label


dataset = ImageNetDataset('/mimer/NOBACKUP/Datasets/ImageNet/Face-blurred_ILSVRC2012-2017')
print(dataset[1032])

Using HuggingFace datasets🔗

This method requires some extra work the first time you use it, but if you're already using HuggingFace datasets it might be worth it.

Requirements:

  1. Get a HuggingFace account
  2. Approve ImageNet agreement at https://huggingface.co/datasets/imagenet-1k
  3. Create an access token with read access: https://huggingface.co/docs/hub/security-tokens
  4. Link the central dataset to your HuggingFace cache (because HuggingFace requires read access to the parent directory of the dataset location)
export HF_HOME="/mimer/NOBACKUP/groups/my-proj/hf-home"  # consider setting this in ~/.bashrc
mkdir -p "$HF_HOME/datasets/imagenet-1k/default/1.0.0/"
ln -s /mimer/NOBACKUP/Datasets/ImageNet/hf-cache/imagenet-1k/1.0.0/* "$HF_HOME/datasets/imagenet-1k/default/1.0.0/"
  5. Load the HF-Datasets module (module load HF-Datasets/), and then you can load the ImageNet dataset in your code like this:
import os
from getpass import getpass

from datasets import load_dataset


# No user has write access to the dataset so we softly disable file-locking
os.environ["HF_USE_SOFTFILELOCK"] = "true"

# This will ask for your HuggingFace access token from step 3
load_dataset("imagenet-1k", token=getpass("Enter HuggingFace access token: "))

KITTI Vision Benchmark Suite🔗

KITTI is a collection of several datasets related to autonomous driving. You can find more details at http://www.cvlibs.net/datasets/kitti/

License🔗

These datasets are published under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License (CC BY-NC-SA). This means that you must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license.

Citation🔗

When using this dataset in your research, we will be happy if you cite us! (or bring us some self-made cake or ice-cream)

For the stereo 2012, flow 2012, odometry, object detection or tracking benchmarks, please cite:

@INPROCEEDINGS{Geiger2012CVPR,
  author = {Andreas Geiger and Philip Lenz and Raquel Urtasun},
  title = {Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2012}
}

For the raw dataset, please cite:

@ARTICLE{Geiger2013IJRR,
  author = {Andreas Geiger and Philip Lenz and Christoph Stiller and Raquel Urtasun},
  title = {Vision meets Robotics: The KITTI Dataset},
  journal = {International Journal of Robotics Research (IJRR)},
  year = {2013}
}

For the road benchmark, please cite:

@INPROCEEDINGS{Fritsch2013ITSC,
  author = {Jannik Fritsch and Tobias Kuehnl and Andreas Geiger},
  title = {A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms},
  booktitle = {International Conference on Intelligent Transportation Systems (ITSC)},
  year = {2013}
}

For the stereo 2015, flow 2015 and scene flow 2015 benchmarks, please cite:

@INPROCEEDINGS{Menze2015CVPR,
  author = {Moritz Menze and Andreas Geiger},
  title = {Object Scene Flow for Autonomous Vehicles},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2015}
}

LibriSpeech ASR corpus🔗

LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned. See https://www.openslr.org/12

License🔗

The dataset is licensed under the CC BY 4.0 license.

E-GMD: The Expanded Groove MIDI Dataset🔗

Description provided from https://magenta.tensorflow.org/datasets/e-gmd:

The Expanded Groove MIDI Dataset (E-GMD) is a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. It is also the first human-performed drum transcription dataset with annotations of velocity. It is based on our previously released Groove MIDI Dataset.

Dataset🔗

This dataset is an expansion of the Groove MIDI Dataset (GMD). GMD is a dataset of human drum performances recorded in MIDI format on a Roland TD-11 electronic drum kit. To make the dataset applicable to ADT, we expanded it by re-recording the GMD sequences on 43 drumkits using a Roland TD-17. The kits range from electronic (e.g., 808, 909) to acoustic sounds. Recording was done at 44.1kHz and 24 bits and aligned within 2ms of the original MIDI files.

We maintained the same train, test and validation splits across sequences that GMD had. Because each kit was recorded for every sequence, we see all 43 kits in the train, test and validation splits.

Split        Unique Sequences   Total Sequences   Duration (hours)
Train        819                35217             341.4
Test         123                5289              50.9
Validation   117                5031              52.2
Total        1059               45537             444.5

Given the semi-manual nature of the pipeline, there were some errors in the recording process that resulted in unusable tracks. If your application requires only symbolic drum data, we recommend using the original data from the Groove MIDI Dataset.

License🔗

The dataset is made available by Google LLC under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

How to Cite🔗

If you use the E-GMD dataset in your work, please cite the paper where it was introduced:

Lee Callender, Curtis Hawthorne, and Jesse Engel. "Improving Perceptual Quality
  of Drum Transcription with the Expanded Groove MIDI Dataset." 2020.
  arXiv:2004.00188.

You can also use the following BibTeX entry:

@misc{callender2020improving,
    title={Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset},
    author={Lee Callender and Curtis Hawthorne and Jesse Engel},
    year={2020},
    eprint={2004.00188},
    archivePrefix={arXiv},
    primaryClass={cs.SD}
}

Slakh2100-Redux🔗

Description from https://zenodo.org/record/4599666:

The Synthesized Lakh (Slakh) Dataset is a dataset of multi-track audio and aligned MIDI for music source separation and multi-instrument automatic transcription. Individual MIDI tracks are synthesized from the Lakh MIDI Dataset v0.1 using professional-grade sample-based virtual instruments, and the resulting audio is mixed together to make musical mixtures. This release of Slakh, called Slakh2100, contains 2100 automatically mixed tracks and accompanying, aligned MIDI files, synthesized from 187 instrument patches categorized into 34 classes, totaling 145 hours of mixture data.

At a glance🔗

  • The dataset comes as a series of directories named like TrackXXXXX, where XXXXX is a number between 00001 and 02100. This number is the ID of the track. Each Track directory contains exactly 1 mixture, a variable number of audio files for each source that made the mixture, and the MIDI files that were used to synthesize each source. The directory structure is shown here.
  • All audio in Slakh2100 is distributed in the .flac format. Scripts to batch convert are here.
  • All audio is mono and was rendered at 44.1kHz, 16-bit (CD quality) before being converted to .flac.
  • Slakh2100 is a 105 GB download. Unzipped and converted to .wav, Slakh2100 is almost 500 GB. Please plan accordingly.
  • Each mixture has a variable number of sources, with a minimum of 4 sources per mix.
  • Every mix has at least 1 instance of each of the following instrument types: Piano, Guitar, Drums, Bass.
  • metadata.yaml has detailed information about each source. Details about the metadata are here (see the loading sketch after this list).
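
The sketch below shows one way to walk the Track directories, reading metadata.yaml (mentioned above) and the mixture audio. The dataset root and the mixture file name (mix.flac) are assumptions; verify them against the actual layout on Mimer.

# Sketch: iterate over Slakh2100 tracks. The root path and the "mix.flac"
# file name are assumptions -- check the real layout under
# /mimer/NOBACKUP/Datasets before relying on this.
from pathlib import Path

import soundfile as sf  # reads .flac files
import yaml

root = Path("/mimer/NOBACKUP/Datasets/Slakh2100")  # assumed location

for track_dir in sorted(p for p in root.rglob("Track*") if p.is_dir()):
    with open(track_dir / "metadata.yaml") as fh:
        metadata = yaml.safe_load(fh)

    audio, sample_rate = sf.read(str(track_dir / "mix.flac"))  # mono, 44.1 kHz
    print(track_dir.name, sample_rate, audio.shape, list(metadata)[:3])
    break  # remove to process every track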

License🔗

Creative Commons Attribution 4.0 International.

How to cite🔗

If you use Slakh2100 or generate data using the same method, we ask that you cite it using the following BibTeX entry:

@inproceedings{manilow2019cutting,
    title={Cutting Music Source Separation Some {Slakh}: A Dataset to Study the Impact of Training Data Quality and Quantity},
    author={Manilow, Ethan and Wichern, Gordon and Seetharaman, Prem and Le Roux, Jonathan},
    booktitle={Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
    year={2019},
    organization={IEEE}
}

OpenImages-V6🔗

Dataset of 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localised narratives. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations: https://storage.googleapis.com/openimages/web/factsfigures.html

License🔗

The annotations are licensed by Google LLC under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. Note: we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

Terms-of-Use🔗

IMPORTANT The developers require you to cite the following papers if you use this dataset:

@article{OpenImages,
  author = {Alina Kuznetsova and Hassan Rom and Neil Alldrin and Jasper Uijlings and Ivan Krasin and Jordi Pont-Tuset and Shahab Kamali and Stefan Popov and Matteo Malloci and Alexander Kolesnikov and Tom Duerig and Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale},
  year = {2020},
  journal = {IJCV}
}
@inproceedings{OpenImagesSegmentation,
  author = {Rodrigo Benenson and Stefan Popov and Vittorio Ferrari},
  title = {Large-scale interactive object segmentation with human annotators},
  booktitle = {CVPR},
  year = {2019}
}
@article{OpenImagesLocNarr,
  author  = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title   = {Connecting Vision and Language with Localized Narratives},
  journal = {arXiv},
  volume  = {1912.03098},
  year    = {2019}
}
@article{OpenImages2,
  title={OpenImages: A public dataset for large-scale multi-label and multi-class image classification.},
  author={Krasin, Ivan and Duerig, Tom and Alldrin, Neil and Ferrari, Vittorio and Abu-El-Haija, Sami and Kuznetsova, Alina and Rom, Hassan and Uijlings, Jasper and Popov, Stefan and Kamali, Shahab and Malloci, Matteo and Pont-Tuset, Jordi and Veit, Andreas and Belongie, Serge and Gomes, Victor and Gupta, Abhinav and Sun, Chen and Chechik, Gal and Cai, David and Feng, Zheyun and Narayanan, Dhyanesh and Murphy, Kevin},
  journal={Dataset available from https://storage.googleapis.com/openimages/web/index.html},
  year={2017}
}

YouTube-VOS Instance Segmentation🔗

https://youtube-vos.org/dataset/

Terms-of-Use: The annotations in this dataset belong to the organisers of the challenge and are licensed under a Creative Commons Attribution 4.0 License.

The data is released for non-commercial research purpose only.

The organisers of the dataset as well as their employers make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the organisers, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted videos that he or she may create from the Database. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions. The organisers reserve the right to terminate Researcher's access to the Database at any time. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorised to enter into this agreement on behalf of such employer.

MNIST:🔗

A dataset of 60000 28x28-pixel grayscale images of handwritten digits. For more information, see https://www.tensorflow.org/datasets/catalog/mnist

We provide access to formats loadable by torchvision.datasets.MNIST and tensorflow.keras.datasets.mnist, as well as the raw original MNIST data.
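
For example, with torchvision the data can be pointed to directly; the root path below is an assumption, so check the MNIST directory under /mimer/NOBACKUP/Datasets for the layout torchvision expects:

# Sketch: load the centrally stored MNIST copy with torchvision.
# The root path is an assumption -- it should point at the directory that
# contains the torchvision-format files; keep download=False.
from torchvision import datasets, transforms

mnist_train = datasets.MNIST(
    root="/mimer/NOBACKUP/Datasets/MNIST",  # assumed location
    train=True,
    download=False,                         # the data is already on Mimer
    transform=transforms.ToTensor(),
)

image, label = mnist_train[0]
print(image.shape, label)  # torch.Size([1, 28, 28]) and an integer label in 0-9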

Licensing🔗

Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset, which is a derivative work from original NIST datasets. MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license (CC BY-SA 3.0).

Citation🔗

If you use the dataset for scientific work, please cite the following:

@article{lecun2010mnist,
  title={MNIST handwritten digit database},
  author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
  journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
  volume={2},
  year={2010}
}

MPI Sintel Flow Dataset:🔗

  • A dataset for the evaluation of optical flow, derived from the open-source 3D animated short film Sintel.

oVision-Scene-Flow-dataset🔗

  • Pattern Recognition and Image Processing

NuScenes (v1.0)🔗

  • Terms of use: Non-commercial use only. See: https://www.nuscenes.org/terms-of-use
  • Overview: https://www.nuscenes.org/overview
    • Trainval (700+150 scenes) is packaged into 10 different archives that each contain 85 scenes.
    • Test (150 scenes) is used for challenges and does not come with object annotations.
    • Mini (10 scenes) is a subset of trainval used to explore the data without having to download the entire dataset.
    • The metadata is provided separately and includes the annotations, ego vehicle poses, calibration, maps and log information (a loading sketch with the nuscenes-devkit follows the citation below).
  • Citation: please use the following citation when referencing nuScenes:
@article{nuscenes2019,
 title={nuScenes: A multimodal dataset for autonomous driving},
  author={Holger Caesar and Varun Bankiti and Alex H. Lang and Sourabh Vora and 
          Venice Erin Liong and Qiang Xu and Anush Krishnan and Yu Pan and 
          Giancarlo Baldan and Oscar Beijbom}, 
  journal={arXiv preprint arXiv:1903.11027},
  year={2019}
}
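
If you use the official nuscenes-devkit, a minimal sketch for pointing it at the data might look like the following; the dataroot path is an assumption and must contain the metadata together with the unpacked archives for the chosen version.

# Sketch: open nuScenes with the official nuscenes-devkit (pip install nuscenes-devkit).
# The dataroot is an assumption -- it should contain the metadata and the
# unpacked sensor archives for the chosen version.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(
    version="v1.0-mini",                           # or "v1.0-trainval" / "v1.0-test"
    dataroot="/mimer/NOBACKUP/Datasets/NuScenes",  # assumed location
    verbose=True,
)

first_scene = nusc.scene[0]  # list of scene records
first_sample = nusc.get("sample", first_scene["first_sample_token"])
print(first_scene["name"], len(first_sample["data"]), "sensor channels")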

MegaDepth (v1) Dataset🔗

  • Overview: https://www.cs.cornell.edu/projects/megadepth/
    • The MegaDepth dataset includes 196 different locations reconstructed from COLMAP SfM/MVS. (Update: images and depth maps with original resolutions generated from COLMAP MVS.)
  • Citation: please use the following citation when referencing MegaDepth:
@inProceedings{MegaDepthLi18,
  title={MegaDepth: Learning Single-View Depth Prediction from Internet Photos},
  author={Zhengqi Li and Noah Snavely},
  booktitle={Computer Vision and Pattern Recognition (CVPR)},
  year={2018}
}

Waymo open dataset (v1.2)🔗

The Waymo Open Dataset is composed of high-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions.

Places365🔗

Data of Places365-Standard

There are 1.8 million train images from 365 scene categories in the Places365-Standard, which are used to train the Places365 CNNs. There are 50 images per category in the validation set and 900 images per category in the testing set.

Citation🔗

Please cite the following paper if you use Places365 data or CNNs:

Places: A 10 Million Image Database for Scene Recognition. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Terms of use:🔗

By downloading the image data you agree to the following terms:

- You will use the data only for non-commercial research and educational purposes.
- You will NOT distribute the above images.
- Massachusetts Institute of Technology makes no representations or warranties regarding the data, including but not limited to warranties of non-infringement or fitness for a particular purpose.
- You accept full responsibility for your use of the data and shall defend and indemnify Massachusetts Institute of Technology, including its employees, officers and agents, against any and all claims arising from your use of the data, including but not limited to your use of any copies of copyrighted images that you may create from the data.

Lyft Level 5🔗

A comprehensive, large-scale dataset featuring the raw sensor camera and LiDAR inputs as perceived by a fleet of multiple, high-end, autonomous vehicles in a bounded geographic area. This dataset also includes high-quality, human-labelled 3D bounding boxes of traffic agents and an underlying HD spatial semantic map. Lyft Level 5 is usefully thought of as two separate subsets: Prediction and Perception.

Prediction🔗

This dataset includes the logs of movement of cars, cyclists, pedestrians, and other traffic agents encountered by our autonomous fleet. These logs come from processing raw lidar, camera, and radar data through the Level 5 team’s perception systems. Read more at https://level-5.global/data/prediction/

Licensing🔗

The downloadable “Level 5 Prediction Dataset” and included semantic map data are ©2021 Woven Planet Holdings, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

The HD map included with the dataset was developed using data from the OpenStreetMap database, which is ©OpenStreetMap contributors and is released under the Open Database License (ODbL) v1.0.

The Python software kit developed by Level 5 to read the dataset is available under the Apache license version 2.0.

The geo-tiff files included in the dataset were developed by ©2020 Nearmap Us, Inc. and are available under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

Citation🔗

If you use the dataset for scientific work, please cite the following:

@misc{WovenPlanetPrediction2020,
  title = {One Thousand and One Hours: Self-driving Motion Prediction Dataset},
  author = {Houston, J. and Zuidhof, G. and Bergamini, L. and Ye, Y. and Jain, A. and Omari, S. and Iglovikov, V. and Ondruska, P.},
  year = {2020},
  howpublished = {\url{https://level-5.global/level5/data/}}
}

Perception🔗

A collection of raw sensor camera and lidar data collected from autonomous vehicles on other cars, pedestrians, traffic lights, and more. This dataset features the raw lidar and camera inputs collected by the Level 5 autonomous fleet within a bounded geographic area. Read more at https://level-5.global/data/perception/

Licensing🔗

The downloadable “Level 5 Perception Dataset” and included materials are ©2021 Woven Planet, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0)

The HD map included with the dataset was developed using data from the OpenStreetMap database, which is ©OpenStreetMap contributors and is released under the Open Database License (ODbL) v1.0.

The nuScenes devkit was previously published by nuTonomy under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0), but is currently published under the Apache license version 2.0. Lyft’s forked nuScenes devkit has been modified for use with the Lyft Level 5 AV dataset. Lyft’s modifications are ©2020 Lyft, Inc., and licensed under version 4.0 of the Creative Commons Attribution-NonCommercial-ShareAlike license (CC-BY-NC-SA-4.0).

Citation🔗

If you use the dataset for scientific work, please cite the following:

@misc{WovenPlanetPerception2019,
  title = {Level 5 Perception Dataset 2020},
  author = {Kesten, R. and Usman, M. and Houston, J. and Pandya, T. and Nadhamuni, K. and Ferreira, A. and Yuan, M. and Low, B. and Jain, A. and Ondruska, P. and Omari, S. and Shah, S. and Kulkarni, A. and Kazakova, A. and Tao, C. and Platinsky, L. and Jiang, W. and Shet, V.},
  year = {2019},
  howpublished = {\url{https://level-5.global/level5/data/}}
}

The CIFAR 10 and 100 datasets🔗

The CIFAR-10 and CIFAR-100 datasets are labelled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

For more information, see https://www.cs.toronto.edu/~kriz/cifar.html

Citation🔗

If you use the dataset for scientific work, please cite the following: Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

The LSUN dataset🔗

Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

For more information, see https://www.yf.io/p/lsun

Citation🔗

If you use the dataset for scientific work, please cite the following:

@article{yu15lsun,
    Author = {Yu, Fisher and Zhang, Yinda and Song, Shuran and Seff, Ari and Xiao, Jianxiong},
    Title = {LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop},
    Journal = {arXiv preprint arXiv:1506.03365},
    Year = {2015}
}

AlphafoldDatasets🔗

Below are genetic sequence datasets collected for use with AlphaFold. More information is available at /mimer/NOBACKUP/Datasets/AlphafoldDatasets/README.md.

The datasets have been downloaded to /mimer/NOBACKUP/Datasets/AlphafoldDatasets using the scripts/download_all_data.sh helper script available from https://github.com/deepmind/alphafold.

UniRef90🔗

https://www.uniprot.org/help/uniref

Citation🔗

https://www.uniprot.org/help/publications

If you find UniProt useful, please consider citing our latest publication:

The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49:D1 (2021).

...or choose the publication that best covers the UniProt aspects or components you used in your work.

MGnify🔗

https://www.ebi.ac.uk/metagenomics/

Citation🔗

To cite MGnify, please refer to the following publication:

Mitchell AL, Almeida A, Beracochea M, Boland M, Burgin J, Cochrane G, Crusoe MR, Kale V, Potter SC, Richardson LJ, Sakharova E, Scheremetjew M, Korobeynikov A, Shlemov A, Kunyavskaya O, Lapidus A and Finn RD. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research (2019). doi: 10.1093/nar/gkz1035

BFD🔗

https://bfd.mmseqs.com/

Uniclust30🔗

https://uniclust.mmseqs.com/

PDB70🔗

http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/

PDB🔗

https://www.rcsb.org/

Citation🔗

Refer to https://www.rcsb.org/pages/policies#References