
Bulk data transfer to and from Alvis

Larger transfers to and from Alvis should go through the alvis2 login node, which is the dedicated data transfer node on Alvis.

ADDS - the Alvis Data Downloader Service

The ADDS system is a service offered on Alvis which allows you to submit background tasks to download datasets consisting of a large number of individual files over HTTP/HTTPS to local storage.

If the data is stored on a different kind of storage resource, such as Azure, S3, or another cluster reachable over SSH/SFTP, then a tool like rclone is more suitable.

Data transfer jobs run in the background on a storage login node. The user-facing interface is the command line tool addsctl.

Datasets

A dataset consists of a set of URLs to individual data files that are to be downloaded into a directory on local storage. The tools on offer accept text files with one URL per line and prepare tasks for the downloader backend to work through.

The addsctl tool

For command line access on the I/O login node alvis2 there is an addsctl tool that can report the status of pending and completed tasks, as well as convert file listings for a dataset into a download request. The tool has several subcommands.

addsctl request DATASET BASEDIR URLFILE

Schedule a download of a dataset into a given directory on cluster storage. All files listed in the URL list file provided on the command line are downloaded.

The URL list file should contain one plaintext URL per line. Note that all files are collected in a single directory, so if multiple links have the same filename, the resulting files will collide.

Links are deduplicated and fed to the download backend. Progress information can be queried with the status subcommand or by looking in the book-keeping directories outlined below. Files are stored directly in the dataset directory without any nested directories.
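Because colliding filenames are easy to miss in a long URL list, it can help to check the list before submitting it. The following is a minimal sketch, not part of ADDS. It assumes that the stored filename is the last path component of the URL with any query string or fragment removed, which may differ from how the downloader actually derives names. The function name find_colliding_names is hypothetical.

```shell
# Print every filename that would be produced by more than one distinct URL.
# Assumption: the filename is the last path component of the URL,
# without ?query or #fragment. ADDS may derive names differently.
find_colliding_names() {
    # 1. sort -u: drop identical links (they are deduplicated anyway)
    # 2. sed: remove blank lines, strip query/fragment, keep the last path component
    # 3. sort | uniq -d: print each name that occurs more than once
    sort -u "$1" \
        | sed -e '/^[[:space:]]*$/d' -e 's/[?#].*$//' -e 's|.*/||' \
        | sort \
        | uniq -d
}
```

Calling find_colliding_names with the path to the URL list file prints one colliding filename per line, and prints nothing if no collisions are found under the stated assumption.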

addsctl status [DATASET]

Reports approximate status for the download of a specific dataset or for all known datasets.

State directory structure for advanced users

The downloader state is kept in a hidden directory .alvis-data-downloader under the user home directory.

Request information is stored in JSON files that move across the directories as downloading progresses and can be directly inspected if the frontend tools are not sufficient.
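For direct inspection, a recursive listing of the state directory shows which directory each request file currently sits in. This is a minimal sketch. It assumes the request files carry a .json extension, and since the names of the subdirectories are not given here, it simply searches recursively. The function name list_adds_requests is hypothetical.

```shell
# List request files below the ADDS state directory, sorted by path so that
# files in the same directory are grouped together.
# Assumption: request files end in .json.
list_adds_requests() {
    # Default to the hidden state directory in the user's home directory;
    # an alternative directory can be passed as the first argument.
    state_dir="${1:-$HOME/.alvis-data-downloader}"
    find "$state_dir" -type f -name '*.json' | sort
}
```

An individual file from the listing can then be pretty-printed with, for example, python3 -m json.tool followed by the file path.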

rclone

rclone is a tool for copying data to, from, and between various cloud storage services. It comes with a large number of configuration templates for common services, like those offered by Microsoft, Google and Amazon. Creating a configuration with authentication credentials is, for most services, easiest in an interactive desktop session, which you can access via the Alvis portal. Before you start, locate the rclone documentation page for the specific service you are using. Then open an OnDemand interactive desktop session, open a terminal, and type

$ rclone config

Enter n for "new", and choose a name for your configuration (such as the name of the service). For the prompts that follow, refer to the rclone documentation for your service. If the service you are using has already provided you with an authentication token, you can enter it directly. Otherwise, in most cases rclone will prompt you to open a link in a browser and log in to the service, in order to obtain the necessary credentials.

Once you have set up the configuration, you can check the files and directories on the cloud service using

$ rclone lsf NAME_OF_YOUR_CONFIGURATION:/
my_directory/
my_file.txt

To see the detailed contents of a specific directory, you can use

$ rclone lsl NAME_OF_YOUR_CONFIGURATION:/my_directory
1604608 2024-04-04 02:23:12.0000000 my_other_file.txt

To copy files from the service to the local file area, use

$ rclone copy NAME_OF_YOUR_CONFIGURATION:/my_directory /PATH/TO/YOUR/LOCAL/STORAGE

and vice versa (assuming you have configured rclone for both retrieving and uploading data):

$ rclone copy /PATH/TO/YOUR/LOCAL/STORAGE NAME_OF_YOUR_CONFIGURATION:/my_directory
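For larger transfers it can be useful to preview a copy before running it and to verify it afterwards. The sketch below wraps the copy command shown above in three steps using the rclone options --dry-run and --progress and the rclone check subcommand with --one-way. The function name rclone_copy_verified is hypothetical, and the source and destination arguments are placeholders of the same form as in the examples above.

```shell
# Copy SRC to DST with rclone in three steps: preview, transfer, verify.
# Usage: rclone_copy_verified SRC DST
rclone_copy_verified() {
    src="$1"
    dst="$2"
    # --dry-run reports what would be transferred without copying anything
    rclone copy --dry-run "$src" "$dst" || return 1
    # the actual transfer; --progress prints live transfer statistics
    rclone copy --progress "$src" "$dst" || return 1
    # check that every source file exists and matches at the destination;
    # --one-way ignores extra files that only exist at the destination
    rclone check --one-way "$src" "$dst"
}
```

The function stops at the first failing step, so the verification only runs after a successful transfer.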

For further details, consult man rclone, the online documentation, or type rclone --help.

rrsync - restricting ssh key usage

rrsync is a utility which allows you to restrict SSH private-public key pairs to only function for a certain subset of rsync operations. A typical case would be that you have a folder of important data, which you are worried about accidentally overwriting. Moreover, you have other important folders, and you don't want to accidentally overwrite those either. So you want to make sure that your usage is restricted to reading from a specific folder, and writing to a specific folder.

For the purpose of this example, we will assume you have a folder on the mimer system, e.g. /mimer/NOBACKUP/groups/groupname, and in there two sub-folders, write_to and read_from. Our goal is to set up a configuration whereby there is no risk of accidentally overwriting files outside of write_to, or reading any other files than those in read_from.

To do this, start by creating two SSH keys on your local computer:

$ ssh-keygen -f .ssh/write_key
$ ssh-keygen -f .ssh/read_key

Print the public keys and/or copy them by your preferred means to alvis1 or alvis2:

$ cat .ssh/write_key.pub
$ cat .ssh/read_key.pub

If it doesn't already exist, create a file called authorized_keys in the .ssh directory under your home directory on alvis2. Paste the two public keys in there, so that your authorized_keys looks something like

ssh-rsa AAAAA...username@local_computer
ssh-rsa AAAAA...username@local_computer

We can test that these keys work by creating a regular ssh configuration on your computer. Do the following:

$ touch .ssh/config
$ chmod 600 .ssh/config

Then edit .ssh/config with a text editor like vim, and add the following, replacing the value under User with your CID:

Host alvis2-write
    HostName 129.16.125.131
    User YOUR_CID
    IdentitiesOnly yes
    IdentityFile ~/.ssh/write_key
    PasswordAuthentication no

Host alvis2-read
    HostName 129.16.125.131
    User YOUR_CID
    IdentitiesOnly yes
    IdentityFile ~/.ssh/read_key
    PasswordAuthentication no

Test that your new configuration works by running ssh alvis2-read and ssh alvis2-write from your local computer.

Then go into .ssh/authorized_keys on alvis2, and modify it as follows:

command="/usr/bin/rrsync -wo /mimer/NOBACKUP/groups/groupname/write_to" ssh-rsa AAAAA...username@local_computer
command="/usr/bin/rrsync -ro /mimer/NOBACKUP/groups/groupname/read_from" ssh-rsa AAAAA...username@local_computer

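The command part restricts the first key to write-only access and the second key to read-only access, each confined to its own folder. rrsync interprets the remote path after the colon relative to that folder. You should now be able to write to /mimer/NOBACKUP/groups/groupname/write_to by running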

$ rsync -av /folder/on/local/computer alvis2-write:

and read from /mimer/NOBACKUP/groups/groupname/read_from by running

$ rsync -av alvis2-read: /folder/on/local/computer

To test that each key is restricted to its intended operation, try the transfers in the opposite direction:

$ rsync -av alvis2-write: /folder/on/local/computer
$ rsync -av /folder/on/local/computer alvis2-read:

If everything has been set up correctly, both of these operations should fail.
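The restricted authorized_keys entries can also be generated from the public key files instead of being edited by hand, which avoids quoting mistakes. The following is a minimal sketch. The function name rrsync_entry is hypothetical, the rrsync path /usr/bin/rrsync is the one used above, and the directory is assumed not to contain spaces or double quotes.

```shell
# Print an authorized_keys entry that restricts a public key to rrsync.
# Usage: rrsync_entry MODE DIRECTORY PUBKEY_FILE
#   MODE is ro (read-only) or wo (write-only).
# Assumption: DIRECTORY contains no spaces or double quotes.
rrsync_entry() {
    mode="$1"
    dir="$2"
    pubkey_file="$3"
    case "$mode" in
        ro|wo) ;;
        *) echo "mode must be ro or wo" >&2; return 1 ;;
    esac
    # Prefix the public key line with the forced rrsync command
    printf 'command="/usr/bin/rrsync -%s %s" %s\n' "$mode" "$dir" "$(cat "$pubkey_file")"
}
```

For example, running rrsync_entry ro /mimer/NOBACKUP/groups/groupname/read_from read_key.pub on alvis2, with the public key copied there, prints a line that can be appended to .ssh/authorized_keys.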