Bulk data transfer to and from Alvis¶
Larger transfers to and from Alvis should go through the alvis2 login node, which is the dedicated data transfer node on Alvis.
ADDS - the Alvis Data Downloader Service¶
The ADDS system is a service offered on Alvis which allows you to submit background tasks to download datasets consisting of a large number of individual files over HTTP/HTTPS to local storage.
If the data is stored on a different kind of storage resource, such as Azure, S3, or another cluster reachable over SSH/SFTP, then a tool like rclone is more suitable.
Data transfer jobs run in the background on a storage login node. The user-facing interface is the command line tool addsctl.
Datasets¶
A dataset consists of a set of URLs to individual data files which shall be downloaded into a directory on local storage. The tools on offer accept text files with one URL per line and prepare tasks for the downloader backend to work through.
The addsctl tool¶
For command line access on the I/O login node alvis2 there is an addsctl tool that can report the status of pending and completed tasks, as well as convert file listings for a dataset into a download request. The tool has several subcommands.
addsctl request DATASET BASEDIR URLFILE¶
Schedule a download of a dataset into a given directory on cluster storage. All files in the URL list file provided on the command line are downloaded.
The URL list file should contain one plaintext URL per line. Files are stored directly in the dataset directory without any nested directories, so if multiple links have the same filename, the resulting files will collide.
Links are deduplicated and fed to the download backend. Progress information can be queried with the status subcommand or by looking in the book-keeping directories outlined below.
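Because colliding filenames are only noticed once files overwrite each other, it can help to check a URL list before submitting it. The following is a minimal sketch using standard shell tools; the example.org URLs, the dataset name and the target directory are placeholders, and query strings in URLs are not handled.

```shell
# Hypothetical URL list with one URL per line (placeholder URLs).
urlfile=$(mktemp)
printf '%s\n' \
  'https://example.org/a/train.tar' \
  'https://example.org/b/train.tar' \
  'https://example.org/a/val.tar' > "$urlfile"

# Strip everything up to the last slash, then list names that occur more than once.
collisions=$(sed 's|.*/||' "$urlfile" | sort | uniq -d)
echo "colliding names: ${collisions:-none}"

# If no collisions are reported, submit the list on alvis2
# (dataset name and directory below are placeholders):
# addsctl request my_dataset /path/to/basedir "$urlfile"
```

With the sample list above, train.tar is reported because it appears under two different paths.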
addsctl status [DATASET]¶
Reports approximate status for the download of a specific dataset or for all known datasets.
State directory structure for advanced users¶
The downloader state is kept in a hidden directory .alvis-data-downloader under the user home directory.
Request information is stored in JSON files that move across the directories as downloading progresses and can be directly inspected if the frontend tools are not sufficient.
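As a rough way to see where requests currently sit, the JSON files can be counted per subdirectory. This is only a sketch: count_requests is a hypothetical helper, not part of ADDS, and it assumes the JSON files lie directly inside the subdirectories of the state directory. No subdirectory names are assumed.

```shell
# count_requests DIR: print "<subdirectory> <number of .json files>" for each
# subdirectory of DIR. Prints nothing if DIR has no subdirectories.
count_requests() {
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    n=$(find "$d" -maxdepth 1 -type f -name '*.json' | wc -l | tr -d ' ')
    printf '%s %s\n' "$(basename "$d")" "$n"
  done
}

count_requests "$HOME/.alvis-data-downloader"
```

Individual request files can then be read with any text viewer, such as less.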
rclone¶
rclone is a tool for copying data to, from, and between various cloud storage services. It comes with a large number of configuration templates for common services, like those offered by Microsoft, Google and Amazon.
Creating a configuration with authentication credentials is, for most services, easiest in an interactive desktop session, which you can access via the Alvis portal.
Before you start, locate the rclone documentation pages for usage of the specific service you are using.
Then, open an OnDemand interactive desktop session, open a terminal, and type
$ rclone config
Enter n for "new", and choose a name for your configuration (such as the name of the service). For the remaining prompts, refer to the rclone documentation for your service.
If the service you are using has already provided you with an authentication token, you can enter it directly. Otherwise, in most cases rclone will prompt you to open a link in a browser and log in to the service, in order to obtain the necessary credentials.
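Once you have set up the configuration, you can check the files and directories on the cloud service with rclone's listing subcommands, such as rclone lsd (directories only) or rclone lsf (files and directories).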
To see the detailed contents of a specific directory, you can use
$ rclone lsl NAME_OF_YOUR_CONFIGURATION:/my_directory
1604608 2024-04-04 02:23:12.0000000 my_other_file.txt
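To copy files from the service to the local file area, use rclone copy with the remote path as the source and a local directory as the destination, for example
$ rclone copy NAME_OF_YOUR_CONFIGURATION:/my_directory /path/to/local/directory
and vice versa (assuming you have configured rclone for both retrieving and uploading data):
$ rclone copy /path/to/local/directory NAME_OF_YOUR_CONFIGURATION:/my_directory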
For further details, consult man rclone, the online documentation, or type rclone --help.
rrsync - restricting ssh key usage¶
rrsync is a utility which allows you to restrict SSH private-public key pairs to only function for a certain subset of rsync operations. A typical case would be that you have a folder of important data, which you are worried about accidentally overwriting. Moreover, you have other important folders, and you don't want to accidentally overwrite those either. So you want to make sure that your usage is restricted to reading from a specific folder, and writing to a specific folder.
For the purpose of this example, we will assume you have a folder on the mimer system, e.g. /mimer/NOBACKUP/groups/groupname, and in there two sub-folders, write_to and read_from. Our goal is to set up a configuration whereby there is no risk of accidentally overwriting files outside of write_to, or reading any other files than those in read_from.
In order to do this, start by creating two ssh key pairs on your local computer:
$ ssh-keygen -t rsa -f ~/.ssh/write_key
$ ssh-keygen -t rsa -f ~/.ssh/read_key
Print the public keys and/or copy them by your preferred means to alvis1 or alvis2:
$ cat ~/.ssh/write_key.pub
$ cat ~/.ssh/read_key.pub
If it doesn't already exist, create a file called ~/.ssh/authorized_keys on alvis2. Paste the two public keys in there, so that your authorized_keys looks something like
ssh-rsa AAAAA...username@local_computer
ssh-rsa AAAAA...username@local_computer
We can test that these keys work by creating a regular ssh configuration on your computer. Edit ~/.ssh/config (creating it if it does not exist) with a text editor like vim, and add the following, replacing the value under User with your CID:
Host alvis2-write
    HostName 129.16.125.131
    User YOUR_CID
    IdentitiesOnly yes
    IdentityFile ~/.ssh/write_key
    PasswordAuthentication no

Host alvis2-read
    HostName 129.16.125.131
    User YOUR_CID
    IdentitiesOnly yes
    IdentityFile ~/.ssh/read_key
    PasswordAuthentication no
Test that your new configuration works, by doing ssh alvis2-read and ssh alvis2-write from your local computer.
Then go into .ssh/authorized_keys on alvis2, and modify it as follows:
command="/usr/bin/rrsync -wo /mimer/NOBACKUP/groups/groupname/write_to" ssh-rsa AAAAA...username@local_computer
command="/usr/bin/rrsync -ro /mimer/NOBACKUP/groups/groupname/read_from" ssh-rsa AAAAA...username@local_computer
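If you would rather generate these lines than edit them by hand, a small helper can prefix a public key line with the command= option. This is a sketch only: restrict_key is a hypothetical function, not part of rrsync or OpenSSH, the key strings are the elided placeholders from the listing above, and directory names containing spaces are not handled.

```shell
# restrict_key MODE DIR KEY: print an authorized_keys line that forces rrsync.
# MODE is -wo (write-only) or -ro (read-only); KEY is a full public key line.
restrict_key() {
  printf 'command="/usr/bin/rrsync %s %s" %s\n' "$1" "$2" "$3"
}

group_dir=/mimer/NOBACKUP/groups/groupname
restrict_key -wo "$group_dir/write_to" 'ssh-rsa AAAAA...username@local_computer'
restrict_key -ro "$group_dir/read_from" 'ssh-rsa AAAAA...username@local_computer'
```

The two printed lines can replace the unrestricted key lines in .ssh/authorized_keys on alvis2.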
The command part restricts each key to be write-only and read-only respectively, to a specific folder. You should now be able to write to /mimer/NOBACKUP/groups/groupname/write_to by running
$ rsync -av /folder/on/local/computer alvis2-write:
and read from /mimer/NOBACKUP/groups/groupname/read_from by running
$ rsync -av alvis2-read: /folder/on/local/computer
To test that usage is restricted to this operation, you can do
$ rsync -av alvis2-write: /folder/on/local/computer
$ rsync -av /folder/on/local/computer alvis2-read:
Both operations should fail if everything has been set up correctly: the first tries to read through the write-only key, and the second tries to write through the read-only key.
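To make this negative test repeatable, the check can be wrapped in a small function that succeeds only when the given command fails. A minimal sketch, where expect_failure is a hypothetical helper name; note that it cannot tell a refusal by rrsync apart from a network or authentication error, so confirm first that the permitted transfers succeed.

```shell
# expect_failure CMD...: return 0 only if CMD exits with a non-zero status.
expect_failure() {
  if "$@" >/dev/null 2>&1; then
    echo "UNEXPECTED SUCCESS: $*"
    return 1
  fi
  echo "failed as expected: $*"
}

# Run from your local computer once the restricted keys are in place:
# expect_failure rsync -av alvis2-write: /folder/on/local/computer
# expect_failure rsync -av /folder/on/local/computer alvis2-read:
```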