HyperQueue

HyperQueue is a tool designed to simplify the execution of large workflows on HPC clusters. It allows you to execute a large number of tasks without having to manually submit each job to Slurm. HyperQueue is available on Alvis and accessible on the command line as hq.

Example

Here's a minimal example of how to submit and run jobs with HyperQueue on Alvis. HyperQueue does not require any special privileges to run, and a deployment consists of two parts: a Server and one or more Workers.

To get access to all of HyperQueue's functionality, you could break apart the batch job example below and run the Server component separately, letting it handle task submission on its own. Please refer to the official documentation on Architecture, Server and Automatic Allocation for how to set up a more intricate environment.
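
As a sketch of that setup, using the automatic allocation feature described in the documentation (the exact options may differ between versions, so check hq alloc add --help): start a long-lived Server on a login node, for instance inside tmux, and let HyperQueue request Slurm allocations for Workers on demand. Tasks can then be submitted straight from the command line.

# Start a persistent Server (e.g. inside a tmux session on a login node).
hq server start &

# Let HyperQueue queue Slurm allocations for Workers automatically;
# arguments after "--" are passed through to sbatch.
hq alloc add slurm --time-limit=1h \
    -- --account=<ACCOUNT ID> --partition=alvis --gpus-per-node=T4:1

# Tasks submitted now will run as soon as a Worker allocation starts.
hq submit --cpus=1 echo "Hello from a Worker"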

After saving the files to the cluster, the example is submitted with sbatch hyperqueue-batch.sh. Make sure to update the account ID in the #SBATCH --account=<ACCOUNT ID> directive below. Read the code comments for light commentary on what does what.

#!/bin/bash
#SBATCH --account=<ACCOUNT ID>
#SBATCH --partition alvis
#SBATCH --nodes=1               # Request one compute node.
#SBATCH --ntasks-per-node=1     # Run one HyperQueue worker.
#SBATCH --cpus-per-task=4       # Set number of cpus per HyperQueue worker.
#SBATCH --gpus-per-node=T4:1
#SBATCH --time=0-00:15:00

# Begin by starting the Server and wait until it's ready.
hq server start &
until hq job list &> /dev/null ; do sleep 1 ; done

# After the Server is ready, start the Worker and wait until it is
# ready to receive tasks.
srun --overlap \
    hq worker start --manager slurm --no-detect-resources &

hq worker wait "$SLURM_NTASKS"

# The output from submitted Tasks will be streamed to a directory, in this
# case "output/", using the "--stream" option.
mkdir -p output/

# Start to submit an array of tasks for the Server to distribute and Worker(s)
# to execute.
module load numba
hq submit --stream=output --cpus=1 --array=1-1000 \
    python hyperqueue-task.py

# Wait until all tasks have finished.
hq job wait all

# Gracefully stop both the Worker and Server.
hq worker stop all
hq server stop

Save the following Python script as hyperqueue-task.py alongside the batch script:

import os
import sys
import numpy
from numba import jit
from timeit import default_timer as timer

# JIT-compile the hot loop with Numba. Note that with @jit the
# target_backend='cuda' option may fall back to CPU execution depending on
# the installed Numba version; the loop serves as an example payload either way.
@jit(target_backend='cuda', nopython=True)
def gpu_calculation(array):
    for i in range(10000000):
        array[i] += 1

if __name__ == "__main__":
    # HyperQueue sets HQ_TASK_ID to a unique number for every task.
    task_id = os.environ.get("HQ_TASK_ID", "N/A")

    print("task_id=%s starting" % task_id, file=sys.stderr)
    array = numpy.ones(10000000)

    start = timer()
    gpu_calculation(array)
    print("task_id=%s calculation time: %s" % (task_id, timer() - start))

    print("task_id=%s done" % task_id, file=sys.stderr)

The file hyperqueue-task.py would be replaced with the tasks you want to run. For HyperQueue specifics, it's worth noting the HQ_TASK_ID environment variable, which is set to a unique number for every task. The script also prints different parts of its output to stdout and stderr; when inspecting the results you can filter on either channel or view both.
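
Where the example above runs the same computation a thousand times, a more typical pattern maps HQ_TASK_ID onto a parameter sweep. A minimal sketch, with a hypothetical list of learning rates standing in for your own parameters:

import os

# Hypothetical sweep: one learning rate per HyperQueue task.
learning_rates = [10 ** -e for e in range(1, 7)]

# HQ_TASK_ID is 1-based when submitted with --array=1-N.
task_id = int(os.environ["HQ_TASK_ID"])
lr = learning_rates[(task_id - 1) % len(learning_rates)]
print("task_id=%s evaluating learning rate %s" % (task_id, lr))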

Results

While the Server and Worker are running, you can inspect the ongoing work using the hq output-log command. If you've run several batches you might have to specify a Server UID using the --server-uid option. The Server UID is shown when you inspect the running server with hq server info.

$ hq server info
+-------------+-------------------------+
| Server UID  | 5DFRJd                  |
| Client host | alvis2-01               |
| Client port | 36777                   |
| Worker host | alvis2-01               |
| Worker port | 33181                   |
| Version     | 0.20.0-dev              |
| Pid         | 2186549                 |
| Start date  | 2024-12-03 07:20:54 UTC |
+-------------+-------------------------+

$ hq output-log --server-uid=5DFRJd output summary
+-------------------------------+-----------------------+
| Path                          | output                |
| Files                         | 1                     |
| Jobs                          | 1                     |
| Tasks                         | 284                   |
| Opened streams                | 0                     |
| Stdout/stderr size            | 13.67 KiB / 10.48 KiB |
| Superseded streams            | 0                     |
| Superseded stdout/stderr size | 0 B / 0 B             |
+-------------------------------+-----------------------+

As of writing, version v0.20 of HyperQueue ships with the hq dashboard command disabled by the developers. If you're running a later version, give the command a try to view the built-in HyperQueue Dashboard.

With the above batch job running or completed, you can inspect the output generated by HyperQueue in the output/ directory. A particularly handy feature is the ability to filter the output based on which channel it was written to.

$ hq output-log --server-uid=5DFRJd output show --channel stdout
1.0186:0> task_id=186 calculation time: 0.6406302459072322
1.0189:0> task_id=189 calculation time: 0.6401626951992512
1.0375:0> task_id=375 calculation time: 0.6405851498711...

$ hq output-log --server-uid=5DFRJd output show --channel stderr
1.0186:1> task_id=186 starting
1.0189:1> task_id=189 starting
1.0375:1> task_id=375 starting
1.0564:1> task_id=564 starting
1.0186:1> task_id=186 done
1.0189:1> task_id=189 done
1.0375:1> task_id=375 done
1.0564:1> task_id=564 d...
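
If you'd rather see the raw output of a single job without the stream prefixes, hq output-log also has a cat subcommand that prints one channel of a job verbatim. A minimal sketch reusing the Server UID from above; the task-filtering flag is an assumption on our part, so verify it with hq output-log cat --help on your version:

$ hq output-log --server-uid=5DFRJd output cat 1 stdout --task=186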