Senior researchers at Swedish universities are eligible to apply for SNIC projects at other centres; those centres may have more available time, or specialized hardware that suits your needs, e.g. GPUs, large-memory nodes, or support for sensitive data.
$SLURM_SUBMIT_DIR is defined in jobs, and points to where you submitted your job.
Try to avoid lots of small files: sqlite or HDF5 are easy to use!
Storing data - TMPDIR
$TMPDIR: local scratch disk on the node(s) of your jobs. Automatically deleted when the job has finished.
When should you use $TMPDIR?
The only good reason NOT to use $TMPDIR is if your program only loads data in one read operation, processes it, and writes the output.
It is crucial that you use $TMPDIR for jobs that perform intensive file I/O
If you're unsure what your program does: investigate it, or use $TMPDIR!
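A typical pattern, sketched here with placeholder file and program names:
cp input.dat $TMPDIR                       # stage input onto node-local disk
cd $TMPDIR
$SLURM_SUBMIT_DIR/my_program input.dat     # intensive I/O now hits the local disk
cp results.dat $SLURM_SUBMIT_DIR           # copy results back before the job ends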
Storing data - TMPDIR
Using $TMPDIR means the disks on the compute node will be used
Using /cephyr/users/<CID>/ means the network-attached disks on the center storage system will be used
The latter means there will be both network traffic on the shared network, and I/O activity on the shared storage system
Currently, the local disk is 1600GB on each Hebbe node, and 380GB (SSD) on each Alvis and Vera node.
Using sbatch --gres=ptmpdir:1 you get a distributed, parallel $TMPDIR across all nodes in your job. Always recommended for multi-node jobs that use $TMPDIR.
projinfo lists your projects and current usage. projinfo -D breaks down usage day-by-day (up to 30 days back).
Project          Used[h]   Allocated[h]   Queue
C3SE2017-1-8    15227.88*        10000    hebbe
   robina        2035.88*
   dawu           150.59*
C3SE507-15-6     9035.27         28000    mob
(A star after Used[h] means the project is over 100% usage, which means it has lowered priority.)
On compute clusters jobs must be submitted to a queuing system that starts your jobs on the compute nodes:
sbatch <arguments> script.sh
Jobs must NOT run on the login nodes. Prepare your work on the front-end, and then submit it to the cluster
A job is described by a script (script.sh above) that is passed on to the queuing system by the sbatch command
Arguments to the queue system can be given in the script.sh as well as on the command line
Maximum wall time is 7 days (we might extend it manually if you're in a panic), but it's for your own good to make the job restartable!
When you allocate less than a full node, you are assigned a proportional part of the node's memory and local disk space as well.
-C MEM64 requests a 64GB node - there are 192 of those
-C MEM128 requests a 128GB node - there are 34 of those
-C MEM512 requests a 512GB node - there is 1 of those
-C MEM1024 requests a 1024GB node - there is 1 of those
-C "MEM512|MEM1024" requests either 512GB or 1TB node
--gres=gpu:1 requests 1 K40 GPU (and a full node associated with it)
Don't specify constraints (-C) unless you know you need them.
Your job goes in the normal queue, but waits for a node with the requested configuration to become available
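For example, a minimal sketch of a job script requesting one of the 128GB nodes (the project ID and time limit are placeholders):
#!/bin/bash
#SBATCH -A C3SE2017-1-8     # placeholder project ID
#SBATCH -N 1
#SBATCH -C MEM128           # wait for a 128GB node
#SBATCH -t 12:00:00
echo "Running on $(hostname)"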
Running jobs on Hebbe
On Hebbe you can allocate individual CPU cores, not only nodes (up to the full 20 cores on the node)
Your project will be charged only for the core hours you use
If you request more than 1 node, you get all cores (and pay the core hours for them)
Try to stick to a divisor of 20 (e.g. 4, 5, or 10) for the number of tasks when sharing nodes.
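For instance, a half-node job could look like this sketch (the project ID and program are placeholders):
#!/bin/bash
#SBATCH -A C3SE2017-1-8     # placeholder project ID
#SBATCH -n 10               # 10 tasks: a divisor of 20, half a node
#SBATCH -t 06:00:00
module load intel
mpirun ./my_solver          # hypothetical MPI program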
Running jobs on Vera
Same deal as with Hebbe:
-C MEM96 requests a 96GB node - 168 total (some private)
-C MEM192 requests a 192GB node - 17 total (all private)
-C MEM384 requests a 384GB node - 7 total (5 private, 2 GPU nodes)
-C MEM768 requests a 768GB node - 2 total
-C 25G requests a node with a 25Gbit/s storage and internet connection (nodes without 25G still use the fast Infiniband for access to /cephyr).
--gres=gpu:1 requests 1 GPU (half the GPU node is allocated)
--gres=gpu:2 requests 2 GPUs (etc.)
Don't specify constraints (-C) unless you know you need them.
Running jobs on Vera
On Vera you can still allocate individual cores; since each core provides 2 hardware threads, you will always end up with an even number of threads (tasks).
If you request more than 1 node, you get all cores and threads (and pay the core hours for them)
You can only ask for an even number of threads in total (odd requests are rounded up to the next even number)
If you use the -n parameter, you are requesting tasks.
-c "CPUs" per task (as Vera has HyperThreading enabled, this might be relevant for many jobs).
The combination of -n and -c affects how mpirun will launch.
A full node has 64 threads. You may want to control how mpirun distributes processes.
Your project will be charged only for the core hours you use.
Try to stick to a power of 2 number of tasks when sharing nodes.
Benchmark your code to see if -n X -c 2 or -n 2*X runs faster.
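One way to compare is to submit the same case twice, sketched here with placeholder names (<project>, my_app):
# 16 ranks with a full core (2 threads) each, vs. 32 ranks with 1 thread each:
sbatch -A <project> -n 16 -c 2 -t 01:00:00 --wrap "mpirun ./my_app"
sbatch -A <project> -n 32 -c 1 -t 01:00:00 --wrap "mpirun ./my_app"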
Running jobs on Alvis
Alvis is dedicated to GPU-hungry computations; therefore, your job must allocate at least one GPU
On Alvis you can allocate individual cores (tasks)
Hyperthreading is disabled on Alvis
Alvis comes in three phases (I, II, and III), which vary in terms of:
number of cores
number and type of GPUs
memory per GPU
memory per node
Pay close attention to the above-mentioned items in your job submission script to pick the right hardware
for instance, phase Ia comes with NVIDIA V100 GPUs, while phase Ib is equipped with T4 GPUs
Allocating GPUs on Alvis
You can specify the number of GPUs and let the scheduler decide the type
You can also specify the type (recommended); see the sketch below:
Currently, mixing GPUs of different types is not allowed
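A sketch of both variants (the type string, e.g. V100 or T4, must match the Alvis hardware; pick one line):
#SBATCH --gres=gpu:1          # one GPU, scheduler decides the type
#SBATCH --gres=gpu:V100:2     # two V100 GPUs (type specified; recommended)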
Vera script example
Note: You can (currently) only allocate a minimum of 1 core = 2 threads on Vera
#!/bin/bash
#SBATCH -A C3SE2018-1-2
## Note! Vera has hyperthreading enabled:
## n * c = 128 threads total = 2 nodes
## This should launch 32 MPI-processes on each node.
#SBATCH -n 64
#SBATCH -c 2
#SBATCH -t 2-00:00:00
#SBATCH --gres=ptmpdir:1

module load ABAQUS intel

cp train_break.inp $TMPDIR
cd $TMPDIR
abaqus cpus=$SLURM_NTASKS mp_mode=mpi job=train_break
cp train_break.odb $SLURM_SUBMIT_DIR
Array job example
#!/bin/bash
#SBATCH -A C3SE2017-1-2
#SBATCH -n 40 -t 2-00:00:00
#SBATCH --array=0-50:10     # assumed range: the array index is used as the temperature

module load intel/2017a

# Set up a new folder, copy the input file there
temperature=$SLURM_ARRAY_TASK_ID
dir=temp_$temperature
mkdir $dir; cd $dir
cp $HOME/base_input.in input.in

# Set the temperature in the input file:
sed -i "s/TEMPERATURE_PLACEHOLDER/$temperature/" input.in

mpirun $HOME/software/my_md_tool -f input.in
Here, the array index is used directly as input. If it turns out that 50 degrees was insufficient, then we could do another run:
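For instance (the range and script name are illustrative):
sbatch --array=60-100:10 job_script.sh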
Interactive use and X-forwarding
Log in with X-forwarding enabled (echo $DISPLAY should then read something like localhost:XX.0)
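A minimal sketch of starting an interactive session with X11 forwarding (the project ID and time limit are placeholders):
srun -A C3SE2018-1-2 -n 1 -t 01:00:00 --x11 --pty bash -i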
X11 forwarding in SLURM is still a bit experimental.
If you get the error xauth: error in locking authority file ..., then you need to remove the lock files: rm ~/.Xauthority-c ~/.Xauthority-l and try again.
If you get the error srun: error: run_command: xauth poll timeout @ 100 msec just try again. This is a known bug in SLURM that causes the problems above and will hopefully be fixed in the next release.
Job command overview
sbatch: submit batch jobs
srun: submit interactive jobs
jobinfo, squeue: view the job-queue and the state of jobs in queue
scontrol show job <jobid>: show details about job, including reasons why it's pending
sprio: show all your pending jobs and their priority
scancel: cancel a running or pending job
sinfo: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
sacct: show scheduling information about past jobs
projinfo: show the projects you belong to, including monthly allocation and usage
For details, refer to the -h flag, man pages, or google!
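A few illustrative invocations (the job ID is a placeholder):
squeue -u $USER                # your queued and running jobs
scontrol show job 1234567      # details, including why the job is pending
sacct -j 1234567 --format=JobID,Elapsed,MaxRSS,State    # stats for a past job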
Why am I queued? jobinfo -u $USER shows the reason; common ones are:
Priority: Waiting for other queued jobs with higher priority.
Resources: Waiting for sufficient resources to be free.
AssocGrpCPURunMinutesLimit: We limit how much you can have running at once (<= 100% of 30-day allocation * 0.5^x where x is the number of stars in projinfo).
You can log on to the nodes that your job got allocated by using ssh (from the login node) as long as your job is running. There you can check what your job is doing, using normal Linux commands - ps, top, etc.
top will show you how much CPU your process is using, how much memory, and more. Tip: press 'H' to make top show all threads separately, for multithreaded programs
iotop can show you how much your processes are reading and writing on disk
Performance benchmarking with Allinea MAP (part of Allinea Forge) or Intel VTune
Debugging with Allinea DDT (also part of Forge), gdb, AddressSanitizer, or Valgrind
Running top on your job's nodes:
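For example (the job ID and node name are placeholders):
squeue -j 1234567 -o "%N"    # find which node(s) your job got
ssh hebbe12-3                # log in to one of them from the login node
top                          # inspect your processes (press 'H' for threads)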
The sinfo -Rl command shows how many nodes are down for repair.
The health status page gives an overview of what the node(s) in your job are doing
Check e.g. memory usage, user/system/wait CPU utilization, disk usage, etc.
See summary of CPU and memory utilization (only available after job completes): seff JOBID
System status information for each resource is available through the C3SE homepage.
Examples from the health status page:
The ideal job: high CPU utilization and no disk I/O.
Something tried to use 2 nodes incorrectly: one node swapped to death, while the other was just idling.
A node waiting a lot: not great; perhaps inefficient I/O use.
Things to keep in mind
Never run (big or long) jobs on the login node! If you do, we will kill the processes. If you keep doing it, we'll throw you out and block you from logging in for a while! Prepare your job, do tests and check that everything's OK before submitting the job, but don't run the job there!
Keep an eye on what's going on - use normal Linux tools on the login node and on the allocated nodes to check CPU, memory and network usage, etc. Especially for new jobscripts/codes!
Think about what you do - if you by mistake copy very large files back and forth you can slow the storage servers or network to a crawl
We provide support to our users, but not for any and all problems
We can help you with software installation issues, and recommend compiler flags etc. for optimal performance
We can install software system-wide if there are many users who need it - but not for one user (unless the installation is simple)
We don't support your application software, help debug your code/model, or prepare your input files
C3SE staff are available in our offices, to help with those things that are hard to put into a support request email (book a time in advance please)
Rooms O5105B, O5110, and O5111 in the Origo building - Fysikgården 1, one floor up; ring the bell to the right
We also offer advanced support for things like performance optimization, advanced help with software development tools or debuggers, workflow automation through scripting, etc.
Getting support - support requests
If you run into trouble, first figure out what seems to go wrong. Use the following as a checklist:
Is something wrong with your job script or input file?
Does your simulation diverge?
Is there a bug in the program?
Are there any error messages? Look in your manuals, and use Google!
Check the node health: did you over-allocate memory until Linux killed the program?
Try to isolate the problem - does it go away if you run a smaller job? does it go away if you use your home directory instead of the local disk on the node?
Try to create a test case - the smallest and simplest possible case that reproduces the problem
Getting support - error reports
In order to help you, we need as much good information as possible:
What's the job-ID of the failing job?
What working directory and what job-script?
What software are you using?
What's happening - especially error messages?
Did this work before, or has it never worked?
Do you have a minimal example?
Where are the files you've used - scripts, logs, etc.?
No need to attach files; just point us to a directory on the system.