Vera expansion

Vera has been expanded with 26 Intel Icelake compute nodes, each with 64 cores and either 512GB or 1024GB RAM, and a total of 16 A40 and 8 A100 NVIDIA GPUs. You can find detailed hardware descriptions on the Vera page.

The old hardware is based on the Intel Skylake CPU architecture, so with this addition Vera is now running mixed hardware. Software optimised for Skylake will also work on the new Icelake nodes, but not the other way around.

In addition to the expansion, other major changes will occur on Vera:

Upgrade plan

2022-11-21: Updates are done!

All nodes have been updated!

  1. 2022-10-25: All remaining GPU nodes (V100 and T4) + 48 MEM96 nodes. (done)
  2. 2022-11-08: 64 MEM96 nodes (done)
  3. 2022-11-22: 41 MEM96 nodes + vera1 login node. (done)

All nodes in private queues will be updated 2022-11-22, or earlier if agreed between the PI and C3SE-staff.

OS update and new module tree

In order to take advantage of new features, hardware and software support, the operating system (OS) will be updated from CentOS 7 to Rocky Linux 8. The new Icelake based nodes, along with the vera2 login node, are already running Rocky Linux 8 and an updated Mellanox Infiniband stack. The old nodes will be drained and updated in large blocks, redeploying all nodes and lastly the login node vera1.

With a new major OS version and a new hardware architecture, a new module tree has been built. This means that old versions of some software will no longer be available.

The login node vera2 has both the new OS and the new module tree active. Until the transition is complete, you should submit jobs for CentOS 7 nodes from vera1 and jobs for Rocky Linux 8 nodes from vera2!

NOTE: Due to an oversight, some SSH host keys were lost in the update. If your SSH client stored the old keys, you will get the warning "remote host identification has changed". This is harmless; follow the instructions given by your SSH client to remove the old cached key, and the warning will go away.

If your favourite software isn't available on vera2, please contact support. User-built software, such as a virtualenv made on CentOS 7, may not work, and you should re-create it for Rocky Linux 8. Containers and commercial software (like MATLAB) will likely be completely unaffected.
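As a rough sketch of re-creating a virtualenv for Rocky Linux 8, the helper below reads the standard ID and VERSION_ID fields from /etc/os-release so that the rebuild only happens on the new OS. The function names os_tag and rebuild_venv, the directory ~/venvs/myproject and the file requirements.txt are hypothetical; load whichever Python module the new module tree provides before running it.

```shell
#!/bin/bash
# Print "<ID>-<major version>" from an os-release file, e.g. "rocky-8" or "centos-7".
# os_tag is a hypothetical helper name, not a C3SE tool.
os_tag() {
    local file="${1:-/etc/os-release}"
    # Source in a subshell so ID/VERSION_ID do not leak into the caller.
    ( . "$file" && echo "${ID}-${VERSION_ID%%.*}" )
}

# Rebuild a virtualenv, but only when running on Rocky Linux 8 (e.g. on vera2).
rebuild_venv() {
    local venv_dir="$1" requirements="$2"
    if [ "$(os_tag)" != "rocky-8" ]; then
        echo "Not on Rocky Linux 8; log in to vera2 first." >&2
        return 1
    fi
    # ${venv_dir:?} aborts if the path is empty, so rm never runs on "".
    rm -rf -- "${venv_dir:?}"
    python3 -m venv "$venv_dir"
    "$venv_dir/bin/pip" install -r "$requirements"
}

# Example (paths are assumptions):
# rebuild_venv "$HOME/venvs/myproject" requirements.txt
```

Deleting and rebuilding, rather than patching the old environment, avoids leftover binaries compiled against CentOS 7 libraries.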

Hyper-threading will be disabled

Due to the lack of software support, and considerable confusion regarding job submissions among new users, it has been decided to disable Hyper-threading on Vera. This means many jobs that once specified

#SBATCH -c 2

should no longer do so when submitting jobs to updated nodes (i.e. Rocky Linux 8 nodes).
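A minimal job script for an updated node might then look like the sketch below. The project name, task count, wall time and program name are placeholders, not recommendations; adjust them to your own project.

```shell
#!/bin/bash
# Placeholder project name: replace with your own.
#SBATCH -A PROJECT-NAME
# 32 tasks; with Hyper-threading disabled each task gets one physical core.
#SBATCH -n 32
# Placeholder wall time.
#SBATCH -t 00:30:00
# Note: no "#SBATCH -c 2" line is needed on Rocky Linux 8 nodes.

# Slurm sets SLURM_NTASKS inside a job; fall back to 1 when run by hand.
NTASKS="${SLURM_NTASKS:-1}"
echo "Running with ${NTASKS} task(s)"
# srun ./my_program   # hypothetical executable
```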

During this transition, the usage reported with projinfo will be wrong; jobs on new and updated nodes will effectively cost half as much (until the transition is complete).

Selecting CPU architecture

For GPU jobs, the corresponding CPU type is implied (see hardware), and there is no need to specify anything other than the GPU model as usual, e.g.

#SBATCH --gpus-per-node=A40:1
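Put into a complete script, a GPU job could be sketched as follows. The project name and wall time are placeholders; `nvidia-smi -L` is the standard NVIDIA command for listing GPUs and is used here only as a sanity check.

```shell
#!/bin/bash
# Placeholder project name: replace with your own.
#SBATCH -A PROJECT-NAME
# One A40; the Icelake CPU type follows from the GPU model.
#SBATCH --gpus-per-node=A40:1
# Placeholder wall time.
#SBATCH -t 00:30:00

# List the GPUs visible to the job, if the NVIDIA tools are on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    GPU_LIST="$(nvidia-smi -L 2>&1 || true)"
else
    GPU_LIST="nvidia-smi not found"
fi
echo "$GPU_LIST"
```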

For CPU jobs, you can explicitly select Skylake or Icelake architecture using

#SBATCH -C SKYLAKE

or

#SBATCH -C ICELAKE

When requesting a specific node type via MEMXXX constraints, you are also limiting yourself to a specific CPU architecture, e.g. MEM192 will be a 32 core Skylake node, as can be seen on the hardware page. If you do not specify any constraint, the job will automatically be placed on SKYLAKE nodes.
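For illustration, a CPU job that explicitly targets the Icelake nodes could look like the sketch below. The project name, task count and wall time are placeholders; the /proc/cpuinfo check is just one way to confirm which CPU type the job landed on.

```shell
#!/bin/bash
# Placeholder project name: replace with your own.
#SBATCH -A PROJECT-NAME
# Explicitly request the 64-core Icelake nodes.
#SBATCH -C ICELAKE
#SBATCH -n 64
# Placeholder wall time.
#SBATCH -t 00:30:00

# Report the CPU model of the node the script runs on, as a sanity check.
CPU_MODEL="$(grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2- | sed 's/^ *//')"
echo "CPU: ${CPU_MODEL:-unknown}"
```

Swapping `-C ICELAKE` for `-C SKYLAKE` or `-C MEM192` selects the other node types in the same way.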

Use jobinfo to see the current queue and to check whether some node types are less congested.

Submitting jobs to CentOS 7 or Rocky Linux 8 during the transition

If you are explicitly submitting jobs to the new ICELAKE nodes or new GPUs, you don't need to do anything; they all run Rocky Linux 8 from the start.

During the transition you can opt into using SKYLAKE nodes running Rocky Linux 8 (without hyper-threading) by specifying:

#SBATCH --reservation=rocky8

NEWS: All updates are done. Please remove this flag from your jobscripts.
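Now that the reservation is no longer needed, one way to clean up existing job scripts is a small sed filter like the sketch below. The function name strip_rocky8_reservation and the file names are hypothetical; review the result before replacing your original script.

```shell
#!/bin/bash
# Remove "#SBATCH --reservation=rocky8" lines from a job script read on stdin.
# strip_rocky8_reservation is a hypothetical helper name.
strip_rocky8_reservation() {
    sed -E '/^#SBATCH[[:space:]]+--reservation=rocky8([[:space:]]|$)/d'
}

# Example: write a cleaned copy next to the original (file names are assumptions).
# strip_rocky8_reservation < job.sh > job.cleaned.sh
```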

Remember, the default is to use CentOS 7 nodes until the transition is complete.

The current state of the main vera partition (last updated 2022-11-22):

#nodes  CPU      #cores  RAM (GB)  GPUs    Current state
2       Skylake  32      384       2xV100  Rocky 8
3       Skylake  32      96        1xT4    Rocky 8
192     Skylake  32      96        -       Rocky 8
17      Skylake  32      192       -       Rocky 8
2       Skylake  32      768       -       Rocky 8
4       Icelake  64      512       4xA40   Rocky 8
2       Icelake  64      512       4xA100  Rocky 8
14      Icelake  64      512       -       Rocky 8
6       Icelake  64      1024      -       Rocky 8

All nodes will be updated eventually, including private partitions. Everyone should prepare and test submitting jobs on Rocky Linux 8 nodes as soon as possible.

Private partitions

Private nodes will be drained and updated last or in agreement with each PI. Users should check out the new module tree on vera2 to prepare for the transition.

MStud partition

The private mstud partition for students will be merged into the main queue. Existing projects will have their allocation moved to the main vera partition instead, so any job scripts must remove -p mstud or change it to -p vera.
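A sed filter along the following lines could update existing job scripts. This is only a sketch: the function name retarget_mstud is hypothetical, and it handles only the short `-p mstud` form mentioned above, not `--partition=mstud`.

```shell
#!/bin/bash
# Rewrite "#SBATCH -p mstud" to "#SBATCH -p vera" in a job script read on stdin.
# retarget_mstud is a hypothetical helper name.
retarget_mstud() {
    sed -E 's/^(#SBATCH[[:space:]]+-p[[:space:]]+)mstud([[:space:]]|$)/\1vera\2/'
}

# Example (file names are assumptions):
# retarget_mstud < job.sh > job.vera.sh
```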

Upcoming features

Following the OS update, a few important improvements are in the planning phase:

  • Automatic CPU and Memory limits for each user on login nodes, preventing abuse.
  • Direct access to Mimer from Vera.
  • OpenOnDemand portal for interactive jobs.
  • Building containers on login nodes (done!)