TCC SGI Altix ICE Cluster User’s Guide

Introduction

The TCC SGI Altix ICE Cluster is an SGI Altix ICE integrated compute environment for high-performance computing.

Hardware Overview

  • 22 Compute Nodes
  • Dual Four-Core Intel Xeon X5365 Processors per node (8 logical processors)
      - CPU Speed: 3.00 GHz
      - Bus Speed: 1333 MHz
      - L2 Cache: 8 MB
      - Arch: x86_64
  • 16 GB Memory per node
  • 4x DDR Infiniband Interconnect
  • 1 TB User File Storage (1 GB User Quota)
  • 8.8 TB Scratch Temporary File Storage

Software Overview

  • Operating System: SUSE Linux Enterprise Server 10.1 (x86_64)
  • MPI: mvapich2-1.0-1
  • Compilers: Intel 10.1
  • Libraries: Intel Math Kernel Library 10.0 Update 3

Access

Requesting Time On The Cluster

Accounts

SGI Altix ICE Cluster accounts are completely separate from your TCC account. User ID, password, and file storage are not shared between accounts. You can transfer data between your TCC account and your SGI Altix ICE Cluster account using scp or sftp, as described below.

Your SGI Altix ICE Cluster account has 1 GB of user space on /home/. There is 8.8 TB of shared temporary scratch space on /scratch/.

Note

Please manually back up any important data to your TCC account or elsewhere.

Login

Log in using secure shell (ssh):

ssh [user]@ice1.nmt.edu

Transfer files to the Linux cluster from your home machine or TCC account using secure copy (scp) or secure FTP (sftp):

scp [file] [user]@ice1.nmt.edu:[dest]
sftp [user]@ice1.nmt.edu

Requesting Resources

Warning

Do not run jobs on service0. Please use it for compilation and small debugging runs only.

Resources are requested through qsub.

A required option for all resource requests is walltime. By default, all jobs have a walltime of 0 minutes. Specify a walltime for your job with the -l option to qsub, in the form -l walltime=1:00 for one minute or -l walltime=1:00:00 for one hour. There is currently no upper limit on walltime; the requirement exists to help the scheduler.
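
As a rough illustration of the walltime format, the sketch below converts a whole number of minutes into the hh:mm:ss form. The function name walltime_from_minutes and the script name job.sh are hypothetical and are not part of the cluster software. The last line only prints a qsub command; it does not submit anything.

```shell
# Hypothetical helper: turn a whole number of minutes into hh:mm:ss.
walltime_from_minutes() {
    minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

# Print (without running) a qsub command for a 90-minute job.
# job.sh is a placeholder for your own job script.
echo "qsub -l walltime=$(walltime_from_minutes 90) job.sh"
```

Requesting a walltime close to what the job actually needs gives the scheduler better information than a very large value.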

Software

Selecting Software Using Module

The cluster has several software packages that serve the same purpose and have the same executable name. To ensure the software package you want to use is being executed when you run the command, you will need to set up your shell environment to prefer the package you want. This is done through the Modules package and the module command.

Using The Module Command

Module Command
If the module command is not found, you need to source its configuration manually:
# For Bash, enter:
source /etc/profile.d/modules.sh
# For CSH, enter:
source /etc/profile.d/modules.csh
Listing Modules
module avail
The avail subcommand lists the modules that are available to load into your environment.
khan@service0:~> module avail
------------------ /usr/share/modules -------------------
3.1.6                      modulefiles/mvapich_gcc
modulefiles/dot            modulefiles/mvapich_intel
modulefiles/module-cvs     modulefiles/null
modulefiles/module-info    modulefiles/openmpi_gcc
modulefiles/modules        modulefiles/openmpi_intel
modulefiles/mvapich2_gcc   modulefiles/perfcatcher
modulefiles/mvapich2_intel modulefiles/use.own
------------ /usr/share/modules/modulefiles -------------
dot            mvapich2_intel openmpi_intel
module-cvs     mvapich_gcc    perfcatcher
module-info    mvapich_intel  use.own
modules        null
mvapich2_gcc   openmpi_gcc
--------------- /opt/intel/intel_modules ----------------
cc/10.1.015  cce/10.1.015 fc/10.1.015  mpi/3.1.026
Loading Modules
module load modulename
The load subcommand sets up software in your environment.
khan@service0:~> module load mvapich2_gcc
khan@service0:~> module list
1) mvapich2_gcc
khan@service0:~> which mpicc
/usr/mpi/mvapich2-1.0-1/gcc/bin/mpicc
Unloading Modules
module unload modulename
The unload subcommand removes that software from your environment.
khan@service0:~> module list
1) mvapich2_gcc
khan@service0:~> module unload mvapich2_gcc
khan@service0:~> which mpicc
which: no mpicc in (/usr/local/bin:/usr/bin:...)

Note

If you want to have these modules be available to you each time you log in, place the commands to load those modules in a file called .bashrc in your home folder.
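
A minimal sketch of what such .bashrc lines might look like is shown below. It assumes the mvapich2_gcc module is the one you want; substitute any name listed by module avail. The guards are an addition for safety, so that the file does nothing harmful on a machine where module is not installed.

```shell
# Possible additions to ~/.bashrc (a sketch; adjust the module name as needed).

# Make the module command available if the shell has not defined it yet.
if ! type module >/dev/null 2>&1 && [ -r /etc/profile.d/modules.sh ]; then
    . /etc/profile.d/modules.sh
fi

# Load the MPI stack, but only if the module command is now available.
if type module >/dev/null 2>&1; then
    module load mvapich2_gcc || echo "mvapich2_gcc module not available" >&2
fi
```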

MPI Libraries

MPT: SGI Message Passing Toolkit

Note

MPT is the default MPI installation. It is available without using the module command.

Message Passing Toolkit (MPT) is a software package that supports interprocess data exchange for applications that use concurrent, cooperating processes on a single host or on multiple hosts. Data exchange is done through message passing, which is the use of library calls to request data delivery from one process to another or between groups of processes.

For more information on MPT, see SGI TechPub: 007-3773-007

The MPT package contains the following components and the appropriate accompanying documentation:

  • Message Passing Interface (MPI). MPI is a standard specification for a message passing interface, allowing portable message passing programs in Fortran and C languages.
  • The SHMEM programming model. The SHMEM programming model is a distributed, shared-memory model that consists of a set of SGI-proprietary message-passing library routines. These routines help distributed applications efficiently transfer data between cooperating processes. The model is based on multiple processes having separate address spaces, with the ability for one process to access data in another process’ address space without interrupting the other process. The SHMEM programming model is not a standard like MPI, so SHMEM applications developed on other vendors’ hardware might or might not work with the SGI SHMEM implementation.

Compiling and Linking MPI Programs with MPT

To compile using GNU compilers, choose one of the following commands:

  • g++ -o myprog myprog.cpp -lmpi++ -lmpi
  • gcc -o myprog myprog.c -lmpi
  • gfortran -I/usr/include -o myprog myprog.f -lmpi

To compile programs with the Intel compiler, use the following commands:

  • ifort -o myprog myprog.f -lmpi
  • icc -o myprog myprog.c -lmpi

Running MPT MPI Jobs using Portable Batch System (PBS)

Schedule a session with PBS using qsub.

Each MPI application is executed with the mpiexec command that is delivered with the PBS Pro software packages. This is a wrapper script that assembles the correct host list and the corresponding mpirun command, then executes that mpirun command. The basic syntax is as follows:

mpiexec -n P ./a.out

where P is the total number of MPI processes in the application. This syntax applies whether running on a single host or a clustered system. See the mpiexec(8) man page for more details.

MVAPICH2: MPI over InfiniBand

MVAPICH is open source software developed largely by the Network-Based Computing Laboratory (NBCL) at Ohio State University. MVAPICH implements the Message Passing Interface (MPI) style of process-to-process communications for computing systems employing Infiniband and other Remote Direct Memory Access (RDMA) interconnects.

For more descriptions including the MVAPICH User Guide and other MVAPICH publications, see http://mvapich.cse.ohio-state.edu.

MVAPICH applications use the Infiniband network of SGI Altix ICE 8200 systems for interprocess RDMA communications. SGI Altix ICE 8200 systems are configured with two Infiniband fabrics, designated as ib0 and ib1. In order to maximize performance, SGI advises that the ib0 fabric be used for all MPI traffic, including MVAPICH MPI. The ib1 fabric is reserved for storage related traffic. The default configuration for MVAPICH MPI is to use only the ib0 fabric.

Compiling and Linking MPI Programs with MVAPICH2

To compile using GNU compilers:

Load the mvapich2 gcc module:

  • module load mvapich2_gcc

Choose one of the following commands:

  • mpicxx -o myprog myprog.cpp
  • mpicc -o myprog myprog.c
  • mpif77 -o myprog myprog.f
  • mpif90 -o myprog myprog.f

To compile using Intel compilers:

Load the mvapich2 intel and Intel compiler modules:

  • module load mvapich2_intel
  • module load cce/10.1.015

Choose one of the following compiler commands:

  • mpicxx -o myprog myprog.cpp
  • mpicc -o myprog myprog.c
  • mpif77 -o myprog myprog.f
  • mpif90 -o myprog myprog.f

Running MVAPICH2 MPI Jobs using Portable Batch System (PBS)

First, configure mpd: create the $HOME/.mpd.conf file with permissions 600 (readable and writable only by you).

# cat $HOME/.mpd.conf
MPD_SECRETWORD=secretword

Change “secretword” to something secret; it is your “password” for mpd.
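
One possible way to create the file with the right permissions is sketched below. The function name write_mpd_conf is hypothetical, and the example call is commented out so that an existing file is not overwritten by accident.

```shell
# Hypothetical helper: write an mpd configuration file that only you can read.
# $1 = path of the file to write, $2 = secret word
write_mpd_conf() {
    conf_path=$1
    secret=$2
    (
        umask 077    # files created in this subshell get mode 600
        printf 'MPD_SECRETWORD=%s\n' "$secret" > "$conf_path"
    )
    chmod 600 "$conf_path"    # also tightens a file that already existed
}

# Example call (replace the second argument with your own secret word):
# write_mpd_conf "$HOME/.mpd.conf" "something-only-you-know"
```

Setting the umask before writing means the secret word is never readable by other users, even briefly.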

Schedule a session with PBS using qsub.

Load the mvapich2 module you used to compile:

  • module load mvapich2_gcc
  • module load mvapich2_intel

Boot the MPD multiprocessing daemons.

mpdboot starts MPDs on the nodes you have access to. These MPDs make the nodes into a “virtual machine” that can run MPI programs. When you run an MPI program under mvapich2, requests are sent to MPD daemons to start up copies of the program.

mpdboot -n P -f $PBS_NODEFILE

Where P is the number of nodes requested from PBS.

Launch the program with mpiexec

mpiexec -np P ./a.out

Where P is the total number of MPI processes in the application.

Clean up the MPI environment.

mpdallexit
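
The steps above could also be collected into a job script for a non-interactive run. The following is only a sketch under stated assumptions: the file name mvapich2_job.sh and the job name are hypothetical, the resource request is borrowed from the walkthrough in the Examples section, and $PBS_NODEFILE is assumed to contain one line per MPI process. The snippet writes the script and checks its syntax; it does not submit it.

```shell
# Write a hypothetical job script, mvapich2_job.sh, to the current directory.
cat > mvapich2_job.sh <<'EOF'
#!/bin/bash
#PBS -N mvapich2_job
#PBS -l walltime=1:00:00,select=2:ncpus=8:mpiprocs=8

# Batch jobs start in the home directory; move to where qsub was run.
cd "$PBS_O_WORKDIR"

# The job runs in a new shell, so set up module and load the MPI stack again.
. /etc/profile.d/modules.sh
module load mvapich2_gcc

# Assumption: $PBS_NODEFILE lists one line per MPI process.
NODES=$(sort -u "$PBS_NODEFILE" | wc -l)
PROCS=$(wc -l < "$PBS_NODEFILE")

mpdboot -n "$NODES" -f "$PBS_NODEFILE"
mpiexec -np "$PROCS" ./a.out
mpdallexit
EOF

# Check the script for syntax errors without running it.
bash -n mvapich2_job.sh && echo "mvapich2_job.sh parsed without syntax errors"
```

If the script suits your job, it would be submitted with qsub mvapich2_job.sh.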

OpenMPI

E-Mail

The batch system will notify you about your jobs via email to your local account on service0. Look at the manual page for qsub, specifically options -M and -m for more email options.

By default mail will be stored locally on service0. You can either read mail locally or forward it to another system.

Forwarding Mail
To forward mail to your TCC, department, or other email account, use a ~/.forward file. Place the addresses to forward to on separate lines. If you would like to keep a local copy of the mail, insert a backslash followed by your local username on the last line of the file.
Checking Mail Locally
Use mutt to check your mail locally on service0. It is already configured to read from the local spool.
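
To make the ~/.forward layout described under Forwarding Mail concrete, here is a small sketch that writes such a file. The function name write_forward_file and the address user@example.com are placeholders, and the example call is commented out so that an existing ~/.forward is left alone.

```shell
# Hypothetical helper: write a .forward-style file.
# $1 = path to write
# $2 = local username to keep a copy for (pass "" to keep no local copy)
# remaining arguments = addresses to forward to, one per line
write_forward_file() {
    forward_path=$1
    local_user=$2
    shift 2
    : > "$forward_path"    # start with an empty file
    for address in "$@"; do
        printf '%s\n' "$address" >> "$forward_path"
    done
    if [ -n "$local_user" ]; then
        # Last line: a backslash followed by the local username.
        printf '\\%s\n' "$local_user" >> "$forward_path"
    fi
}

# Example call (user@example.com is a placeholder address):
# write_forward_file "$HOME/.forward" "$USER" user@example.com
```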

Monitoring

Monitoring Job Status

qstat
qstat [jobid]
The qstat utility allows users to display the status of jobs and list the batch jobs in queues. The operands of the qstat utility may be either job identifiers, queues (specified as destination identifiers), or batch server names. The other options of the qstat utility allow the user to control the amount of information displayed and the format in which it is displayed. The -f option allows users to request a “full” display of job, queue, or server information.
tracejob
tracejob jobid
PBS includes the tracejob utility to extract daemon/service logfile messages for a particular job (from all log files available on the local host) and print them sorted into chronological order.
Note: the third column of the display contains a single letter (S, M, A, or L) indicating the source of the log message (Server, MOM, Accounting, or scheduler log files).

Monitoring Cluster Status

pbsnodes
The pbsnodes command is used to query the status of hosts.
pbsnodes -l
Lists the nodes that are currently down.
pbsnodes -a
Lists extended information on each node, such as node state, currently running jobs, and resources.
Ganglia
Ganglia web monitoring is at http://ice1-admin.nmt.edu/ganglia (currently still firewalled).

Examples

MPI with MVAPICH2 - Interactive Batch Session - Walkthrough

MVAPICH2 MPI Hello World

Create hello.c

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv)
{
    int myRank, numProcs;
    char processorName[MPI_MAX_PROCESSOR_NAME];
    int nameLen;

    /* Start MPI before making any other MPI call. */
    MPI_Init(&argc, &argv);

    /* Find this process's rank, the total process count, and the host name. */
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Get_processor_name(processorName, &nameLen);

    /* Ranks start at 0; print them starting at 1. */
    printf("Hello World! From Process: %d/%d on %s\n",
           myRank + 1, numProcs, processorName);

    MPI_Finalize();

    return 0;
}

Load mvapich2_gcc module for compilation.

module load mvapich2_gcc

Compile hello.c with mpicc

mpicc hello.c -o hello

Request resources from the batch system to run the application.

Here we will be requesting an interactive batch session on 2 nodes each using all 8 processors (16 processors total).

qsub -I -l walltime=1:00:00,select=2:ncpus=8:mpiprocs=8

khan@service0:~> qsub -I -l walltime=1:00:00,select=2:ncpus=8:mpiprocs=8
qsub: waiting for job 100.service0-ib0 to start
qsub: job 100.service0-ib0 ready
khan@r1i0n0:~>

Reload mvapich2_gcc module for execution.

The request to the batch manager established a new shell session. We need to reload the environment with module.

module load mvapich2_gcc

Boot the MPD multiprocessing daemons.

mpdboot starts MPDs on the nodes you have access to. These MPDs make the nodes into a “virtual machine” that can run MPI programs. When you run an MPI program under mvapich2, requests are sent to MPD daemons to start up copies of the program.

Here we boot up on 2 nodes, and specify the nodes the batch system has allocated us.

mpdboot -n 2 -f $PBS_NODEFILE

Launch the hello program.

Now that the environment is set up, the MPI program can run. mpiexec will find the mpd environment and distribute processes across it. Here we are launching 16 processes, one for each CPU.

mpiexec -np 16 ./hello

khan@r1i0n0:~> mpiexec -np 16 ./hello
Hello World! From Process: 4/16 on r1i0n1
Hello World! From Process: 2/16 on r1i0n1
Hello World! From Process: 3/16 on r1i0n0
Hello World! From Process: 6/16 on r1i0n1
Hello World! From Process: 13/16 on r1i0n0
Hello World! From Process: 7/16 on r1i0n0
Hello World! From Process: 1/16 on r1i0n0
Hello World! From Process: 9/16 on r1i0n0
Hello World! From Process: 8/16 on r1i0n1
Hello World! From Process: 16/16 on r1i0n1
Hello World! From Process: 11/16 on r1i0n0
Hello World! From Process: 10/16 on r1i0n1
Hello World! From Process: 12/16 on r1i0n1
Hello World! From Process: 15/16 on r1i0n0
Hello World! From Process: 14/16 on r1i0n1
Hello World! From Process: 5/16 on r1i0n0

Shut down the MPI environment.

The mpd processes need to be shut down properly. If you leave the session while mpd processes are still running, they can prevent another mpdboot from working properly.

mpdallexit

Exit the batch session.

Simply log out of the shell given by the batch manager.

exit

khan@r1i0n0:~> exit
logout
qsub: job 100.service0-ib0 completed
khan@service0:~>