Celeritas

A reference

Revision History
Revision 0.01, 2006-Dec-12
First draft posted online.
Revision 0.02, 2007-Jan-14
Made some minor modifications, removed the To-Do section, added appropriate links, modified the MPI subsection.
Revision 0.03, 2007-Jan-17
Created the software listing table, added an example program that uses ATLAS.
Revision 0.04, 2007-Jan-18
All code listed here is also available at my home page. The credits section has been added.

Abstract

This documentation is intended to serve as a starter's guide and a quick reference to the Celeritas cluster at the Center for Computation and Technology, Louisiana State University.

The target audience is the students of Introduction to High Performance Computing (CSC 7600).


Table of Contents

1. System Architecture
Introduction
Network settings
Operating System
Hardware
Monitoring
2. Access To Celeritas
Student Accounts
Logging in
3. Introduction to Linux
The Shell and its commands
I am going to need some help with this ..
4. Available software
Myrinet Drivers
The TORQUE Resource Manager
The MAUI Process Scheduler
OpenMP
MPI
MPI-2
Performance API
TAU
Condor
ATLAS Libraries
5. Credits and Acknowledgements

List of Tables

4.1. Celeritas Software Listing

Chapter 1. System Architecture

Introduction

Celeritas is a traditional Beowulf cluster. The machine celeritas.cct.lsu.edu is the main front end, and 8 compute machines are connected to this node through a local area ethernet network. This chapter explores the internals of the cluster in more detail.

Network settings

The front end machine itself has 2 network interfaces: one for the wide area network, used for accessing the cluster from other computers, and one for the local area network, which connects the 8 compute nodes.

The IP address of the front end is 130.39.128.68 and it resolves to the DNS name celeritas.cct.lsu.edu. The machines are all connected through a gigabit ethernet switch. The /etc/hosts file lists the IP addresses of the compute nodes.

				$cat /etc/hosts
				#
				# Do NOT Edit (generated by dbreport)
				#
				127.0.0.1       localhost.localdomain   localhost
				192.168.1.1     celeritas.local celeritas # originally frontend-0-0
				192.168.1.254   compute-0-0.local compute-0-0 c0-0
				192.168.1.253   compute-0-1.local compute-0-1 c0-1
				192.168.1.252   compute-0-2.local compute-0-2 c0-2
				192.168.1.251   compute-0-3.local compute-0-3 c0-3
				192.168.1.250   compute-0-4.local compute-0-4 c0-4
				192.168.1.249   compute-0-5.local compute-0-5 c0-5
				192.168.1.248   compute-0-6.local compute-0-6 c0-6
				192.168.1.247   compute-0-7.local compute-0-7 c0-7
				130.39.128.68   celeritas.cct.lsu.edu
			

Students only have access to the front end itself. The compute machines, with hostnames compute-0-0 through compute-0-7, cannot be accessed directly; although they can be reached through the front end, there is little reason to do so. All your work will be done on the front end.

In addition to the gigabit ethernet switch, the machines are also connected using the myrinet interconnect. The command /home/packages/mx-1.1.5/bin/mx_info gives more information regarding the connection.

				$/home/packages/mx-1.1.5/bin/mx_info
					MX Version: 1.1.5
					MX Build: root@celeritas.cct.lsu.edu:/home/sources/mx-1.1.5 Thu Dec  7 20:57:47 CST 2006
					1 Myrinet board installed.
					The MX driver is configured to support up to 4 instances and 1024 nodes.
					===================================================================
						Instance #0:  224.9 MHz LANai, 132.9 MHz PCI bus, 2 MB SRAM
							Status:         Running, P0: Link up
							MAC Address:    00:60:dd:47:d8:fc
							Product code:   M3F-PCIXD-2 V2.2
							Part number:    09-03034
							Serial number:  284897
							Mapper:         00:60:dd:47:e9:4c, version = 0x55f25eee, configured
							Mapped hosts:   11
													ROUTE COUNT
					INDEX    MAC ADDRESS     HOST NAME                                P0
					-----    -----------     ---------                                ---
					   0) 00:60:dd:47:d8:fc celeritas.cct.lsu.edu:0                   1,1
					   1) 00:60:dd:47:e9:4c compute-1-0.local:0                       1,1
					   2) 00:60:dd:47:d8:1a compute-1-1.local:0                       1,1
					   3) 00:60:dd:47:d9:05 compute-0-0.local:0                       1,1
					   4) 00:60:dd:47:d8:fa compute-0-1.local:0                       1,1
					   5) 00:60:dd:47:d9:04 compute-0-2.local:0                       1,1
					   6) 00:60:dd:47:d9:01 compute-0-3.local:0                       1,1
					   7) 00:60:dd:47:d9:97 compute-0-4.local:0                       1,1
					   8) 00:60:dd:47:d9:03 compute-0-5.local:0                       1,1
					   9) 00:60:dd:47:d8:fb compute-0-6.local:0                       1,1
					  10) 00:60:dd:47:d8:f6 compute-0-7.local:0                       1,1
			

However, ethernet emulation over Myrinet is not enabled, so you cannot use the Myrinet interface for general TCP/IP activities. The Myrinet interface should be used solely for MPICH.

Users' home directories are located on an XFS-formatted 5 terabyte storage volume on the front end node under /home and are NFS-exported to the compute nodes. Consequently, your binaries need not be propagated to the individual compute nodes.

Most required software is installed under /home/packages and exported under the same path to the compute nodes.

Operating System

The Celeritas cluster runs the Linux kernel and has the Rocks Cluster Distribution installed. Rocks is a Linux distribution based on CentOS, with custom packages added and modifications that ease the deployment of Beowulf clusters.

The kernel has been compiled for 64 bit support and patched to support performance monitoring. It also has SMP support built in. Here are some other important details.

				$uname -a
				Linux celeritas.cct.lsu.edu 2.6.9-prep #1 SMP Thu Dec 7 20:32:47 CST 2006 x86_64 x86_64 x86_64 GNU/Linux
				$lsb_release -a
					LSB Version:    :core-3.0-amd64:core-3.0-ia32:core-3.0-noarch:graphics-3.0-amd64:graphics-3.0-ia32:graphics-3.0-noarch
					Distributor ID: CentOS
					Description:    CentOS release 4.4 (Final)
					Release:        4.4
					Codename:       Final
				$/lib64/libc.so.6
					GNU C Library stable release version 2.3.4, by Roland McGrath et al.
					Copyright (C) 2005 Free Software Foundation, Inc.
					This is free software; see the source for copying conditions.
					There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
					PARTICULAR PURPOSE.
					Compiled by GNU CC version 3.4.5 20051201 (Red Hat 3.4.5-2).
					Compiled on a Linux 2.4.20 system on 2006-08-13.
					Available extensions:
						GNU libio by Per Bothner
						crypt add-on version 2.1 by Michael Glad and others
						linuxthreads-0.10 by Xavier Leroy
						The C stubs add-on version 2.1.2.
						GNU Libidn by Simon Josefsson
						BIND-8.2.3-T5B
						libthread_db work sponsored by Alpha Processor Inc
						NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
					Thread-local storage support included.
					For bug reporting instructions, please see:
					<http://www.gnu.org/software/libc/bugs.html>
			

Hardware

All 9 machines are homogeneous SunFire X4200 servers. Each machine has two 70 GB SATA hard disks configured as a RAID mirror, which holds a copy of the operating system. Additionally, the front end machine has a 5 terabyte Apple Xserve RAID storage array attached, which houses the users' home directories.

Each machine has 2 dual core AMD Opteron 64 bit processors in an SMP configuration, for a total of 4 processing cores per node.

Each machine has 8 GB of shared memory available to the processing cores.

Here are the relevant commands and the corresponding outputs.

				$cat /proc/cpuinfo
				processor       : 3
				vendor_id       : AuthenticAMD
				cpu family      : 15
				model           : 33
				model name      : Dual Core AMD Opteron(tm) Processor 285
				stepping        : 2
				cpu MHz         : 2592.664
				cache size      : 1024 KB
				fpu             : yes
				fpu_exception   : yes
				cpuid level     : 1
				wp              : yes
				flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni
				bogomips        : 5184.51
				TLB size        : 1088 4K pages
				clflush size    : 64
				cache_alignment : 64
				address sizes   : 40 bits physical, 48 bits virtual
				power management: ts fid vid ttp
			

The above block appears four times in the full output, once for each processing core (processor 0 through processor 3); only the last one is shown here.

			
				$cat /proc/meminfo
				MemTotal:      8046284 kB
				MemFree:       3652576 kB
				Buffers:        149376 kB
				Cached:        3835220 kB
				SwapCached:          0 kB
				Active:        3442996 kB
				Inactive:       661932 kB
				HighTotal:           0 kB
				HighFree:            0 kB
				LowTotal:      8046284 kB
				LowFree:       3652576 kB
				SwapTotal:     2096472 kB
				SwapFree:      2096472 kB
				Dirty:             140 kB
				Writeback:           0 kB
				Mapped:         161516 kB
				Slab:           263932 kB
				Committed_AS:   738012 kB
				PageTables:       6672 kB
				VmallocTotal: 536870911 kB
				VmallocUsed:      4560 kB
				VmallocChunk: 536865787 kB
				HugePages_Total:     0
				HugePages_Free:      0
				Hugepagesize:     2048 kB
				
				$df -h
				Filesystem            Size  Used Avail Use% Mounted on
				/dev/sda1              20G  6.1G   13G  33% /
				none                  3.9G     0  3.9G   0% /dev/shm
				/dev/sda4              42G   11G   30G  27% /export
				/dev/sda2             3.9G  532M  3.2G  15% /var
				tmpfs                 1.9G  4.2M  1.9G   1% /var/lib/ganglia/rrds
				/dev/mapper/VolGroup01-RaidLV01
				                      5.4T  5.0G  5.4T   1% /home
			

Monitoring

Rocks ships with an excellent tool, Ganglia, which allows users to monitor all nodes in the cluster through a web based interface. You are welcome to visit the Ganglia monitoring page for Celeritas to have a look at CPU load and other details.

Chapter 2.  Access To Celeritas

Student Accounts

Celeritas is primarily meant for the CSC 7600 course. As a student of the course, you will be assigned an ID of the form cs7600xx and a password. This username and password are consistent across Celeritas, SuperMike and the online discussion forums. Your password is a randomly generated alphanumeric sequence. Do not attempt to change it on any machine; you will receive an error.

Logging in

Access to Celeritas is through SSH only. SSH is a secure, encrypted protocol. You will need an SSH client on your machine to access Celeritas.

If you run Mac OS X or any Linux distribution on your machine, you already have a built in SSH client.

On Mac OS X, launch Terminal.app (Applications->Utilities->Terminal.app).

On Linux, launch any terminal you are familiar with.

On Windows, you will need a small executable called PuTTY, which can be downloaded for free. Run it, and you will be shown a screen asking for the hostname of the machine you want to connect to. Enter celeritas.cct.lsu.edu and you are good to go.
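For example, from a Mac OS X or Linux terminal you connect with the ssh command, replacing cs7600xx with the ID you were assigned; PuTTY users enter the same hostname in its connection dialog.

$ssh cs7600xx@celeritas.cct.lsu.edu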

The first time you log in, you will be shown the following.

				username@celeritas.cct.lsu.edu's password:
				Creating directory '/home/username'.
				Last login: Sun Dec 10 17:21:31 2006 from px08.cct.lsu.edu
				Rocks 4.2.1 (Cydonia)
				Profile built 01:17 08-Nov-2006
				
				Kickstarted 20:01 07-Nov-2006
				Rocks Frontend Node - Celeritas Cluster
				
				It doesn't appear that you have set up your ssh key.
				This process will make the files:
				     /home/username/.ssh/id_rsa.pub
				     /home/username/.ssh/id_rsa
				     /home/username/.ssh/authorized_keys
				
				Generating public/private rsa key pair.
				Created directory '/home/username/.ssh'.
				Your identification has been saved in /home/username/.ssh/id_rsa.
				Your public key has been saved in /home/username/.ssh/id_rsa.pub.
				The key fingerprint is:
				45:da:ee:54:03:d8:2a:75:c9:18:31:09:02:42:02:3e username@celeritas.cct.lsu.edu
				[username@celeritas ~]$
			

This process generates the SSH keys that enable applications like MPICH to run correctly. Do not worry if you are not familiar with SSH keys; they are covered later in this documentation.

At this point, you are logged in to Celeritas. As mentioned earlier, Celeritas runs Linux. If you are familiar with Linux command line tools, you can jump forward to the chapter on the available software. Otherwise, you are encouraged to read the chapter on familiarizing yourself with Linux.

Chapter 3.  Introduction to Linux

Linux has far too many commands to cover comprehensively here. This chapter is neither an exhaustive command listing nor a tutorial; it provides only a very basic subset of the available Linux commands, limited to those essential to this course.

The Shell and its commands

When you log in and observe the following

[username@celeritas ~]$

what you are seeing is the Bash shell, a command line interpreter. The $ is called the 'prompt', and it is waiting for user input. At this point you enter commands, and the shell processes them and returns output depending on the command.

Here are some of the commonly used commands that are worth familiarizing yourself with; a short example session follows the list.

pwd
This command returns the directory you are currently in. By default, when you log in, you are in your home directory, and this command returns precisely that.
ls
This command lists the contents of the current directory. Initially it will be empty, as you have not created any files. However, as you add files and folders over the course of the semester, you will find yourself using this command quite often.
cd directory
This command changes the current working directory to the specified one. This, of course, assumes that the directory exists in the first place. Executing cd without specifying a directory takes you back to your home directory. pwd will let you know where you currently are.
mkdir name
This command creates an empty directory with the specified name.
less filename
This command displays the contents of the file filename on the screen. You cannot edit the file while it is being displayed. Use the up and down arrow keys to scroll through the output.
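As a short illustration, here is what a session using these commands might look like (the directory name work is only an example):

			[username@celeritas ~]$ pwd
			/home/username
			[username@celeritas ~]$ mkdir work
			[username@celeritas ~]$ ls
			work
			[username@celeritas ~]$ cd work
			[username@celeritas work]$ pwd
			/home/username/work
			[username@celeritas work]$ cd
			[username@celeritas ~]$ less /etc/hosts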

I am going to need some help with this ..

What is the level of familiarity that we can expect from the students?

Chapter 4.  Available software

All software listed here is either part of the Rocks installation or has additionally been compiled from source. The sources for the packages, where applicable, are available under /home/sources.

Table 4.1. Celeritas Software Listing

Linux kernel sources 2.6.9, installed under /home/packages/actualKernel
Patched with perfctr to enable hardware counters and performance monitoring.
ATLAS 3.6.0, installed under /home/sources/ATLAS
Automatically Tuned Linear Algebra Software; provides linear algebra routines optimized for the hardware.
Myrinet drivers (MX) 1.1.5, installed under /home/packages/mx-1.1.5
Drivers and libraries for the Myrinet interconnect. Source available from Myri on request.
Performance API (PAPI) 3.5.0, installed under /home/packages/papi-3.5.0
PAPI provides a consistent interface and methodology for using the performance counter hardware found in most major microprocessors.
Tuning and Analysis Utilities (TAU) 2.16, installed under /home/packages/tau-2.16
TAU is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java and Python.
Program Database Toolkit (PDT) 3.9, installed under /home/packages/pdtoolkit-3.9
PDT is a framework for analyzing source code written in several programming languages and for making rich program knowledge accessible to developers of static and dynamic analysis tools.
Linux Performance-Monitoring Counters Driver (perfctr) 2.6.22, installed under /home/packages/perfctr-2.6.x
Adds support to the Linux kernel (2.4.16 or newer) for using the Performance-Monitoring Counters (PMCs) found in many modern processors.
Condor 6.8.0, installed under /opt/condor
Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, it provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring and resource management.
Ganglia 2, installed under /opt/ganglia
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters.
Maui 3.2.5, installed under /opt/maui
Open source job scheduler.
Torque 2.1.5-1, installed under /opt/torque
TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project.
Intel Compilers 9.1, installed under /usr/local/compilers/intel_{fc,cc}_91/
Licensed Intel C and Fortran compilers. Only these compilers support OpenMP; the GNU compilers do not.
MPICH1 - Ethernet - GNU 1.2.7p1, installed under /home/packages/mpich1-eth-ch_p4
MPICH1 compiled with GCC/G77 for the ethernet interconnect.
MPICH1 - Myrinet - GNU 1.2.7p1, installed under /home/packages/mpich1-mx-ch_p4
MPICH1 compiled with GCC/G77 for the Myrinet interconnect.
MPICH1 - Ethernet - Intel 1.2.7p1, installed under /home/packages/mpich1-eth-ch_p4-intel
MPICH1 compiled with icc/ifort for the ethernet interconnect. mpif90 (Fortran 90) is supported.
MPICH1 - Myrinet - Intel 1.2.7p1, installed under /home/packages/mpich1-mx-ch_p4-intel
MPICH1 compiled with icc/ifort for the Myrinet interconnect. mpif90 (Fortran 90) is supported.
MPICH2 1.0.4p1, installed under /home/packages/mpich2-ssm
MPICH implementation that supports the MPI-2 specifications. Compiled with the GNU compilers for the ethernet interconnect; uses the sockets and shared memory (ssm) communication method.

Additionally, the software sources are available under /home/source_listing.

Myrinet Drivers

The sources for the Myrinet drivers are available from the Myricom website, http://www.myri.com. These sources are not available under /home/sources, as you have to request access to them from Myricom.

The driver header files, libraries and related binaries are present in /home/packages/mx-1.1.5. You are welcome to read the README in the bin subdirectory of the above folder, and execute some of the benchmarks within the folder to learn more about the Myrinet interconnect. There are tests within the bin/tests subdirectory that allow users to measure latency and bandwidth performance.

Of particular interest are the mx_pingpong and mx_stream commands in the bin/tests directory; see the README for more details. There are additional tools that report various details of the Myrinet connection, such as bandwidth and latency.

The TORQUE Resource Manager

PBS, the Portable Batch System, is a job scheduler. The version of PBS installed on Celeritas is the TORQUE Resource Manager (Terascale Open-Source Resource and QUEue Manager), an open source fork of OpenPBS version 2.3.12 maintained by Cluster Resources. Torque is responsible for scheduling jobs for execution across the Celeritas nodes.

You should not run your executables directly on the head node. Remember, you are not the only user on the cluster. In order to ensure that every student gets his/her fair share of the CPU time, you should always submit your job to the queue and let the resource manager handle the requests.

In order to submit your job to the queue, you put all the required details in a script file and submit that file. A script file is nothing more than a plain text file containing commands and configuration details. Let's jump right in and start writing our PBS script. While doing so, let's add as many PBS directives as we can, so that this file can serve as a quick and handy reference.

		## The name of the job
		#PBS -N name_of_submitter
		## Request 0 hours, 5 minutes, 0 seconds of walltime
		#PBS -l walltime=00:05:00
		## The output of stdout is sent to outputFile
		#PBS -o outputFile
		## The output of stderr is sent to errorFile
		#PBS -e errorFile
		## If the job fails, DO NOT rerun it
		#PBS -r n
		## Request 4 nodes, and 2 processors (out of the available 4) on each node
		#PBS -l nodes=4:ppn=2
		## Each comment starts with '##' and each directive to PBS starts with '#PBS'
		
		## Immediately after the lines containing the PBS directives, enter the commands you want executed.
		## Type here the same statements that you would type if you were executing them at the command line.
				
		## Let's have some sample commands
		ls
		hostname
		
		## The output of the above commands will be redirected to outputFile
		
		## When your programs are executing, they have access to certain environment variables that you will
		## need to reference.
		## The most important of them, and the only one we will need, is PBS_NODEFILE.
		## Let's list the contents of that file
		cat $PBS_NODEFILE
				
		## If you are running an MPI program compiled with mpicc, execute the following
		## mpirun -np Q -machinefile $PBS_NODEFILE name_of_executable
		## The -np argument gives the number of processes to spawn, which is generally the product of the
		## number of nodes you requested and the processors per node you requested earlier.
		## The -machinefile argument, $PBS_NODEFILE, comes from the PBS environment, and it lists the machines
		## that the PBS scheduler has assigned to your job.
		
		## Just to give the impression that we are "computing" for a while, let's ask the program to sleep for a bit
		sleep 20
		## This will give us time to test MAUI commands while our program is "executing"
		
		

Obtain the above file by executing

$wget http://cct.lsu.edu/~hsunda3/samples/sample.pbs

(copy and paste the above command in your terminal to obtain the file)

You can either create a new batch file or use sample.pbs. Let's submit this job to the queue and let it "compute".

$qsub sample.pbs

At this point, you could execute

$qstat 

to view a listing of the currently submitted jobs to the queue.

The MAUI Process Scheduler

While TORQUE is the resource manager used on Celeritas, the actual scheduling of jobs on the cluster is the responsibility of Maui, a cluster scheduler also built and distributed by Cluster Resources. Maui integrates with Torque and runs the commands in your PBS submit script on the machines allocated to it by Torque.

While Maui functions transparently to the user, there are a few Maui commands that you will find useful.

$showq  
Similar to qstat but gives a more detailed output
$showstart  jobid
Gives you an estimate on when your process will start
$canceljob  jobid
Cancels your job from the queue.
$checkjob  jobid
Checks the status of your job

Once the job has finished executing, the output and error files will be available in the directory from which you submitted the job.
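For example, while the sample job from the previous section is still "computing" (sleeping), a session might look like the following; the job id shown is only illustrative:

				$qsub sample.pbs
				1234.celeritas.cct.lsu.edu
				$showq
				$showstart 1234
				$checkjob 1234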

OpenMP

The OpenMP API supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran. It consists of a set of compiler directives, library routines and environment variables.

Here is a sample code that uses OpenMP directives.

			#include <omp.h>
			#include <stdio.h>

			int main (int argc, char *argv[]) {
			  int id, nthreads;
			  #pragma omp parallel private(id)
			  {
			    id = omp_get_thread_num();
			    printf("Hello World from thread %d\n", id);
			    #pragma omp barrier
			    if ( id == 0 ) {
			      nthreads = omp_get_num_threads();
			      printf("There are %d threads\n",nthreads);
			    }
			  }
			  return 0;
			}		
		

Obtain the above file by executing

$wget http://cct.lsu.edu/~hsunda3/samples/openmp.c

(copy and paste the above command in your terminal to obtain the file)

On Celeritas, the Intel Compilers support OpenMP. gcc has OpenMP support from version 4.2.0 onwards, and Celeritas has gcc version 3.4.6.

Compile your program and acquire your executable using

$icc -o executable -openmp openmp.c

Now, we submit our executable to PBS. Let's reuse the script from earlier; however, we need to make a few modifications. (A sketch of the complete modified script follows the list below.)

  1. Remember, OpenMP is for shared memory programming. Therefore we need to request only one node, but more than one processor on that node.

    This is accomplished by the following line:

    #PBS -l nodes=1:ppn=4

    This ensures we get only one node, with multiple processors sharing memory on that node.

  2. Next, we need to tell OpenMP how many threads we intend to use. This is done by setting the environment variable OMP_NUM_THREADS.

    Include this line in your PBS script:

    export OMP_NUM_THREADS=4

    to use 4 threads. The export keyword ensures that the variable is visible to your executable. This value, of course, is determined by the number of processors you requested in the PBS script.

    A neater way to do this is to use the PBS environment variable $PBS_NODEFILE. That file lists the machines on which you have been given processor time, one line per processor. The number of lines in that file therefore tells you the number of processors you have acquired, and it is a good idea to set the number of threads to that value. That is accomplished by this line:

    export OMP_NUM_THREADS=`cat $PBS_NODEFILE | wc -l`

  3. The final line in your script is of course the name of the executable itself.
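Putting these modifications together, a complete OpenMP submission script might look like the following sketch; the job name and the executable name are placeholders, and $PBS_O_WORKDIR is the standard Torque variable holding the directory the job was submitted from.

		#PBS -N openmp_job
		#PBS -l walltime=00:05:00
		## One node, all 4 processors on it
		#PBS -l nodes=1:ppn=4
		#PBS -o outputFile
		#PBS -e errorFile
		
		## Change to the directory the job was submitted from
		cd $PBS_O_WORKDIR
		
		## One thread per processor granted by PBS
		export OMP_NUM_THREADS=`cat $PBS_NODEFILE | wc -l`
		
		## Run the executable built with icc -openmp
		./executable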

Submit this file to PBS and wait for your output.

MPI

MPICH is a freely available, portable implementation of the MPI standard (message passing for distributed memory applications) developed at the Argonne National Laboratory.

On Celeritas, 4 different variants of MPICH exist. They were compiled with either the GCC or the Intel compilers, and for either the ethernet or the Myrinet interconnect.

The choice of implementation is made by setting a variable in your ~/.bashrc file. Read through the comments in that file and choose the desired implementation by setting the mpi_path variable appropriately.
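As a purely illustrative sketch, selecting the GNU/ethernet build could look like the lines below; the exact variable usage is described by the comments already present in your ~/.bashrc, and the installation paths come from Table 4.1.

		## In ~/.bashrc: point mpi_path at the desired MPICH installation
		mpi_path=/home/packages/mpich1-eth-ch_p4
		
		## Illustrative only: make sure that implementation's mpicc and mpirun
		## are the ones found first on your PATH
		export PATH=$mpi_path/bin:$PATH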

Here's a sample program listing that can be used to test the MPI implementation.

			#include <stdio.h>
			#include <mpi.h>

			int main( int argc, char *argv[] )
			{

			  int rank, length;
			  char name[BUFSIZ];

			  MPI_Init(&argc, &argv);

			  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
			  MPI_Get_processor_name (name, &length);
			  printf( "Hello, World! Processor -> %s Rank-> %d!\n", name, rank);
			  MPI_Finalize();

			  return 0;
			}			
		
		

Obtain the above file by executing

$wget http://cct.lsu.edu/~hsunda3/samples/mpi.c

(copy and paste the above command in your terminal to obtain the file)

Let's compile it and get our executable

				$mpicc -o executable mpi.c
			

Now, before we run it, we need to get our PBS script ready. Note that the command you executed earlier to set up the MPI environment will have to be entered in the PBS script as well.

While most details of the PBS script remain the same, note that the processors on which we want to run our executable are listed in $PBS_NODEFILE. This file must therefore be passed as an argument to mpirun.

mpirun -np `cat $PBS_NODEFILE | wc -l` -machinefile $PBS_NODEFILE executable

Again, the `cat $PBS_NODEFILE | wc -l` simply counts the number of lines in $PBS_NODEFILE, which is nothing but the product of the number of nodes you requested and the number of processors per node.

The -np switch indicates the number of processes to spawn, taken here from the number of lines in $PBS_NODEFILE, and the -machinefile switch gives the hostnames of the machines on which to run the MPI program.
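Putting this together, a minimal MPI submission script might look like the following sketch; the job name and executable name are placeholders.

		#PBS -N mpi_job
		#PBS -l walltime=00:05:00
		#PBS -l nodes=4:ppn=2
		#PBS -o outputFile
		#PBS -e errorFile
		
		## Change to the directory the job was submitted from
		cd $PBS_O_WORKDIR
		
		## Spawn one process per processor listed in $PBS_NODEFILE
		mpirun -np `cat $PBS_NODEFILE | wc -l` -machinefile $PBS_NODEFILE ./executable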

Submit your job to PBS, and wait for the output.

MPI-2

MPI-2 is an extension of the originally developed MPI-1 specification. MPICH2 is, again, a freely available library developed by Argonne National Laboratory. There is no MPI-2 implementation from Myricom for MPICH yet, and therefore the MPICH2 on Celeritas uses the ethernet interconnect only.

By definition, all MPI-1 programs are valid MPI-2 programs as well. Consequently, we can continue to use our existing mpi.c source code for trying out MPICH-2 as well.

There have been some major changes in the way MPI processes are spawned under MPI-2. In order to spawn processes, you need to start mpd, a daemon that runs in the background and that your processes connect to.

Here are typical MPI-2 related PBS script commands.

				mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
				mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
				mpdallexit
			

The first line starts the daemons. The totalnum flag gives the number of machines on which you want an mpd started. Since $PBS_NODEFILE lists each machine once per processor, we need to count the number of unique machines in $PBS_NODEFILE; that is what the pipe through uniq does. The -f flag gives the file listing the machine names. mpd is started only once on each machine, so this file can be passed directly to -f.

The second line executes your program. Note the use of mpiexec as opposed to mpirun. While mpirun is provided for legacy purposes, mpiexec is the preferred way to spawn processes, as it correctly ties in with the mpd daemons you started earlier. The -n flag gives the number of processes you want started.

The final line simply shuts down the mpd daemons that you started earlier.
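A complete MPICH2 submission script might therefore look like the following sketch, assuming a.out was built with the MPICH2 mpicc.

		#PBS -N mpich2_job
		#PBS -l walltime=00:05:00
		#PBS -l nodes=4:ppn=2
		#PBS -o outputFile
		#PBS -e errorFile
		
		cd $PBS_O_WORKDIR
		
		## Start one mpd per unique machine, run the program, then shut the daemons down
		mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
		mpiexec -n `cat $PBS_NODEFILE | wc -l` ./a.out
		mpdallexit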

Performance API

PAPI is a tool that enables programmers to see the relation between software performance and processor events, and it is widely used to collect low level performance metrics. PAPI provides predefined high level hardware events summarized from popular processors, as well as direct access to the low level native events of a particular processor. Counter multiplexing and overflow handling are also supported.

Operating system support for accessing hardware counters is needed to use PAPI. The kernel that Celeritas is running has been patched for perfctr support.

You are welcome to read the PAPI documentation for more details.

Here is a sample program taken from the PAPI documentation, along with instructions on how to compile it.

			#include <papi.h>
			#include <stdio.h>
			#include <stdlib.h>

			int main(void)
			{
			  const PAPI_hw_info_t *hwinfo = NULL;

			  /* Initialize the PAPI library */
			  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
			    exit(1);

			  /* Query the hardware information */
			  if ((hwinfo = PAPI_get_hardware_info()) == NULL)
			    exit(1);

			  printf("%d CPU's at %f Mhz.\n", hwinfo->totalcpus, hwinfo->mhz);
			  return 0;
			}
		
		

Obtain the above file by executing

$wget http://cct.lsu.edu/~hsunda3/samples/papi.c

(copy and paste the above command in your terminal to obtain the file)

				$gcc -I $PAPI_INC papi.c -L $PAPI_LIB -lpapi
			

In order to compile this with GCC, the compiler needs to be told where the include files and the necessary libraries are. The variables $PAPI_INC and $PAPI_LIB are predefined for this purpose. Note that -lpapi comes after the source file so that the linker can resolve the PAPI symbols.

Run the executable inside a PBS script as usual.

PAPI can be used in MPI programs as well. The following is an example program that combines PAPI with MPI; the PBS script can be modified appropriately for MPICH1 or MPICH2, over ethernet or Myrinet.

			#include <papi.h>
			#include <mpi.h>
			#include <math.h>
			#include <stdio.h>
			#include <stdlib.h>

			/* Minimal error handler: report the PAPI failure and abort */
			void handle_error(int code)
			{
			  fprintf(stderr, "PAPI error, code %d\n", code);
			  exit(1);
			}

			int main(int argc, char *argv[])
			{
			  int done = 0, n, myid, numprocs, i, retval, EventSet = PAPI_NULL;
			  double PI25DT = 3.141592653589793238462643;
			  double mypi, pi, h, sum, x;
			  long_long values[1] = {(long_long) 0};

			  MPI_Init(&argc, &argv);
			  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
			  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

			  /* Initialize the PAPI library */
			  retval = PAPI_library_init(PAPI_VER_CURRENT);
			  if (retval != PAPI_VER_CURRENT) {
			    fprintf(stderr, "PAPI library init error!\n");
			    exit(1);
			  }

			  /* Create an EventSet */
			  if (PAPI_create_eventset(&EventSet) != PAPI_OK)
			    handle_error(1);

			  /* Add Total Instructions Executed to our EventSet */
			  if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK)
			    handle_error(1);

			  /* Start counting */
			  if (PAPI_start(EventSet) != PAPI_OK)
			    handle_error(1);

			  while (!done)
			  {
			    if (myid == 0) {
			        printf("Enter the number of intervals: (0 quits) ");
			        if (scanf("%d", &n) != 1)
			            n = 0;   /* no input (e.g. running in batch): quit */
			    }
			    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
			    if (n == 0) break;

			    /* Approximate pi by numerical integration of 4/(1+x^2) over [0,1] */
			    h   = 1.0 / (double) n;
			    sum = 0.0;
			    for (i = myid + 1; i <= n; i += numprocs) {
			        x = h * ((double)i - 0.5);
			        sum += 4.0 / (1.0 + x*x);
			    }
			    mypi = h * sum;

			    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

			    if (myid == 0)
			        printf("pi is approximately %.16f, Error is %.16f\n",
			               pi, fabs(pi - PI25DT));
			  }

			  /* Read the counters */
			  if (PAPI_read(EventSet, values) != PAPI_OK)
			    handle_error(1);

			  printf("After reading counters: %lld\n", values[0]);

			  /* Stop the counters */
			  if (PAPI_stop(EventSet, values) != PAPI_OK)
			    handle_error(1);
			  printf("After stopping counters: %lld\n", values[0]);

			  MPI_Finalize();
			  return 0;
			}

		

Obtain the above file by executing

$wget http://cct.lsu.edu/~hsunda3/samples/papi_mpi.c

(copy and paste the above command in your terminal to obtain the file)

As was the case with earlier MPI, you will have to set the mpi_path variable in your ~/.bashrc

In addition, to compile the program you will have to execute mpicc with the same arguments you gave to gcc; mpicc is merely a wrapper around gcc that pulls in the appropriate MPI libraries.
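For example, building the PAPI + MPI program above might look like the following; the same $PAPI_INC and $PAPI_LIB variables are assumed, and -lm is added for the math library.

				$mpicc -I $PAPI_INC papi_mpi.c -L $PAPI_LIB -lpapi -lm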

TAU

Until I learn something about Tau, this section will remain empty.

Condor

Condor is a software framework for coarse-grained distributed parallelization of computationally intensive tasks. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers, so-called cycle scavenging.

Condor is developed by the Condor team at the University of Wisconsin-Madison and is freely available for use.

Condor can run both sequential and parallel jobs. Sequential jobs can be run in several different "universes", including "vanilla" which provides the ability to run most "batch ready" programs, and "standard universe" in which the target application is re-linked with the Condor I/O library which provides for remote job I/O and job checkpointing. Condor also provides a "local universe" which allows jobs to run on the "submit host".

On Celeritas, only the Vanilla universe is installed. Consequently, you can't link your programs with Condor libraries, and Condor on Celeritas doesn't support checkpointing and remote system calls.

To get started with a quick example, write a simple Hello World program and compile it into an executable named simple. Now create a submit script for Condor:

			Universe = vanilla
			Executable = simple
			Arguments = <if you have any command line arguments to pass to your executable>
			Log = Simple.log
			Output = Simple.out
			Error = Simple.error
			Queue
		

Obtain the above file by executing

$wget http://cct.lsu.edu/~hsunda3/samples/sample.condor

(copy and paste the above command in your terminal to obtain the file)

Now submit this script by executing

				$condor_submit sample.condor
			

and watch it enter the queue with

				$condor_q 
			

Additionally, the condor_status command reveals more information as well.

Eventually your job will complete; all statistics will be logged in Simple.log, and the output will appear in Simple.out. The log file will also tell you where (on which node) your job was executed. Additionally, you will be informed that you have new mail; you can ignore this message.

ATLAS Libraries

ATLAS is a software library for linear algebra. It provides an open source implementation of the BLAS APIs for C and F77.

The required header files and the libraries for compiling programs that use the ATLAS routines are available in /home/sources/ATLAS/include and /home/sources/ATLAS/lib/Linux_HAMMER64SSE2_4 respectively.

One important point to note while compiling programs is the order in which the libraries are linked. The dependencies between the libraries mean that the link order must be liblapack, libcblas (for C programs) and libatlas.

An example will make this clear.

			#include <atlas_enum.h>
			#include "clapack.h"
			#include <stdio.h>

			double m[] = {
			  3, 1, 3,
			  1, 5, 9,
			  2, 6, 5
			};

			double x[] = {
			  -1, 3, -3
			};

			int
			main ()
			{
			  int  ipiv[3];
			  int  i, j;
			  int  info;

			  for (i=0; i<3; ++i) {
				for (j=0; j<3; ++j)  printf ("%5.1f", m[i*3+j]);
				putchar ('\n');
			  }

			  info = clapack_dgesv (CblasRowMajor, 3, 1, m, 3, ipiv, x, 3);
			  if (info != 0)  fprintf (stderr, "failure with error %d\n", info);

			  for (i=0; i<3; ++i)  printf ("%5.1f %3d\n", x[i], ipiv[i]);

			  return 0;
			}
		

Obtain the above file by executing

$wget http://cct.lsu.edu/~hsunda3/samples/algebra.c

(copy and paste the above command in your terminal to obtain the file)

The above code uses the clapack_dgesv routine from the LAPACK library and the atlas_enum header file. In order to let gcc know the location of the header files and the libraries, use the following syntax.

				gcc -I $ATLAS_INC algebra.c -L $ATLAS_LIB -llapack -lcblas -latlas
			

The ATLAS_INC and ATLAS_LIB environment variables have been defined for you when you log in to the system. Additionally, you will see this again while running the LINPACK benchmark, which also uses the lapack, cblas and atlas libraries.

Chapter 5. Credits and Acknowledgements

I would like to sincerely thank Dr. Thomas Sterling and Dr. Maciej Brodowicz for giving me this opportunity to play around with such powerful toys. It has truly been a wonderful experience watching Celeritas grow and evolve from a single Sun server that wouldn't allow me to install Red Hat to a fully functional HPC machine with a variety of tools. In Dr. Sterling's own words, I am probably one of the richest undergrads on campus in terms of computing power, and I am indeed very grateful for that.

No amount of gratitude will be sufficient for Ravi Parachuri and Sridhar Karra here at CCT, without whom Celeritas probably could not have been a 'cluster' in any sense of the word. They have been of great help throughout the time I have been associated with the cluster.

A great amount of thanks must also go to the team that created the Rocks cluster distribution. The distribution certainly enhanced the ease with which the cluster was set up. Naturally, no project of this magnitude exists without glitches, and the members of the npaci-rocks-discuss mailing list have been most helpful in assisting me in troubleshooting the problems that arose along the way.

And of course, my greatest thanks to Dr. Gabrielle Allen for providing me with this opportunity to be associated with CCT and to work on something this exciting.