Revision History

Revision | Date | Description |
---|---|---|
Revision 0.01 | 2006-Dec-12 | First draft posted online |
Revision 0.02 | 2007-Jan-14 | Made some minor modifications, removed the To-Do section, added appropriate links, modified the MPI subsection |
Revision 0.03 | 2007-Jan-17 | Created a table of software listings, added an example code that uses ATLAS |
Revision 0.04 | 2007-Jan-18 | All code listed here is also available at my home page; added the Credits section |
Abstract
This documentation is intended to serve as a starter's guide and a quick reference to the Celeritas cluster at the Center for Computation and Technology, Louisiana State University.
The targeted audience is the students of Introduction to High Performance Computing, CSC 7600.
Celeritas is a traditional Beowulf cluster. The machine celeritas.cct.lsu.edu is the main front-end machine, and 8 compute machines are connected to it through a local area ethernet network. This chapter explores the internals of the cluster.
The front-end machine has 2 network interfaces: one for the wide area network, used for accessing the cluster from other computers, and one for the local area network, which connects the 8 compute nodes.
The IP address of the front end is 130.39.128.68, and it resolves to the DNS name celeritas.cct.lsu.edu. The machines are all connected through a gigabit ethernet switch. The /etc/hosts file lists the IP addresses of the compute nodes.
$cat /etc/hosts
#
# Do NOT Edit (generated by dbreport)
#
127.0.0.1       localhost.localdomain   localhost
192.168.1.1     celeritas.local celeritas       # originally frontend-0-0
192.168.1.254   compute-0-0.local compute-0-0 c0-0
192.168.1.253   compute-0-1.local compute-0-1 c0-1
192.168.1.252   compute-0-2.local compute-0-2 c0-2
192.168.1.251   compute-0-3.local compute-0-3 c0-3
192.168.1.250   compute-0-4.local compute-0-4 c0-4
192.168.1.249   compute-0-5.local compute-0-5 c0-5
192.168.1.248   compute-0-6.local compute-0-6 c0-6
192.168.1.247   compute-0-7.local compute-0-7 c0-7
130.39.128.68   celeritas.cct.lsu.edu
Students only have access to the front end itself. The compute machines, with hostnames compute-0-0 through compute-0-7, cannot be accessed directly, and although they can be reached through the front end, there is little incentive to do so. All your work will be done on the front end.
In addition to the gigabit ethernet switch, the machines are also connected using the myrinet interconnect. The command /home/packages/mx-1.1.5/bin/mx_info gives more information regarding the connection.
$/home/packages/mx-1.1.5/bin/mx_info
MX Version: 1.1.5
MX Build: root@celeritas.cct.lsu.edu:/home/sources/mx-1.1.5 Thu Dec 7 20:57:47 CST 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 132.9 MHz PCI bus, 2 MB SRAM
        Status:         Running, P0: Link up
        MAC Address:    00:60:dd:47:d8:fc
        Product code:   M3F-PCIXD-2 V2.2
        Part number:    09-03034
        Serial number:  284897
        Mapper:         00:60:dd:47:e9:4c, version = 0x55f25eee, configured
        Mapped hosts:   11

                                                  ROUTE COUNT
INDEX    MAC ADDRESS       HOST NAME                   P0
-----    -----------       ---------                  ---
   0) 00:60:dd:47:d8:fc  celeritas.cct.lsu.edu:0      1,1
   1) 00:60:dd:47:e9:4c  compute-1-0.local:0          1,1
   2) 00:60:dd:47:d8:1a  compute-1-1.local:0          1,1
   3) 00:60:dd:47:d9:05  compute-0-0.local:0          1,1
   4) 00:60:dd:47:d8:fa  compute-0-1.local:0          1,1
   5) 00:60:dd:47:d9:04  compute-0-2.local:0          1,1
   6) 00:60:dd:47:d9:01  compute-0-3.local:0          1,1
   7) 00:60:dd:47:d9:97  compute-0-4.local:0          1,1
   8) 00:60:dd:47:d9:03  compute-0-5.local:0          1,1
   9) 00:60:dd:47:d8:fb  compute-0-6.local:0          1,1
  10) 00:60:dd:47:d8:f6  compute-0-7.local:0          1,1
However, ethernet emulation over myrinet is not enabled, so you cannot use the myrinet interface for general TCP/IP traffic. The myrinet interface should be used solely for MPICH.
Users' home directories are located on an XFS-formatted 5-terabyte storage space on the front-end node under /home and are NFS-exported to the compute nodes. Consequently, your binaries need not be propagated to the individual compute nodes.
Most of the required software is installed under /home/packages and is exported under the same path to the compute nodes.
The Celeritas cluster runs the Linux kernel and has the Rocks Cluster Distribution installed. The Rocks Cluster Distribution is a Linux distribution based on CentOS, with custom packages added and modifications that ease the deployment of Beowulf clusters.
This kernel has been compiled for 64-bit support and has been patched to support performance monitoring. The kernel also has SMP support built in. Here are some other important details.
$uname -a
Linux celeritas.cct.lsu.edu 2.6.9-prep #1 SMP Thu Dec 7 20:32:47 CST 2006 x86_64 x86_64 x86_64 GNU/Linux

$lsb_release -a
LSB Version:    :core-3.0-amd64:core-3.0-ia32:core-3.0-noarch:graphics-3.0-amd64:graphics-3.0-ia32:graphics-3.0-noarch
Distributor ID: CentOS
Description:    CentOS release 4.4 (Final)
Release:        4.4
Codename:       Final

$/lib64/libc.so.6
GNU C Library stable release version 2.3.4, by Roland McGrath et al.
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.5 20051201 (Red Hat 3.4.5-2).
Compiled on a Linux 2.4.20 system on 2006-08-13.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        The C stubs add-on version 2.1.2.
        GNU Libidn by Simon Josefsson
        BIND-8.2.3-T5B
        libthread_db work sponsored by Alpha Processor Inc
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>
All 9 machines are homogeneous SunFire X4200 servers. Each machine has two 70 GB SATA hard disks configured as a RAID mirror, with a copy of the operating system installed on it. Additionally, the front-end machine has a 5-terabyte Apple Xserve RAID storage array attached, which houses the users' home directories.
Each machine has 2 dual-core 64-bit AMD Opteron processors in an SMP configuration, for a total of 4 processing cores per node.
Each machine has 8 GB of memory shared among the processing cores.
Here are the relevant commands and the corresponding outputs.
$cat /proc/cpuinfo
processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 285
stepping        : 2
cpu MHz         : 2592.664
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni
bogomips        : 5184.51
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
The block above is repeated for each of the other three cores (processors 0 through 2); only the entry for processor 3 is shown here.
$cat /proc/meminfo
MemTotal:      8046284 kB
MemFree:       3652576 kB
Buffers:        149376 kB
Cached:        3835220 kB
SwapCached:          0 kB
Active:        3442996 kB
Inactive:       661932 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      8046284 kB
LowFree:       3652576 kB
SwapTotal:     2096472 kB
SwapFree:      2096472 kB
Dirty:             140 kB
Writeback:           0 kB
Mapped:         161516 kB
Slab:           263932 kB
Committed_AS:   738012 kB
PageTables:       6672 kB
VmallocTotal: 536870911 kB
VmallocUsed:      4560 kB
VmallocChunk: 536865787 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

$df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sda1                         20G  6.1G   13G  33% /
none                             3.9G     0  3.9G   0% /dev/shm
/dev/sda4                         42G   11G   30G  27% /export
/dev/sda2                        3.9G  532M  3.2G  15% /var
tmpfs                            1.9G  4.2M  1.9G   1% /var/lib/ganglia/rrds
/dev/mapper/VolGroup01-RaidLV01  5.4T  5.0G  5.4T   1% /home
The Rocks Cluster Distribution ships with an excellent tool, Ganglia, which allows users to monitor all nodes in a cluster through a web-based interface. You are welcome to access the Ganglia monitoring page for Celeritas to have a look at the CPUs and other details.
Celeritas is primarily meant for the CSC 7600 course. As students of the course, you will be assigned an ID of the form cs7600xx and a password. This username and password will be consistent across Celeritas, SuperMike, and the online discussion forums. Your password will be a randomly generated alphanumeric sequence. Do not attempt to change it on any machine; you will only receive an error.
Access to Celeritas is restricted to SSH only. SSH is a secure, encrypted protocol. You will need an SSH client on your machine to access Celeritas.
If you run Mac OS X or any Linux distribution on your machine, you already have a built-in SSH client.
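From a Mac OS X or Linux terminal, you can connect with a command like the one below (cs7600xx is only a placeholder; use your assigned ID):

$ssh cs7600xx@celeritas.cct.lsu.edu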
The first time you log in, you will be shown the following.
username@celeritas.cct.lsu.edu's password:
Creating directory '/home/username'.
Last login: Sun Dec 10 17:21:31 2006 from px08.cct.lsu.edu
Rocks 4.2.1 (Cydonia)
Profile built 01:17 08-Nov-2006
Kickstarted 20:01 07-Nov-2006
Rocks Frontend Node - Celeritas Cluster

It doesn't appear that you have set up your ssh key.
This process will make the files:
    /home/username/.ssh/id_rsa.pub
    /home/username/.ssh/id_rsa
    /home/username/.ssh/authorized_keys

Generating public/private rsa key pair.
Created directory '/home/username/.ssh'.
Your identification has been saved in /home/username/.ssh/id_rsa.
Your public key has been saved in /home/username/.ssh/id_rsa.pub.
The key fingerprint is:
45:da:ee:54:03:d8:2a:75:c9:18:31:09:02:42:02:3e username@celeritas.cct.lsu.edu
[username@celeritas ~]$
This process generates the SSH keys that enable applications like MPICH to run correctly. Do not worry if you are not familiar with SSH keys; they are covered in a later part of this documentation.
At this point, you are logged in to Celeritas. As mentioned earlier, Celeritas runs Linux. If you are familiar with Linux command line tools, you can jump forward to the chapter on the available software. Otherwise, you are encouraged to read the chapter on familiarizing yourself with Linux.
Linux has far too many commands to be covered comprehensively here. This section is neither a complete or exhaustive command listing nor a tutorial; it provides only a very basic subset of the available Linux commands, limited to those essential to this course.
When you log in and see the following
[username@celeritas ~]$
what you are seeing is the Bash shell, a command-line interpreter. The $ is called the 'prompt', and it indicates that the shell is waiting for user input. At this point you enter commands, and the shell processes them and returns output that depends on the command.
Here are some commonly used commands worth familiarizing yourself with.
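The short listing below is only a starting point; the file and directory names are placeholders, and man <command> gives the full documentation for each.

$pwd                     # print the current (working) directory
$ls -l                   # list the files in the current directory in long format
$cd some_directory       # change into a directory; 'cd' alone returns to your home directory
$mkdir some_directory    # create a new directory
$cp source.c backup.c    # copy a file
$mv old_name new_name    # move or rename a file
$rm unwanted_file        # delete a file (there is no undo)
$cat some_file           # print the contents of a file
$less some_file          # page through a file; press 'q' to quit
$man ls                  # read the manual page for a command (here, ls)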
Table 4.1. Celeritas Software Listing
Name | Version | Location | Documentation | Source | Description |
---|---|---|---|---|---|
Linux kernel sources | 2.6.9 | /home/packages/actualKernel | | source | Has been patched with perfctr to enable hardware counters/performance monitoring |
ATLAS | 3.6.0 | /home/sources/ATLAS | docs | source | Automatically Tuned Linear Algebra Software - provides linear algebra routines, optimized for the hardware |
Myrinet drivers | 1.1.5 | /home/packages/mx-1.1.5 | docs | Available from Myri on request | |
Performance API | 3.5.0 | /home/packages/papi-3.5.0 | docs | sources | PAPI provides a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors |
Tuning and Analysis Utilities | 2.16 | /home/packages/tau-2.16 | docs | sources | TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java and Python. |
Program Database Toolkit | 3.9 | /home/packages/pdtoolkit-3.9 | docs | source | PDT is a framework for analyzing source code written in several programming languages and for making rich program knowledge accessible to developers of static and dynamic analysis tools |
Linux Performance-Monitoring Counters Driver | 2.6.22 | /home/packages/perfctr-2.6.x | | source | This package adds support to the Linux kernel (2.4.16 or newer) for using the Performance-Monitoring Counters (PMCs) found in many modern processors |
Condor | 6.8.0 | /opt/condor | docs | source | Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management |
Ganglia | 2 | /opt/ganglia | docs | source | Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters |
Maui | 3.2.5 | /opt/maui | docs | source | open source job scheduler |
Torque | 2.1.5-1 | /opt/torque | docs | source | TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project |
Intel Compilers | 9.1 | /usr/local/compilers/intel_{fc,cc}_91/ | docs | Licensed | Intel C and Fortran compilers. Only these compilers support OpenMP; the GNU compilers do not. |
MPICH1 - Ethernet - GNU | 1.2.7p1 | /home/packages/mpich1-eth-ch_p4 | docs | source | MPI implementation MPICH1 compiled with GCC/G77 over ethernet |
MPICH1 - Myrinet - GNU | 1.2.7p1 | /home/packages/mpich1-mx-ch_p4 | docs | source | MPI implementation MPICH1 compiled with GCC/G77 over myrinet |
MPICH1 - Ethernet - Intel | 1.2.7p1 | /home/packages/mpich1-eth-ch_p4-intel | docs | source | MPI implementation MPICH1 compiled with icc/ifort over ethernet. mpif90 (Fortran 90) is supported. |
MPICH1 - Myrinet - Intel | 1.2.7p1 | /home/packages/mpich1-mx-ch_p4-intel | docs | source | MPI implementation MPICH1 compiled with icc/ifort over myrinet. mpif90 (Fortran 90) is supported. |
MPICH2 | 1.0.4p1 | /home/packages/mpich2-ssm | docs | source | MPICH implementation that supports the MPI-2 specifications. This package has been compiled with the GNU compiler, for use with the ethernet interconnect. It uses the sockets and shared memory communication method. |
Additionally, the software sources are available under /home/source_listing.
The sources for the Myrinet drivers are available from the Myricom website, http://www.myri.com. These sources are not available under /home/source, as you have to request access to them from Myricom.
The driver header files, libraries and related binaries are present in /home/packages/mx-1.1.5. You are welcome to read the README in the bin subdirectory of the above folder, and execute some of the benchmarks within the folder to learn more about the Myrinet interconnect. There are tests within the bin/tests subdirectory that allow users to measure latency and bandwidth performance.
Of particular interest are the mx_pingpong and mx_stream commands within the bin/tests directory; see the README for more details. There are additional tools that report various details of the myrinet connection, such as network bandwidth and latency.
PBS (Portable Batch System) is a job scheduler. The version of PBS installed on Celeritas is the TORQUE Resource Manager (Terascale Open-Source Resource and QUEue Manager), an open source fork of OpenPBS version 2.3.12 maintained by Cluster Resources. Torque is responsible for scheduling jobs for execution across the networked Celeritas environment.
You should not run your executables directly on the head node. Remember, you are not the only user on the cluster. In order to ensure that every student gets his/her fair share of the CPU time, you should always submit your job to the queue and let the resource manager handle the requests.
In order to submit your job to the queue, you need to put all required details in a script file and submit that file. A script file is nothing more than a plain text file containing certain commands and configuration details. Let's jump right in and start writing our PBS script. While doing so, let's add as many PBS directives as possible, so that this file can serve as a quick and handy reference.
# My program's name
#PBS -N name_of_submitter

# Request 0 hours, 5 minutes, 0 seconds.
#PBS -l walltime=00:05:00

# The output of stdout is sent to outputFile
#PBS -o outputFile

# The output of stderr is sent to errorFile
#PBS -e errorFile

# If the job fails, DO NOT rerun it
#PBS -r n

# Request 4 nodes, and 2 processors (out of the available 4) on each node.
#PBS -l nodes=4:ppn=2

## Each comment starts with two '#' characters, and each directive to PBS starts with '#PBS'.
## Immediately after the lines containing the PBS directives, enter the commands you want executed.
## Type in the same statements that you would type if you were executing them at the command line.

## Let's have some sample commands
ls
hostname
## The output of the above commands is redirected to outputFile.

## While your programs are executing, they have access to certain environment variables that you will
## need to reference. The most important of them, and the only one we will need, is PBS_NODEFILE.
## Let's list the contents of that file
cat $PBS_NODEFILE

# If we were running an MPI program compiled with mpicc, we would execute the following:
# mpirun -np Q -machinefile $PBS_NODEFILE name_of_executable
# The -np argument gives the number of processes to spawn, and this is generally the product of the
# number of nodes you request and the processors per node you requested earlier.
# The -machinefile argument, $PBS_NODEFILE, comes from the PBS environment, and it lists the machines
# that the PBS scheduler has assigned to your code.

## Just to give the impression that we are "computing" for a while, let's ask the program to sleep briefly
sleep 20
## This will give us time to test MAUI commands while our program is "executing"
Obtain the above file by executing
$wget http://cct.lsu.edu/~hsunda3/samples/sample.pbs
(copy and paste the above command in your terminal to obtain the file)
You can either create a new batch file or use sample.pbs. Let's submit this job to the queue and let it "compute".
$qsub sample.pbs
At this point, you could execute
$qstat
to view a listing of the currently submitted jobs to the queue.
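If you need more detail, or want to remove a job you submitted by mistake, the standard Torque commands sketched below should work (the job ID is a placeholder; use the ID reported by qstat):

$qstat -a          # fuller listing, including requested nodes and walltime
$qstat -u $USER    # show only your own jobs
$qdel 123          # delete job 123 from the queue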
While TORQUE is the resource manager used on Celeritas, the actual scheduling of jobs on the cluster is the responsibility of MAUI, a cluster scheduler also built and distributed by Cluster Resources. MAUI integrates with Torque and runs the commands in your PBS submit script on the machines allocated to it by Torque.
While Maui functions transparently to the user, there are a couple of Maui commands that you will find useful:
$showq             (list all jobs currently known to the scheduler)
$showstart jobid   (show an estimate of when the given job will start)
$canceljob jobid   (cancel the given job)
$checkjob jobid    (display detailed status information for the given job)
Once the job has finished executing, the output and error files will be available in the directory from which you submitted the job.
The OpenMP API supports multi-platform shared-memory multiprocessing programming in C/C++ and Fortran. It consists of a set of compiler directives, library routines, and environment variables.
Here is a sample code that uses OpenMP directives.
#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    int id, nthreads;

    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        printf("Hello World from thread %d\n", id);
        #pragma omp barrier
        if ( id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return 0;
}
Obtain the above file by executing
$wget http://cct.lsu.edu/~hsunda3/samples/openmp.c
(copy and paste the above command in your terminal to obtain the file)
On Celeritas, the Intel Compilers support OpenMP. gcc has OpenMP support from version 4.2.0 onwards, and Celeritas has gcc version 3.4.6.
Compile your program and acquire your executable using
$icc -o executable -openmp openmp.c
Now we submit our executable to PBS. Let's reuse the earlier script; however, we need to make a few modifications.
Remember, OpenMP is for shared memory programming. Therefore we need to request only one node, but more than one processor on that node.
This is accomplished by the following line:
#PBS -l nodes=1:ppn=4
This ensures we get only one node, but more than one processor sharing memory on that node.
Next, we need to tell OpenMP how many threads we intend to use. This is set by using the environment variable OMP_NUM_THREADS.
Include this line in your PBS script:
export OMP_NUM_THREADS=4
to set 4 threads. This value, of course, is determined by the number of processors you requested in the PBS script.
A neater way to do this is to use the PBS environment variable $PBS_NODEFILE. That file lists the machines where you have been given processor time, one line per processor. The number of lines in that file therefore tells you the number of processors you have acquired, and it is a good idea to set the number of threads to that value. That is accomplished by this line:
export OMP_NUM_THREADS=`cat $PBS_NODEFILE | wc -l`
The final line in your script is of course the name of the executable itself.
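Putting the pieces together, a complete OpenMP submission script might look like the sketch below. It assumes the executable is named 'executable', as in the icc command above; cd $PBS_O_WORKDIR is the standard PBS idiom for moving to the directory the job was submitted from.

#PBS -N openmp_hello
#PBS -l walltime=00:05:00
#PBS -l nodes=1:ppn=4
#PBS -o outputFile
#PBS -e errorFile

## Move to the directory the job was submitted from
cd $PBS_O_WORKDIR

## One thread per processor that PBS allocated to us
export OMP_NUM_THREADS=`cat $PBS_NODEFILE | wc -l`

## Run the OpenMP executable built with icc
./executable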
Submit this file to PBS and wait for your output.
MPICH is a freely available, portable implementation of the MPI standard (message passing for distributed memory applications) developed at the Argonne National Laboratory.
On Celeritas, 4 different variations of the MPICH implementation exist. The implementations were compiled with either the GCC or the Intel compilers, and for either the ethernet interconnect or the myrinet interconnect.
The choice of implementation is made by setting a variable in your ~/.bashrc file. Read through the comments in the file, and choose the desired implementation by setting the mpi_path variable appropriately.
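The exact lines depend on how your ~/.bashrc is laid out, so treat the following only as an illustrative sketch of selecting the GNU/ethernet MPICH1 build; the path is taken from the software table, and the assumption is that the rest of the file uses mpi_path to set your PATH.

# In ~/.bashrc: point mpi_path at the desired MPICH installation
mpi_path=/home/packages/mpich1-eth-ch_p4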
Here's a sample program listing that can be used to test the MPI implementation.
#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
    int rank, length;
    char name[BUFSIZ];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &length);

    printf( "Hello, World! Processor -> %s Rank-> %d!\n", name, rank);

    MPI_Finalize();
    return 0;
}
Obtain the above file by executing
$wget http://cct.lsu.edu/~hsunda3/samples/mpi.c
(copy and paste the above command in your terminal to obtain the file)
Let's compile it and get our executable
$mpicc -o executable mpi.c
Now, before we run it, we need to get our PBS script ready. Note that any command you executed earlier to set up the MPI environment will have to be entered in the PBS script as well.
While most details of the PBS script remain the same, note that the processors on which we want to run our executable are listed in $PBS_NODEFILE. This file must therefore be passed as an argument to mpirun.
mpirun -np `cat $PBS_NODEFILE | wc -l` -machinefile $PBS_NODEFILE executable
Again, `cat $PBS_NODEFILE | wc -l` simply counts the number of lines in $PBS_NODEFILE, which is nothing but the product of the number of nodes you requested and the number of processors per node.
The -np switch indicates the number of processes to spawn, which is taken from the number of lines in the PBS_NODEFILE file, and the -machinefile switch gives the hostnames of the machines on which to run the MPI program.
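For reference, a minimal MPICH1 submission script might look like the following sketch. It assumes the executable was compiled as shown above and that the mpirun from your chosen MPICH1 build is on your PATH via ~/.bashrc.

#PBS -N mpi_hello
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=2

cd $PBS_O_WORKDIR

## One MPI process per processor assigned by PBS
mpirun -np `cat $PBS_NODEFILE | wc -l` -machinefile $PBS_NODEFILE ./executable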
Submit your job to PBS, and wait for the output.
MPI-2 is an extension to the originally developed MPI-1 specification. MPICH2 is, again, a freely available library developed by Argonne National Laboratory. There is no MPI-2 implementation from Myricom for the myrinet interconnect yet, and therefore MPICH2 on Celeritas uses the ethernet interconnect only.
By definition, all MPI-1 programs are valid MPI-2 programs as well. Consequently, we can continue to use our existing mpi.c source code for trying out MPICH-2 as well.
There have been some major changes in the way MPI processes are spawned under the MPI-2 protocol. In order to spawn processes, you need to start mpd, a daemon that runs in the background and that your processes connect to.
Here are typical MPI-2 related PBS script commands.
mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
mpiexec -n `cat $PBS_NODEFILE | wc -l` a.out
mpdallexit
The first line starts the daemons. The --totalnum flag gives the number of machines on which you want mpd started. Since $PBS_NODEFILE lists each machine once per processor, we need the number of unique machines in the file; that is what the pipe through uniq and wc computes. The -f flag names the file listing the machines. By default mpd is started only once on each machine, so this file can be passed directly to -f.
The second line executes your program. Note the use of mpiexec as opposed to mpirun. While mpirun is provided for legacy purposes, mpiexec is the preferred way to spawn processes, as it correctly ties in with the mpd daemons you started earlier. The -n flag gives the number of processes that you want started.
The final line simply shuts down the mpd daemons you started earlier.
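Wrapped in PBS directives, a complete MPICH-2 job script might look like the sketch below (it assumes mpi.c was compiled to a.out with the MPICH2 mpicc):

#PBS -N mpich2_hello
#PBS -l walltime=00:05:00
#PBS -l nodes=4:ppn=2

cd $PBS_O_WORKDIR

## Start one mpd per unique machine, run the job, then shut the daemons down
mpdboot --totalnum=`cat $PBS_NODEFILE | uniq | wc -l` -f $PBS_NODEFILE
mpiexec -n `cat $PBS_NODEFILE | wc -l` ./a.out
mpdallexit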
PAPI is a tool that enables programmers to see the relationship between software performance and processor events, and it is widely used to collect low-level performance metrics. PAPI provides predefined high-level hardware events summarized from popular processors, as well as direct access to the low-level native events of a particular processor. Counter multiplexing and overflow handling are also supported.
Operating system support for accessing hardware counters is needed to use PAPI. The kernel that Celeritas runs has been patched for perfctr support.
You are welcome to read the documentation at (I know, I haven't added it. Will do it after the exams)
Here's a sample code taken from the PAPI documentation and instructions on how to compile using the Performance API.
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    const PAPI_hw_info_t *hwinfo = NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);

    if ((hwinfo = PAPI_get_hardware_info()) == NULL)
        exit(1);

    printf("%d CPU's at %f Mhz.\n", hwinfo->totalcpus, hwinfo->mhz);
    return 0;
}
Obtain the above file by executing
$wget http://cct.lsu.edu/~hsunda3/samples/papi.c
(copy and paste the above command in your terminal to obtain the file)
$gcc -I $PAPI_INC -L $PAPI_LIB papi.c -lpapi
In order to compile this with GCC, the compiler needs to be told where the include files and the necessary libraries are. The necessary variables $PAPI_LIB and $PAPI_INC are predefined.
Run the executable inside a PBS script as usual.
PAPI supports MPI programs as well. The following is an example program that uses PAPI together with MPI over the myrinet interconnect; for use with the other MPICH and MPICH-2 builds, the PBS script can be modified appropriately.
#include <papi.h>
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal error handler; not part of the original PAPI example. */
static void handle_error(int code)
{
    fprintf(stderr, "PAPI error %d\n", code);
    exit(1);
}

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i, retval, EventSet = PAPI_NULL;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    long_long values[1] = {(long_long) 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Initialize the PAPI library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        exit(1);
    }

    /* Create an EventSet */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK)
        handle_error(1);

    /* Add Total Instructions Executed to our EventSet */
    if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK)
        handle_error(1);

    /* Start counting */
    if (PAPI_start(EventSet) != PAPI_OK)
        handle_error(1);

    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            break;

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }

    /* Read the counters */
    if (PAPI_read(EventSet, values) != PAPI_OK)
        handle_error(1);
    printf("After reading counters: %lld\n", values[0]);

    /* Stop the counters */
    if (PAPI_stop(EventSet, values) != PAPI_OK)
        handle_error(1);
    printf("After stopping counters: %lld\n", values[0]);

    MPI_Finalize();
    return 0;
}
Obtain the above file by executing
$wget http://cct.lsu.edu/~hsunda3/samples/papi_mpi.c
(copy and paste the above command in your terminal to obtain the file)
As was the case with MPI earlier, you will have to set the mpi_path variable in your ~/.bashrc.
In addition, in order to compile the program, you will have to invoke mpicc with the same arguments that you passed to gcc. mpicc is merely a wrapper around gcc that links in the appropriate MPI libraries.
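In other words, the compile line should look something like the sketch below (the output name papi_mpi is arbitrary):

$mpicc -I $PAPI_INC -L $PAPI_LIB papi_mpi.c -lpapi -o papi_mpi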
Condor is a software framework for coarse-grained distributed parallelization of computationally intensive tasks. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers, so-called cycle scavenging.
Condor is developed by the Condor team at the University of Wisconsin-Madison and is freely available for use.
Condor can run both sequential and parallel jobs. Sequential jobs can be run in several different "universes", including "vanilla" which provides the ability to run most "batch ready" programs, and "standard universe" in which the target application is re-linked with the Condor I/O library which provides for remote job I/O and job checkpointing. Condor also provides a "local universe" which allows jobs to run on the "submit host".
On Celeritas, only the Vanilla universe is installed. Consequently, you can't link your programs with Condor libraries, and Condor on Celeritas doesn't support checkpointing and remote system calls.
To get started with a quick example, create a simple Hello World program and compile it to an executable named simple. Now create a submit script for Condor:
Universe   = vanilla
Executable = simple
Arguments  = <if you have any command line arguments to pass to your executable>
Log        = Simple.log
Output     = Simple.out
Error      = Simple.error
Queue
Obtain the above file by executing
$wget http://cct.lsu.edu/~hsunda3/samples/sample.condor
(copy and paste the above command in your terminal to obtain the file)
Now submit this script by executing
$condor_submit sample.condor
and watch it enter the queue with
$condor_q
Additionally, the condor_status command reveals more information as well.
Eventually your job will complete; all statistics will be logged in Simple.log, and the output will appear in Simple.out. The log file will also tell you where (on which node) your job was executed. Additionally, you will be informed that you have new mail; you can ignore this message.
ATLAS is a software library for linear algebra. It provides an open source implementation of the BLAS APIs (and a subset of LAPACK) for C and Fortran 77.
The required header files and the libraries for compiling programs that use the ATLAS routines are available in /home/sources/ATLAS/include and /home/sources/ATLAS/lib/Linux_HAMMER64SSE2_4 respectively.
One important point to note while compiling programs is the order in which the libraries are linked. Because of the dependencies between them, the libraries must be given in the order liblapack, libcblas (for C programs), and libatlas.
An example will make this clear.
#include <stdio.h>
#include <atlas_enum.h>
#include "clapack.h"

double m[] = { 3, 1, 3,
               1, 5, 9,
               2, 6, 5 };

double x[] = { -1, 3, -3 };

int main()
{
    int ipiv[3];
    int i, j;
    int info;

    for (i=0; i<3; ++i) {
        for (j=0; j<3; ++j)
            printf("%5.1f", m[i*3+j]);
        putchar('\n');
    }

    info = clapack_dgesv(CblasRowMajor, 3, 1, m, 3, ipiv, x, 3);
    if (info != 0)
        fprintf(stderr, "failure with error %d\n", info);

    for (i=0; i<3; ++i)
        printf("%5.1f %3d\n", x[i], ipiv[i]);

    return 0;
}
Obtain the above file by executing
$wget http://cct.lsu.edu/~hsunda3/samples/algebra.c
(copy and paste the above command in your terminal to obtain the file)
The above code uses the clapack_dgesv routine from the LAPACK library and the atlas_enum header file. To let gcc know where the header files and libraries are located, use the following syntax.
gcc -I $ATLAS_INC algebra.c -L $ATLAS_LIB -llapack -lcblas -latlas
The ATLAS_INC and ATLAS_LIB environment variables have been defined for you when you log in to the system. Additionally, you will see this again while running the LINPACK benchmark, which also uses the lapack, cblas and atlas libraries.
I would like to sincerely thank Dr. Thomas Sterling and Dr. Maciej Brodowicz for giving me this opportunity to play around with such powerful toys. It has truly been a wonderful experience watching Celeritas grow and evolve from a single Sun server that wouldn't allow me to install Red Hat to a fully functional HPC machine with a variety of tools. In Dr. Sterling's own words, I am probably one of the richest undergrads on campus in terms of computing power, and I am indeed very grateful for that.
No amount of gratitude will be sufficient for Ravi Parachuri and Sridhar Karra here at CCT, without whom Celeritas probably could not have been a 'cluster' in any definition of the word. They have been of great help throughout the time I have been associated with the cluster.
A great deal of thanks must also go to the team that created the ROCKS cluster distribution. The distribution certainly enhanced the ease with which the cluster was set up. Naturally, no project of such magnitude exists without glitches, and the members of the npaci-rocks-discuss mailing list have been most helpful in assisting me in troubleshooting any problems that arose along the way.
And of course, my greatest thank you's to Dr. Gabrielle Allen for having provided me this opportunity to be associated with CCT and work on something this exciting.