Marine Ecosystem Dynamics Modeling Laboratory

Running FVCOM on a Linux cluster

Hardware requirements 

This treatise will help an FVCOM user set up and run models on a Beowulf-like computing cluster. It is by no means comprehensive and should therefore serve only as a guide. Use your judgment when conflicts arise. There is a great deal of useful information available on the web, and it is a good idea to Google-search error messages and topics that seem unclear.
Begin by purchasing a few good machines from a company that provides high-quality tech support. A Linux operating system performs well and is assumed throughout this document; however, a Unix or IRIX system may also be used. Of course, you want the fastest processors available (dual processors are nice) and as much memory as you can fit into each machine. When deciding how many computers are necessary for a good cluster, keep in mind that more is not always better: you are looking to balance model complexity against the number of machines communicating, since both computation and communication take time.
With that said, you especially do not want to skimp on the cluster communications hardware. Purchase the fastest network cards that you can afford; the choices available at this time are Ethernet, fast Ethernet, gigabit Ethernet, Myrinet, and InfiniBand. A dedicated multiport switch (for example, an HP ProCurve) designed to be used with the network card of choice is essential. The goal is to create a low-latency communication system between the machines (nodes) in your cluster.
One node of the cluster will be designated as the master. This is where the software will be loaded, compilations performed, and output dumped; it may therefore need extra memory. The master node will be the only node capable of communicating with the outside world, so it requires an extra Ethernet card and a fixed IP address. The master must be connected to a monitor and keyboard. The other nodes, referred to as slaves, may only communicate with each other and with the master node.

The Cluster Communication framework

Network Setup

Begin by setting up the network interface and editing some basic security files, which will allow node-to-node communication. For a short time you will be required to switch the monitor and keyboard to each node as the setup is performed.
While the keyboard and monitor are connected to the master node, choose the hostname master. Set it by typing:
>>hostname master

Set the hostname for each node using the same process. Choose a system name to be used in the IP addresses of your cluster (in this case, “my” is the system name). Use the ifconfig utility to configure the network cards and IP addresses of each node (a sketch of the ifconfig commands follows the listing below), and record the hostname-to-address mappings in the /etc/hosts file of each node. It should read like this:
127.0.0.1 localhost.localdomain localhost
129.84.74.202 my.msrc.sunysb.edu my
192.168.1.200 master.my.cluster master
192.168.1.101 node1.my.cluster node1
192.168.1.102 node2.my.cluster node2
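A minimal sketch of the ifconfig step, assuming the internal cluster interface is eth0 on the slaves and eth1 on the master (whose eth0 carries the fixed external address); adjust the interface names to match your hardware:
>>ifconfig eth1 192.168.1.200 netmask 255.255.255.0 up   # on the master
>>ifconfig eth0 192.168.1.101 netmask 255.255.255.0 up   # on node1
>>ifconfig eth0 192.168.1.102 netmask 255.255.255.0 up   # on node2
Settings made this way do not survive a reboot; make them permanent in your distribution's network configuration files (for example, /etc/sysconfig/network-scripts/ on Red Hat-style systems).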

Once the IP addresses are configured, allowances must be made in the security system of each node.
To do this, edit the /etc/hosts.allow file on each node to read:
>>ALL: .my.cluster

For security reasons, be sure the /etc/hosts.deny file reads:
>>ALL:ALL

This section deals only with enabling cluster communication; real intruder-security issues must be handled by a system administrator.

Enabling Password-free SSH

At this point you can access the slave nodes from the master using either the remote shell (rsh) or the secure shell (ssh) protocol. It is advisable to use ssh, as it is inherently more secure. Root privileges are not necessary, so this is also a good time to create user accounts. Logged into the master node, a user should type:
>>ssh-keygen

This will generate a private key and a public key. The public key is then copied into the authorized_keys file of the other nodes.
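A minimal sketch of the key distribution, assuming an RSA key with the default file name id_rsa.pub and that the home directories are not yet shared over NFS:
>>ssh node1 'mkdir -p ~/.ssh && chmod 700 ~/.ssh'
>>cat ~/.ssh/id_rsa.pub | ssh node1 'cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
>>ssh node1 hostname   # should print node1 without prompting for a password
Repeat for node2 and, if node-to-node logins are needed, from each node as well.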

Constructing a Network File Sharing System

Enable the Network File System (NFS) by first creating /master/home and /master/usr directories off the root of each slave node. Edit the /etc/exports file on the master to export those two partitions to the slave nodes:
/home node1.my.cluster(rw) node2.my.cluster(rw)
/usr node1.my.cluster(rw) node2.my.cluster(rw)

Edit the master's /etc/fstab file:
node1.my.cluster:/home /node1/home nfs defaults 1 2
node1.my.cluster:/usr /node1/usr nfs defaults 1 2

node2.my.cluster:/home /node2/home nfs defaults 1 2
node2.my.cluster:/usr /node2/usr nfs defaults 1 2

NFS uses portmap; therefore, after these changes are made, it is necessary to restart both the portmap and NFS daemons (a sketch follows this paragraph).
Repeat this NFS setup on each node, exporting both the /home and /usr directories of each node to the master and to all the other nodes.
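On a Red Hat-style system the restart might look like this (init script names vary by distribution):
>>/etc/init.d/portmap restart
>>/etc/init.d/nfs restart
>>exportfs -ra        # re-read /etc/exports
>>showmount -e master # run on a slave to verify that the master's exports are visible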

Mount all the file systems available using the command:
>>mount -a

Installing Message Passing

Let me begin with a note about compilers. The FVCOM/MPI system utilizes a Fortran 90 (F90) compiler. If you have an Intel processor, you can download a free version of the Intel F90 compiler from http://www.intel.com/software/products/compilers/. It is recommended that the C compiler be compatible with the Fortran compiler; the free Intel C compiler can be downloaded from the same web site. Be sure to install them in a logical location such as /usr/opt/intel/. Create the corresponding directory on each of the nodes and provide a soft link back to the master. Test these compilers by running a simple Hello World program.
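For example, a minimal check of the Fortran compiler (the file name is arbitrary): create a file hello.f90 containing

program hello
print *, 'Hello from ifort'
end program hello

then compile and run it:
>>ifort hello.f90 -o hello
>>./hello
A one-line C program compiled with icc provides the same check for the C compiler.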

Source the compilers before you configure the message-passing software (MPICH). These two lines can be placed temporarily in the user's .bashrc file or entered at the command line before configuring MPI:
>>source /usr/opt/intel/intel_cc_80/bin/iccvars.sh
>>source /usr/opt/intel/intel_fc_80/bin/ifortvars.sh

The message-passing software is available free of charge from http://www-unix.mcs.anl.gov/mpi/mpich. Download the latest version and bookmark the users page (you will be back). Unpack and untar the downloaded distribution in a location such as /usr/local/mpich. A few carefully chosen environment variables are required to properly install the message-passing software. The syntax used here is for a bash shell; use setenv for a C shell. These are the recommended settings:
>>export CC=icc
>>export FC=ifort
>>export FFLAGS="-O4"
>>export RSHCOMMAND=/usr/bin/ssh
>>export CCFLAGS="-fast -DUSE_U_INT_FOR_XDR -DHAVE_RPC_H=1"
>>export F90FLAGS=" -O4"
>>export F90=ifort

Next, go to the build directory and try to configure:
>>./configure --with-comm=shared --with-device=ch_p4

You may also need to add:
--enable-sharedlib --fc=/usr/opt/intel/intel_fc_80/bin/ifort

If you are not happy with the configuration, it can be run again; this will result in changes to the mpich config.status file, which you should examine carefully. If you must reconfigure after running make install, rerun make clean, make, and make install.

If you have problems, check the config.log file. Otherwise, run:
>>make clean
>>make
>>make install

This will install the newly built libraries and scripts in the directory specified during configuration. This process installs the MPI software on the master node only; the corresponding /usr/local/mpich/ directory should be created on each of the nodes and a soft link made through the NFS system (mpich -> /master/usr/local/mpich) back to the master.
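A minimal sketch of the per-node link, run from the master (assuming the master's /usr is mounted on each slave under /master/usr as set up in the NFS section):
>>ssh node1 'ln -s /master/usr/local/mpich /usr/local/mpich'
>>ssh node2 'ln -s /master/usr/local/mpich /usr/local/mpich'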

Testing MPI

MPI applies wrappers to the compilers (i.e., mpicc and mpif90), which will be used throughout the MPI application. It is easy for the MPICH configure step to attach the wrapper to the wrong compiler; this is particularly a problem on Linux with the default gcc compiler. To check this, compile and run the cpi.c and fpi.f example programs that come with the MPICH distribution, using mpicc and mpif90.
“Error while trying to run a simple c program”
Meaning: The wrappers are probably placed incorrectly.
Before running an MPI program, you must edit the MPICH machine file /usr/local/mpich/util/machines/machines.LINUX and make sure that all nodes are listed there. The order in which they are listed will be the order of processor allocation (a sample listing follows below).
A user should create an mpitest directory in their /home directory on the master and on all other nodes. Create symbolic links from the mpitest directories on the nodes to the corresponding directory on the master.
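For the two-slave cluster used in the examples above, the machines.LINUX file mentioned earlier might simply read (one hostname per line; a dual-processor node can be listed as hostname:2):
master
node1
node2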

Be sure that /usr/local/mpich/bin is in the user's $PATH before proceeding. In the user's mpitest directory, run tstmachines -v (a sketch of these steps follows below). You will know if it works.
“P4error child process exited while making connection to master”
Meaning: Be sure that you are using the correct hostnames (/etc/hosts) which correspond to the internal cluster IP addresses.
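A sketch of the PATH setup and test run, assuming a bash shell and the MPICH install location used above:
>>export PATH=/usr/local/mpich/bin:/usr/local/mpich/sbin:$PATH   # sbin included because tstmachines is installed there in some MPICH versions
>>cd ~/mpitest
>>tstmachines -v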

Once you get this far, you are assumed to be running on all machines within the cluster, and you can now divide a calculation up amongst the nodes. The earlier mpicc cpi.c compilation generated the default executable a.out; here, a.out will calculate Pi to 17 figures. To calculate using 4 processors within the cluster, run:
>>mpirun -np 4 ./a.out

If you have hyperthreading activated on the processors, deactivate it. This is also a good time to do some bandwidth and latency tests. When testing for maximum bandwidth, Linux systems can leave shared-memory segments and semaphores behind.
“P4 error semget failed for setnum:0”
Meaning: Leftover semaphores need to be removed.
Run
>>cleanipcs

Installing FVCOM

Get the latest version of FVCOM from http://fvcom.smast.umassd.edu/download/. Unpack it in a reasonable location such as /usr/local/fvcom.
>>gunzip -c FVCOMxxx | tar xvf -

Create an fvcom directory, with soft links, on all other nodes. Also, remember to put /usr/local/fvcom/FVCOM#.##/FVCOM_source in the user's $PATH.
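For example (replace #.## with your actual version number):
>>ssh node1 'ln -s /master/usr/local/fvcom /usr/local/fvcom'
>>ssh node2 'ln -s /master/usr/local/fvcom /usr/local/fvcom'
>>export PATH=/usr/local/fvcom/FVCOM#.##/FVCOM_source:$PATH   # add this line to ~/.bashrc to make it permanent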

NetCDF may be installed prior to generating the FVCOM makefile; however, it is prudent to test FVCOM first, then install NetCDF, run make clean, and finally generate a new FVCOM makefile.

Edit the METIS.source/makefile so that the CC variable points to your C compiler, then run:
>>make

Make produces a libmetis.a file and an FVCOM_source/makefile.

You need to edit the FVCOM_source/makefile for your needs. Detailed directions are given in the README file. Below you will find some additional tips.

Begin by specifying the shell that you are using:
#-----------BEGIN MAKEFILE--------------------------------------------------
SHELL = /bin/bash    # if you use a bash shell; otherwise use /bin/csh

Next, a series of user-defined selections are uncommented or modified as listed below:
#-------BEGIN USER DEFINITION SECTION---------------
Under the MULTI_PROCESSOR section, uncomment the FLAG_3 = -DMULTIPROCESSOR option.

In the next section, enable the NetCDF options if you are going to use NetCDF now:
#------------NETCDF OUTPUT   DUMP OUTPUT INTO NETCDF FILES (yes/no)------
FLAG_5 = -DNETCDF_IO
IOLIBS = -L/usr/local/lib -lnetcdf
IOINCS = -I/usr/local/include

Select your platform from the list under the section
#-------------------SELECT COMPILER/PLATFORM SPECIFIC DEFINITIONS----------
and edit those definitions as well. In this case the processors are Intel:
#--------------Parallel Intel Compiler Definitions (SMAST)--------------
CPP = /usr/bin/cpp
CPPFLAGS = $(DEF_FLAGS) -DINTEL
FC = /usr/local/mpich/mpich-1.2.6/bin/mpif90
DEBFLGS = #-check all
OPT = -fast -DUSE_U_INT_FOR_XDR -DHAVE_RPC_RPC_H=1
CLIB = -static-libcxa

Do not make any changes after the ----END USER DEFINITION SECTION-----.

Installing NetCDF

Network Common Data Form (NetCDF) is an interface that creates portable, self-describing output from array-oriented data. The software is freely available from http://my.unidata.ucar.edu/content/software/netcdf/index.html. Download and unpack the tarfile to the netCDF/src directory.
While installing this software, set the following environment variables:
>>export CPPFLAGS='-DpgiFortran -DNDEBUG'
>>export FC=ifort
>>export FFLAGS='-O -mp -static-libcxa'
>>export F90=ifort
>>export FLIBS=-Vaxlib
>>export CXX=c++
>>export CC=gcc

In the /usr/local/netcdf/netcdf-#.#.1/src directory, type:
>>./configure --prefix=/usr/local

The bin, include, lib, and man directories will be installed in the /usr/local directory, which is probably in the $PATH already. A config.log file will be generated and should be used for troubleshooting. Now run:
>>make
>>make test
>>make install

It is now possible to go back and reconfigure FVCOM to use the NetCDF option. A good test is to run
>>mpirun -np 4 fvcom chn

FVCOM will dump the model output into the OUTDIR_mymodel/netcdf directory. The files will have a .nc extension. The output may be transferred to the desktop with sftp, the secure file transfer protocol (a sketch follows this paragraph).
The NetCDF output files are easily manipulated with Matlab, which may be installed on the cluster or, better yet, on a desktop workstation. For Matlab to read the .nc files, the NetCDF toolbox must be on the toolbox path. The appropriate NetCDF toolbox for Matlab can be downloaded from http://woodshole.er.usgs.gov/staffpages/cdenham/public_html/MexCDF/nc4ml5.html.
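A minimal sketch of the sftp transfer mentioned above, run from the desktop (the user name and output file name are hypothetical; the address is the master's external one from /etc/hosts):
>>sftp username@my.msrc.sunysb.edu
sftp> cd OUTDIR_mymodel/netcdf
sftp> get mymodel_0001.nc
sftp> quit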

If that works, you can begin running your FVCOM models in multiprocessor (parallel) mode. One cannot expect a program that executes in time t on one processor to run in exactly t/n time on n processors, because the computation cannot always be divided into identical pieces and it takes considerable time for the processors to communicate with each other. Very simple models may not execute in less time as n increases, because the loss due to increased processor communication overwhelms the computational gains.
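As a rough, idealized illustration of this trade-off (not a measured result), the run time on n processors can be written as
T(n) ≈ T_comp/n + T_comm(n)
where T_comp is the single-processor computation time and T_comm(n) is the time spent on inter-processor communication, which tends to grow with n; the speedup T(1)/T(n) therefore always falls short of n, and for very small models the communication term can dominate entirely.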

There are User Guides available for the Intel compilers, Mpich, METIS, FVCOM and NetCDF. They are a valuable resource.

Special thanks to Matthew Jones, Geoffrey Cowles and Heather Crowley.
——————————————————————
By: Maureen Dunn 11/19/2004
