"Multi-physics Simulations on Distributed Computer Resources"
Dr. Christopher J. Freitas
Principal Engineer - Computational Mechanics
Southwest Research Institute
San Antonio, TX 78238-5166
E-mail: cfreitas@swri.edu
URL: http://www.swri.org

A Perspective

Southwest Research Institute (SwRI) has worked in the area of high-performance parallel computing since the late 1980s. Our work has encompassed MPP-class computing, in conjunction with our relationships with the Department of Energy laboratories (i.e., Sandia National Laboratories and Lawrence Livermore National Laboratory), and distributed parallel computing based on Parallel Virtual Machine (PVM), Message Passing Interface (MPI), and high-speed networks. Although SwRI has extensive experience with traditional MPP-class computing, we have focused from the very beginning on fully distributed parallel computing.

Early in our research we had extensive interactions and cooperation with nearly every MPP hardware vendor (e.g., Intel, IBM, Convex, Cray, KSR, etc.). All of these vendors offered collaboration and free use of their latest generation of parallel computers. Yet with all of these interactions and collaborations, we were left uneasy about the direction and focus of the HPC vendor community with regard to parallel computing. The vendors seemed to be operating in the vector-computing mode in which "big-iron" computers were the norm, focused on packing the largest number of CPUs into a single- or multiple-cabinet computer rather than addressing the true nature of the parallel computing paradigm. It was, and still is, the view of SwRI that the strength of parallel computing is not necessarily in having the next biggest and best machine, but rather in taking advantage of the aggregate computing power of existing facilities, supplementing that hardware where required. America and American industry are becoming ever more cost conscious. What they want is the ability to use existing compute cycles, supplemented by moderate-cost improvements, to achieve improved computational speed. The issue is no longer speed-up versus number of compute nodes, but rather the reduction in the wall-clock time it takes to get the compute job done. If one can achieve even a factor-of-two reduction in wall time in a real application, this can translate into enormous savings in man-hours and provide potentially more accurate solutions at lower cost.

The validity of our belief has been borne out in a large number of externally funded research programs at SwRI in which distributed parallel computing methods are being applied. In particular, SwRI's vision of parallel computing is being refined for NASA applications through our current project on Interactive Meta-Computing Technologies for Industrial Deployment, NASA Contract No. 961000.

Research Activities

SwRI has focused on the technologies required for the efficient use of workstation clusters connected by high-speed networks to solve real-world engineering and science problems. This research has been based on the message-passing paradigm, using primarily the Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) environments. These tools have been coupled to numerical techniques such as domain decomposition, in which complex geometries are broken into simpler subregions for which independent grid systems may be created. These independent grid systems, or operators, are then used to solve the governing partial differential equations in an ordered fashion, directly accounting for the internal boundary interfaces that now exist between grid blocks or between the data structures required by the multiple operators (the code sketch below illustrates such an interface exchange).

Coupled to these techniques for the solution of the governing equations has been the development of new methods for scientific visualization, in which computational data may be retrieved during the course of the computation for real-time evaluation. The kernel of this visualization system is a data cache, for which tools for cache manipulation have been developed at SwRI.

In addition to the software methodologies developed for equation solving and scientific visualization, SwRI has also created a second-generation ATM testbed. ATM, or Asynchronous Transfer Mode, is a high-speed, switch-based network technology that promises performance more than an order of magnitude beyond that of standard or switched Ethernet and other competing technologies. The SwRI ATM system connects several clusters of workstations, two MPP computers, one second-generation Distributed Array Processor (DAP) computer, and two multiprocessor graphics systems via multiple FORE ASX-1000 and ASX-200 ATM switches and dual OC-3c optical links.
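As a concrete illustration of the message-passing side of this approach, the following is a minimal sketch, in C with MPI, of a one-dimensional domain decomposition in which neighboring grid blocks exchange a single layer of ghost cells across their internal boundary interfaces. The field name, block size, and update stencil are illustrative assumptions, not SwRI's actual solver code.

```c
/* Minimal sketch of domain decomposition with MPI: a 1-D slab
 * decomposition of a scalar field, exchanging one layer of ghost
 * cells across the internal boundary interfaces between grid blocks.
 * Names and sizes are illustrative only. */
#include <mpi.h>
#include <string.h>

#define NLOCAL 100                /* interior cells owned by each rank */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double u[NLOCAL + 2];         /* u[0] and u[NLOCAL+1] are ghost cells */
    double unew[NLOCAL + 2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each subregion initializes its own independent grid block. */
    for (int i = 0; i <= NLOCAL + 1; i++)
        u[i] = (double)rank;

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange boundary values with the neighboring blocks;
     * MPI_PROC_NULL turns the physical-boundary cases into no-ops. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* With fresh ghost values, each block updates its interior
     * independently (here, one Jacobi-style averaging sweep). */
    for (int i = 1; i <= NLOCAL; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
    memcpy(&u[1], &unew[1], NLOCAL * sizeof(double));

    MPI_Finalize();
    return 0;
}
```

The same pattern extends to two- and three-dimensional decompositions, where each grid block exchanges ghost planes with several neighbors on every solution step.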

A broad spectrum of scientific applications using parallel computing strategies has been developed at SwRI. These application areas include: turbulent, compressible/incompressible, reactive flow in complex geometries; subsurface reactive flow transport; space weather simulation; probabilistic structural dynamics; and large-deformation material response modeling. The algorithms that solve these problems use a range of sub-algorithms with different computational characteristics. These sub-algorithms may then be mapped to a distributed virtual computer in which different computer resources are matched to the requirements of the different sub-algorithms. The efficient use of these distributed computer resources requires an understanding of the scalability of the solution process, as well as tools for dynamic load balancing, fault tolerance, and recovery. Tools that perform dynamic resource allocation, in which both the computer and the network pipeline connecting it to other resources are monitored for performance in the dynamic balancing of computations, are being developed at SwRI.
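One common way to realize dynamic load balancing on a heterogeneous virtual computer is a master/worker task farm, in which work units are handed out on demand so that faster or less-loaded nodes naturally process more tasks. The sketch below, again in C with MPI, is a hedged illustration of that pattern; the task granularity and the stand-in computation are assumptions, not SwRI's tools.

```c
/* Minimal sketch of dynamic load balancing in the master/worker
 * style: rank 0 hands out tasks on demand, so faster or less-loaded
 * nodes of the virtual computer automatically receive more work. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

#define NTASKS   64
#define TAG_WORK 1
#define TAG_DONE 2
#define TAG_STOP 3

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                       /* master */
        int next = 0, active = 0;
        double result, total = 0.0;
        MPI_Status st;

        /* Seed each worker with one task. */
        for (int w = 1; w < nprocs && next < NTASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

        /* Hand out the remaining tasks as workers finish. */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &st);
            total += result;
            active--;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
                active++;
            }
        }
        /* Tell every worker to shut down (zero-length message). */
        for (int w = 1; w < nprocs; w++)
            MPI_Send(&w, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        printf("total = %f\n", total);
    } else {                               /* worker */
        int task;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            /* Stand-in for a real sub-algorithm's computation. */
            double result = sqrt((double)task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Because a worker requests new work only when it finishes, the schedule adapts automatically as node load or network performance changes, which is the same property the resource-monitoring tools described above are designed to exploit.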