• Architectures (Tuesday 10:30AM-Noon)
    Room A201/205
    Access Grid Enabled
    Chair: Burton Smith, Cray Inc.

    • Title: Scientific Computing on the Itanium (TM) processor
    • Authors:
      Bruce Greer (Intel Corporation)
      John Harrison (Intel Corporation)
      Greg Henry (Intel Corporation)
      Wei Li (Intel Corporation)
      Peter Tang (Intel Corporation)
    • Abstract:
      The 64-bit Intel(R) Itanium(TM) architecture is designed for high-performance scientific and enterprise computing, and the Itanium processor is its first silicon implementation. Features such as extensive arithmetic support, predication, speculation, and explicit parallelism can be used to provide a sound infrastructure for supercomputing. A large number of high-performance computer companies are offering Itanium(TM)-based systems, some capable of peak performance exceeding 50 GFLOPS. In this paper we give an overview of the most relevant architectural features and provide illustrations of how these features are used in both low-level and high-level support for scientific and engineering computing, including transcendental functions and linear algebra kernels.

    • Title: The Sun Fireplane System Interconnect
    • Authors:
      Alan E Charlesworth (Sun Microsystems, Inc.)
    • Abstract:
      System interconnect is a key determiner of the cost, performance, and reliability of large cache-coherent, shared-memory multiprocessors. Interconnect implementations have to accommodate ever greater numbers of ever faster processors. This paper describes the Sun Fireplane two-level cache-coherency protocol, and its use in the medium and large-sized UltraSPARC-III-based Sun Fire servers.

    • Title: Parallel Graphics and Interactivity with the Scaleable Graphics Engine
    • Authors:
      Kenneth A. Perrine (Pacific Northwest National Laboratory)
      Donald R. Jones (Pacific Northwest National Laboratory)
    • Abstract:
      A parallel rendering environment is being developed to utilize the IBM Scaleable Graphics Engine (SGE), a hardware frame buffer for parallel computers. Goals of this software development effort include finding efficient ways of producing and displaying graphics generated on IBM SP nodes and of assisting programmers in adapting or creating scientific simulation applications to use the SGE. Four software development phases that utilize the SGE are discussed: tunneling, SMP rendering, development of an OpenGL API implementation that uses the SGE in parallel environments, and threaded additions to that SGE-enabled OpenGL implementation. The performance observed in software tests shows that programmers will be able to use the SGE to output interactive graphics in a parallel environment.
  • Software Scalability (Tuesday 10:30AM-Noon)
    Room A102/104/106
    Chair: Klaus Schauser, University of California, Santa Barbara

    • Title: Scalable Parallel Application Launch on Cplant
    • Authors:
      Ron Brightwell (Sandia National Laboratories)
      Lee Ann Fisk (Sandia National Laboratories)
    • Abstract:
      This paper describes the components of a runtime system for launching parallel applications and presents performance results for starting a job on more than a thousand nodes of a workstation cluster. This runtime system was developed at Sandia National Laboratories as part of the Computational Plant (Cplant) project, which is deploying large-scale parallel computing clusters using commodity hardware and the Linux operating system. We have designed and implemented a flexible runtime system that allows for launching parallel jobs on thousands of nodes in a matter of seconds. The interactions of the components are described, and the key issues that address the scalability and performance of the runtime system are discussed. We also present performance results of launching executables of varying sizes on more than a thousand nodes.

    • Title: Scaling Irregular Parallel Codes with Minimal Programming Effort
    • Authors:
      Dimitrios S. Nikolopoulos (University of Illinois at Urbana-Champaign)
      Constantine D. Polychronopoulos (University of Illinois at Urbana-Champaign)
      Eduard Ayguade (Universitat Politecnica de Catalunya)
    • Abstract:
      The long-foreseen goal of parallel programming models is to scale parallel code without significant programming effort. Irregular parallel applications are a particularly challenging domain for parallel programming models, since they require domain-specific data distribution and load balancing algorithms. From a performance perspective, shared-memory models still fall short of scaling as well as message-passing models in irregular applications, although they require less coding effort. We present a simple runtime methodology for scaling irregular applications parallelized with the standard OpenMP interface. We claim that our parallelization methodology requires the minimum amount of effort from the programmer and demonstrate experimentally that it is able to scale two highly irregular codes as well as MPI, with an order of magnitude less programming effort. This is probably the first time such a result has been obtained with OpenMP, and moreover while keeping the OpenMP API intact.

    • Title: A Parallel Java Grande Benchmark Suite
    • Authors:
      L. A. Smith (EPCC, University of Edinburgh)
      J. M. Bull (EPCC, University of Edinburgh)
      J. Obdrzalek (Faculty of Informatics, Masaryk University)
    • Abstract:
      Increasing interest is being shown in the use of Java for large-scale or Grande applications. This new use of Java places specific demands on Java execution environments, which can be tested using the Java Grande benchmark suite. The large processing requirements of Grande applications make parallelisation of interest. A suite of parallel benchmarks has been developed from the serial Java Grande benchmark suite, using three parallel programming models: Java native threads, MPJ (a message passing interface) and JOMP (a set of OpenMP-like directives). The contents of the suite are described, and results are presented for a number of platforms.
  • Communication Structures (Tuesday 10:30AM-Noon)
    Room A108/110/112
    Chair: Phil Papadopoulos, NPACI

    • Title: ORT - A Communication Library for Orthogonal Processor Groups
    • Authors:
      Thomas Rauber (Institut für Informatik, Universität Halle-Wittenberg)
      Robert Reilein (Fakultät für Informatik, Technische Universität Chemnitz)
      Gudula Rünger (Fakultät für Informatik, Technische Universität Chemnitz)
    • Abstract:
      Many implementations on message-passing machines can benefit from the exploitation of mixed task and data parallelism. A suitable parallel programming model is a group-SPMD model, which requires a structuring of the processors into subsets and a partition of the program into multi-processor tasks. In this paper, we introduce library support for the specification of message-passing programs in a group-SPMD style, allowing different partitions within a single program. We describe the functionality and implementation of the library functions and illustrate the library programming style with example programs. The examples show that the runtime on distributed memory machines can be considerably reduced by using the library.
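
      The row/column structuring that such a library supports can be sketched in plain Python standing in for MPI communicator splitting (the function names here are illustrative, not part of ORT):

```python
def orthogonal_groups(rank, rows, cols):
    """Place a process 'rank' in a rows x cols virtual grid and return the
    ids of its two orthogonal groups: its row group and its column group."""
    assert 0 <= rank < rows * cols
    return rank // cols, rank % cols   # (row group id, column group id)

def row_members(row, cols):
    """All ranks belonging to one row group."""
    return [row * cols + c for c in range(cols)]

def col_members(col, rows, cols):
    """All ranks belonging to one column group."""
    return [col + r * cols for r in range(rows)]
```

      With MPI itself, the same two orthogonal partitions would typically be realized by calling MPI_Comm_split twice, using the row id and the column id as the two color arguments.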

    • Title: On-the-fly Calculation and Verification of Consistent Steering Transactions
    • Authors:
      David W. Miller (University of Georgia)
      Jinhua Guo (University of Georgia)
      Eileen Kraemer (University of Georgia)
      Yin Xiong (University of Georgia)
    • Abstract:
      Interactive steering can be a valuable tool for understanding and controlling a distributed computation in real time. With interactive steering, the user may change the state of a computation by adjusting application parameters on-the-fly. In our system, we model both the program's execution and steering actions in terms of transactions. We define a steering transaction as consistent if its vector time is not concurrent with the vector time of any program transaction. That is, consistent steering transactions occur "between" program transactions, at a point that represents a consistent cut. In this paper, we present an algorithm for verifying the consistency of steering transactions. The algorithm analyzes a record of the program transactions and compares it against the steering transaction; if the time at which the steering transaction was applied is inconsistent, the algorithm generates a vector representing the earliest consistent time at which the steering transaction could have been applied.
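
      The consistency test described in the abstract reduces to a standard vector-clock comparison; the following sketch shows the core check (names and representation are illustrative, not taken from the paper):

```python
def compare(u, v):
    """Order two vector timestamps: 'before', 'after', 'equal', or 'concurrent'."""
    le = all(a <= b for a, b in zip(u, v))
    ge = all(a >= b for a, b in zip(u, v))
    if le and ge:
        return "equal"
    return "before" if le else ("after" if ge else "concurrent")

def is_consistent(steer_vt, program_vts):
    """A steering transaction is consistent if its vector time is ordered
    (never concurrent) with respect to every program transaction."""
    return all(compare(steer_vt, vt) != "concurrent" for vt in program_vts)
```

      A steering action whose vector time is concurrent with some program transaction falls on no consistent cut, which is exactly the case the paper's algorithm repairs by computing the earliest consistent application time.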

    • Title: Removing the Overhead from Software-Based Shared Memory
    • Authors:
      Zoran Radovic (Uppsala University)
      Erik Hagersten (Uppsala University)
    • Abstract:
      The implementation presented in this paper---DSZOOM-WF---is a sequentially consistent, fine-grained, distributed, software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, compared with roughly ten microseconds for the fastest implementation to date.

      The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard. All interrupt- and/or poll-based asynchronous protocol processing is completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that otherwise would stall. The technique is applicable to both page-based and fine-grain software-based shared memory.

      DSZOOM-WF consistently demonstrates performance comparable to hardware-based distributed shared memory implementations.

  • Material Science Applications (Tuesday 1:30-3:00PM)
    Room A201/205
    Access Grid Enabled
    Chair: Robert Eades, Pacific Northwest Laboratory

    • Title: Scalable Atomistic Simulation Algorithms for Materials Research
    • Authors:
      Aiichiro Nakano (Louisiana State University)
      Rajiv K. Kalia (Louisiana State University)
      Priya Vashishta (Louisiana State University)
      Timothy J. Campbell (Logicon Inc. and Naval Oceanographic Office Major Shared Resource Center)
      Shuji Ogata (Yamaguchi University, Japan)
      Fuyuki Shimojo (Hiroshima University, Japan)
      Subhash Saini (NASA Ames Research Center)
    • Abstract:
      A suite of scalable atomistic simulation programs has been developed for materials research based on space-time multiresolution algorithms. Design and analysis of parallel algorithms are presented for molecular dynamics (MD) simulations and quantum-mechanical (QM) calculations based on the density functional theory. Performance tests have been carried out on 1,088-processor Cray T3E and 1,280-processor IBM SP3 computers. The linear-scaling algorithms have enabled 6.44-billion-atom MD and 111,000-atom QM calculations on 1,024 SP3 processors with parallel efficiency well over 90%. The production-quality programs also feature wavelet-based computational-space decomposition for adaptive load balancing, spacefilling-curve-based adaptive data compression with user-defined error bound for scalable I/O, and octree-based fast visibility culling for immersive and interactive visualization of massive simulation data.

    • Title: An 8.61 Tflop/s Molecular Dynamics Simulation for NaCl with a Special-Purpose Computer: MDM
    • Authors:
      Tetsu Narumi (RIKEN)
      Atsushi Kawai (RIKEN)
      Takahiro Koishi (RIKEN)
      Gordon Bell Prize Finalist
    • Abstract:
      We performed a molecular dynamics (MD) simulation of 33 million pairs of NaCl ions with the Ewald summation and obtained a calculation speed of 8.61 Tflop/s. In this calculation we used a special-purpose computer, MDM, which we have developed for the calculation of the Coulomb and van der Waals forces. The MDM enabled us to perform large-scale MD simulations without truncating the Coulomb force. It is composed of MDGRAPE-2, WINE-2, and a host computer. MDGRAPE-2 accelerates the calculation of the real-space part of the Coulomb and van der Waals forces. WINE-2 accelerates the calculation of the wavenumber-space part of the Coulomb force. The host computer performs the remaining calculations. With the completed MDM system we performed an MD simulation similar to the one that formed the basis of our SC2000 Gordon Bell prize submission. With this large-scale MD simulation, we can dramatically decrease the fluctuation of the temperature to less than 0.1 Kelvin.
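
      The division of labor between MDGRAPE-2 and WINE-2 follows the standard Ewald decomposition of the Coulomb kernel, illustrated below for a single pair distance (alpha is an illustrative splitting parameter, not a value from the paper):

```python
import math

def coulomb_split(r, alpha):
    """Ewald splitting of the 1/r Coulomb kernel: a rapidly decaying part
    evaluated in real space (MDGRAPE-2's role) plus a smooth part whose
    sum is taken in wavenumber space (WINE-2's role)."""
    real_part = math.erfc(alpha * r) / r    # short-ranged, safely truncated
    smooth_part = math.erf(alpha * r) / r   # long-ranged but smooth
    return real_part, smooth_part

# The two parts reconstruct the untruncated kernel exactly: erf + erfc = 1.
a, b = coulomb_split(2.5, 1.0)
assert abs((a + b) - 1.0 / 2.5) < 1e-12
```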

    • Title: Multi-teraflops Spin Dynamics Studies of the Magnetic Structure of FeMn/Co Interfaces
    • Authors:
      A. Canning (Lawrence Berkeley National Laboratory)
      B. Ujfalussy (University of Tennessee)
      T. C. Shulthess (Oak Ridge National Laboratory)
      X. G. Zhang (Oak Ridge National Laboratory)
      W. A. Shelton (Oak Ridge National Laboratory)
      D. M. C. Nicholson (Oak Ridge National Laboratory)
      G. M. Stocks (Oak Ridge National Laboratory)
      Yang Wang (Pittsburgh Supercomputer Center)
      T. Dirks (IBM)
      Gordon Bell Prize Finalist
    • Abstract:
      We have used the power of massively parallel computers to perform first principles spin dynamics (SD) simulations of the magnetic structure of Iron-Manganese/Cobalt (FeMn/Co) interfaces. These large scale quantum mechanical simulations, involving 2016-atom super-cell models, reveal details of the orientational configuration of the magnetic moments at the interface that are unobtainable by any other means. Exchange bias, which involves the use of an antiferromagnetic (AFM) layer such as FeMn to pin the orientation of the magnetic moment of a proximate ferromagnetic (FM) layer such as Co, is of fundamental importance in magnetic multilayer storage and read head devices. Here the equation of motion of first principles SD is used to perform relaxations of model magnetic structures to the true ground (equilibrium) state. Our code is intrinsically parallel and has achieved a maximum execution rate of 2.46 Teraflops on the IBM SP at the National Energy Research Scientific Computing Center (NERSC).
  • Mesh Methods (Tuesday 1:30-3:00PM)
    Room A102/104/106
    Chair: Steve Ashby, Lawrence Livermore National Laboratory

    • Title: Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation
    • Authors:
      Greg L. Bryan (MIT)
      Tom Abel (Harvard)
      Michael L. Norman (UCSD)
      Gordon Bell Prize Finalist
    • Abstract:
      As an entry for the 2001 Gordon Bell Award in the "special" category, we describe our 3-d, hybrid, adaptive mesh refinement (AMR) code Enzo designed for high-resolution, multiphysics, cosmological structure formation simulations. Our parallel implementation places no limit on the depth or complexity of the adaptive grid hierarchy, allowing us to achieve unprecedented spatial and temporal dynamic range. We report on a simulation of primordial star formation which develops over 8000 subgrids at 34 levels of refinement to achieve a local refinement of a factor of 10**12 in space and time. This allows us to resolve the properties of the first stars which form in the universe assuming standard physics and a standard cosmological model. Achieving extreme resolution requires the use of 128-bit extended precision arithmetic (EPA) to accurately specify the subgrid positions. We describe our EPA AMR implementation on the IBM SP2 Blue Horizon system at the San Diego Supercomputer Center.
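
      Extended precision is needed because subgrid positions at a refinement factor of 10**12 cannot be located to sufficient relative accuracy in 64-bit arithmetic. One classic building block for software extended precision, shown here as an illustration rather than the paper's actual EPA implementation, is the error-free two-sum transformation:

```python
def two_sum(a, b):
    """Knuth's error-free transformation: returns (s, e) where s is the
    rounded floating-point sum and e the exact rounding error, so that
    a + b == s + e exactly. Such (s, e) pairs are the basis of
    double-double / extended-precision arithmetic."""
    s = a + b
    bv = s - a            # the part of b that made it into s
    av = s - bv           # the part of a that made it into s
    e = (a - av) + (b - bv)
    return s, e

# The tiny addend is lost in the rounded sum but captured exactly in e.
s, e = two_sum(1.0, 1e-20)
```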

    • Title: A Distributed Memory Unstructured Gauss-Seidel Algorithm for Multigrid Smoothers
    • Authors:
      Mark Adams (Sandia National Laboratories)
    • Abstract:
      Gauss-Seidel is a popular multigrid smoother as it is provably optimal on structured grids and exhibits superior performance on unstructured grids. To our knowledge, Gauss-Seidel is not used on distributed memory machines, as it is not obvious how to parallelize it effectively. We, among others, have found that Krylov solvers preconditioned with Jacobi, block Jacobi, or overlapped Schwarz are effective on unstructured problems. Gauss-Seidel does, however, have some attractive properties, namely fast convergence, no global communication (i.e., no dot products), and fewer flops per iteration, as one can incorporate an initial guess naturally. This paper discusses an algorithm for parallelizing Gauss-Seidel for distributed memory computers for use as a multigrid smoother and compares its performance with preconditioned conjugate gradients on unstructured linear elasticity problems with up to 76 million degrees of freedom.
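
      For reference, a serial Gauss-Seidel sweep looks as follows; the in-place updates that immediately reuse freshly computed values are both the source of the fast convergence and the reason the method is hard to parallelize (a minimal sketch, not the paper's distributed algorithm):

```python
def gauss_seidel(A, b, x, sweeps=1):
    """In-place Gauss-Seidel sweeps for the dense system A x = b.
    Each unknown is updated immediately using the newest neighbor values."""
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x
```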

    • Title: Multilevel Algorithms for Generating Coarse Grids for Multigrid Methods
    • Authors:
      Irene Moulitsas (University of Minnesota)
      George Karypis (University of Minnesota)
    • Abstract:
      Geometric Multigrid methods have gained widespread acceptance for solving large systems of linear equations, especially for structured grids. One of the challenges in successfully extending these methods to unstructured grids is the problem of generating an appropriate set of coarse grids. The focus of this paper is the development of robust algorithms, both serial and parallel, for generating a sequence of coarse grids from the original unstructured grid. Our algorithms treat the problem of coarse grid construction as an optimization problem that tries to optimize the overall quality of the resulting fused elements. We solve this problem using the multilevel paradigm that has been very successful in solving the related grid/graph partitioning problem. The parallel formulation of our algorithm incurs a very small communication overhead, achieves high degree of concurrency, and maintains the high quality of the coarse grids obtained by the serial algorithm.
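
      A common ingredient of the multilevel paradigm is matching-based coarsening, sketched here under the assumption of a weighted-graph view of the grid (this is the generic heavy-edge matching idea, not the paper's quality-optimizing algorithm):

```python
def heavy_edge_matching(adj):
    """One coarsening level: greedily match each unmatched vertex with its
    unmatched neighbor of heaviest edge weight, then fuse matched pairs.
    adj: {v: {u: weight}} for an undirected graph. Returns a map from fine
    vertices to coarse vertex ids."""
    mate = {}
    for v in adj:
        if v in mate:
            continue
        candidates = [(w, u) for u, w in adj[v].items() if u not in mate]
        if candidates:
            _, u = max(candidates)
            mate[v], mate[u] = u, v
        else:
            mate[v] = v          # no free neighbor; carried down as a singleton
    coarse, next_id = {}, 0
    for v in adj:
        if v not in coarse:
            coarse[v] = coarse[mate[v]] = next_id
            next_id += 1
    return coarse
```

      Applying such a pass recursively yields the sequence of successively coarser grids that the multigrid cycle requires.
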
  • Computational Grid Portals & Networks (Tuesday 1:30-3:00PM)
    Room A108/110/112
    Chair: David Abramson, Monash University, AU

    • Title: The XCAT Science Portal
    • Authors:
      Sriram Krishnan (Department of Computer Science, Indiana University, Bloomington, IN)
      Randall Bramley (Department of Computer Science, Indiana University, Bloomington, IN)
      Dennis Gannon (Department of Computer Science, Indiana University, Bloomington, IN)
      Madhusudhan Govindaraju (Department of Computer Science, Indiana University, Bloomington, IN)
      Rahul Indurkar (Department of Computer Science, Indiana University, Bloomington, IN)
      Aleksander Slominski (Department of Computer Science, Indiana University, Bloomington, IN)
      Benjamin Temko (Department of Computer Science, Indiana University, Bloomington, IN)
      Jay Alameda (National Computational Science Alliance)
      Richard Alkire (Department of Chemical Engineering, University of Illinois, Urbana-Champaign, IL)
      Timothy Drews (Department of Chemical Engineering, University of Illinois, Urbana-Champaign, IL)
      Eric Webb (Department of Chemical Engineering, University of Illinois, Urbana-Champaign, IL)
      Best Student Paper Finalist
    • Abstract:
      The design and prototype implementation of the XCAT Grid Science Portal is described in this paper. The portal lets grid application programmers easily script complex distributed computations and package these applications with simple interfaces for others to use. Each application is packaged as a "notebook" which consists of web pages and editable parameterized scripts. The portal is a workstation-based specialized "personal" web server, capable of executing the application scripts and launching remote grid applications for the user. The portal server can receive event streams published by the application and grid resource information published by Network Weather Service (NWS) or Autopilot sensors. Notebooks can be "published" and stored in web based archives for others to retrieve and modify. The XCAT Grid Science Portal has been tested with various applications, including the distributed simulation of chemical processes in semiconductor manufacturing and collaboratory support for X-ray crystallographers.

    • Title: A Jini-based Computing Portal System
    • Authors:
      Toyotaro Suzumura (Tokyo Institute of Technology)
      Satoshi Matsuoka (Tokyo Institute of Technology)
      Hidemoto Nakada (National Institute of Advanced Industrial Science and Technology/Tokyo Institute of Technology)
    • Abstract:
      JiPANG (A Jini-based Portal Augmenting Grids) is a portal system and toolkit that provides a uniform access-interface layer to a variety of Grid systems, built on top of Jini distributed object technology. JiPANG performs uniform higher-level management of the computing services and resources managed by individual Grid systems such as Ninf, NetSolve, and Globus. In order to give the user a uniform interface to the Grids, JiPANG provides a set of simple Java APIs called the JiPANG Toolkits and, furthermore, allows the user to interact with Grid systems, again in a uniform way, using the JiPANG Browser application. With JiPANG, users need not install any client packages beforehand to interact with Grid systems, nor be concerned about updating to the latest version. We believe such uniform, transparent services, available in a ubiquitous manner, are essential for the success of the Grid as a viable computing platform for the next generation.

    • Title: Efficient Network and I/O Throttling for Fine-Grain Cycle Stealing
    • Authors:
      Kyung D. Ryu (Dept. of Computer Science and Engineering, Arizona State University)
      Jeffrey K. Hollingsworth (Dept. of Computer Science, University of Maryland)
      Peter J. Keleher (Dept. of Computer Science, University of Maryland)
    • Abstract:
      This paper proposes and evaluates a new mechanism, rate windows, for I/O and network rate policing. The goal of the proposed system is to provide a simple, yet effective way to enforce resource limits on target classes of jobs in a system. This work was motivated by our Linger Longer infrastructure, which harvests idle cycles in networks of workstations. Network and I/O throttling is crucial because Linger Longer can leave guest jobs on non-idle nodes and machine owners should not be adversely affected. Our approach is quite simple. We use a sliding window of recent events to compute the average rate for a target resource. The assigned limit is enforced by the simple expedient of putting application processes to sleep when they issue requests that would bring their resource utilization out of the allowable profile. Our I/O system call intercept model makes the rate windows mechanism light-weight and highly portable. Our experimental results show that we are able to limit resource usage to within a few percent of target usages.
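
      The sliding-window mechanism can be sketched as follows (a hedged illustration of the idea; the class and method names are ours, and real enforcement happens at the system-call intercept layer described in the paper):

```python
class RateWindow:
    """Sliding window over recent I/O requests. When the average rate over
    the window exceeds the assigned limit, sleep_needed reports how long
    the process should be put to sleep before the request proceeds."""
    def __init__(self, limit_bytes_per_sec, window=8):
        self.limit = limit_bytes_per_sec
        self.window = window
        self.events = []                     # (timestamp, nbytes) pairs

    def record(self, now, nbytes):
        self.events.append((now, nbytes))
        self.events = self.events[-self.window:]

    def sleep_needed(self, now):
        if len(self.events) < 2:
            return 0.0
        span = now - self.events[0][0]
        total = sum(n for _, n in self.events)
        if span <= 0 or total / span <= self.limit:
            return 0.0
        # sleep just long enough for the window average to fall to the limit
        return total / self.limit - span
```

      Keeping only a short event history makes the check cheap enough to run on every intercepted call, which is what keeps the mechanism light-weight.
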
  • Computational Grid I/O (Tuesday 3:30-5:00PM)
    Room A102/104/106
    Chair: Rich Wolski, University of Tennessee, Knoxville

    • Title: LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications
    • Authors:
      Brian S. White (University of Virginia)
      Michael Walker (University of Virginia)
      Marty Humphrey (University of Virginia)
      Andrew S. Grimshaw (University of Virginia)
      Best Student Paper Finalist
    • Abstract:
      Realizing that current file systems cannot cope with the diverse requirements of wide-area collaborations, researchers have developed data access facilities to meet their needs. Recent work has focused on comprehensive data access architectures. In order to fulfill the evolving requirements in this environment, we suggest a more fully-integrated architecture built upon the fundamental tenets of naming, security, scalability, extensibility, and adaptability. These form the underpinning of the Legion File System (LegionFS). This paper motivates the need for these requirements and presents benchmarks that highlight the scalability of LegionFS. LegionFS aggregate throughput follows the linear growth of the network, yielding an aggregate read bandwidth of 193.8 MB/s on a 100 Mbps Ethernet backplane with 50 simultaneous readers. The serverless architecture of LegionFS is shown to benefit important scientific applications, such as those accessing the Protein Data Bank, within both local- and wide-area environments.

    • Title: High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies
    • Authors:
      Bill Allcock (Argonne National Laboratory)
      Ian Foster (Argonne National Laboratory)
      Veronika Nefedova (Argonne National Laboratory)
      Ann Chervenak (Information Sciences Institute, University of Southern California)
      Ewa Deelman (Information Sciences Institute, University of Southern California)
      Carl Kesselman (Information Sciences Institute, University of Southern California)
      Jason Lee (Lawrence Berkeley National Laboratory)
      Alex Sim (Lawrence Berkeley National Laboratory)
      Arie Shoshani (Lawrence Berkeley National Laboratory)
      Bob Drach (Lawrence Livermore National Laboratory)
      Dean Williams (Lawrence Livermore National Laboratory)
    • Abstract:
      In numerous scientific disciplines, terabyte and soon petabyte-scale data collections are emerging as critical community resources. A new class of Data Grid infrastructure is required to support management, transport, distributed access to, and analysis of these datasets by potentially thousands of users. Researchers who face this challenge include the Climate Modeling community, which performs long-duration computations accompanied by frequent output of very large files that must be further analyzed. We describe the Earth System Grid prototype, which brings together advanced analysis, replica management, data transfer, request management, and other technologies to support high-performance, interactive analysis of replicated data. We present performance results that demonstrate our ability to manage the location and movement of large datasets from the user's desktop. We report on experiments conducted over SciNET at SC2000, where we achieved peak performance of 1.55Gb/s and sustained performance of 512.9Mb/s for data transfers between Texas and California.   

    • Title: Gathering at the Well: Creating Communities for Grid I/O
    • Authors:
      Douglas Thain (University of Wisconsin-Madison)
      John Bent (University of Wisconsin-Madison)
      Andrea Arpaci-Dusseau (University of Wisconsin-Madison)
      Remzi Arpaci-Dusseau (University of Wisconsin-Madison)
      Miron Livny (University of Wisconsin-Madison)
    • Abstract:
      Grid applications have demanding I/O needs. Schedulers must bring jobs and data in close proximity in order to satisfy throughput, scalability, and policy requirements. Most systems accomplish this by making either jobs or data mobile. We propose a system that allows jobs and data to meet by binding execution and storage sites together into I/O communities which then participate in the wide-area system. The relationships between participants in a community may be expressed by the ClassAd framework. Extensions to the framework allow community members to express indirect relations. We demonstrate our implementation of I/O communities by improving the performance of a key high-energy physics simulation on an international distributed system.
  • Algorithmic Load Balancing (Tuesday 3:30-5:00PM)
    Room A108/110/112
    Chair: Hugh Caffey, SUN

    • Title: Large Scale Parallel Structured AMR Calculations Using the SAMRAI Framework
    • Authors:
      Andrew M. Wissink (Lawrence Livermore National Lab)
      Richard D. Hornung (Lawrence Livermore National Lab)
      Scott R. Kohn (Lawrence Livermore National Lab)
      Steve S. Smith (Lawrence Livermore National Lab)
      Noah Elliott (Lawrence Livermore National Lab)
    • Abstract:
      This paper discusses the design and performance of the parallel data communication infrastructure in SAMRAI, a software framework for structured adaptive mesh refinement (SAMR) multi-physics applications. We describe requirements of such applications and how SAMRAI abstractions manage complex data communication operations found in them. Parallel performance is characterized for two adaptive problems solving hyperbolic conservation laws on up to 512 processors of the IBM ASCI Blue Pacific system. Results reveal good scaling for numerical and data communication operations but poorer scaling in adaptive meshing and communication schedule construction phases of the calculations. We analyze the costs of these different operations, addressing key concerns for scaling SAMR computations to large numbers of processors, and discuss potential changes to improve our current implementation.

    • Title: Parallel Interval-Newton Using Message Passing: Dynamic Load Balancing Strategies
    • Authors:
      Chao-Yang Gau (University of Notre Dame)
      Mark A. Stadtherr (University of Notre Dame)
    • Abstract:
      Branch-and-prune and branch-and-bound techniques are commonly used for intelligent search in finding all solutions, or the optimal solution, within a space of interest. The corresponding binary tree structure provides a natural parallelism allowing concurrent evaluation of subproblems using parallel computing technology. Of special interest here are techniques derived from interval analysis, in particular an interval-Newton/generalized-bisection procedure. In this context, we discuss issues of load balancing and work scheduling that arise in the implementation of parallel interval-Newton on a cluster of workstations using message passing, and describe and analyze techniques for this purpose. Results using an asynchronous diffusive load balancing strategy show that a consistently high efficiency can be achieved in solving nonlinear equations, providing excellent scalability, especially with the use of a two-dimensional torus virtual network. The effectiveness of the approach used, especially in connection with a novel stack management scheme, is also demonstrated in the consistent superlinear speedups observed in performing global optimization.   

    • Title: Dynamic Load Balancing of SAMR Applications on Distributed Systems
    • Authors:
      Zhiling Lan (Northwestern University)
      Valerie E. Taylor (Northwestern University)
      Greg Bryan (Massachusetts Institute of Technology)
      Best Student Paper Finalist
    • Abstract:
      Dynamic load balancing (DLB) for parallel systems has been studied extensively; however, DLB for distributed systems is relatively new. To efficiently utilize computing resources provided by distributed systems, an underlying DLB scheme must address both the heterogeneous and dynamic features of distributed systems. In this paper, we propose a DLB scheme for Structured Adaptive Mesh Refinement (SAMR) applications on distributed systems. While the proposed scheme can take into consideration (1) the heterogeneity of processors and (2) the heterogeneity and dynamic load of the networks, the focus of this paper is on the latter. The load-balancing process is divided into two phases: global load balancing and local load balancing. We also provide a heuristic method to evaluate the computational gain and redistribution cost for global redistribution. Experiments show that by using our distributed DLB scheme, the execution time can be reduced by 9%-46% compared to using a parallel DLB scheme that does not consider the heterogeneous and dynamic features of distributed systems.
  • Sea, Wind, & Fire (Wednesday 10:30AM-Noon)
    Room A102/104/106
    Chair: Tony Drummond, Lawrence Berkeley National Laboratory

    • Title: Coastal Ocean Modeling of the U.S. West Coast with Multiblock Grid and Dual-Level Parallelism
    • Authors:
      Phu V. Luong (Engineer Research and Development Center, Major Shared Resource Center)
      Clay P. Breshears (KAI Software, A Division of Intel Americas, Inc.)
      Le N. Ly (Naval Postgraduate School)
    • Abstract:
      In coastal ocean modeling, a one-block rectangular grid for a large domain has large memory requirements and long processing times. With complicated coastlines, the number of grid points used in the calculation is often equal to or smaller than the number of unused grid points. These problems have been a major concern for researchers in this field.

      Multiblock grid generation and dual-level parallel techniques are solutions that can overcome these problems. The Multiblock Grid Princeton Ocean Model (MGPOM) uses Message Passing Interface (MPI) to parallelize computations by assigning each grid block to a unique processor. Since not all grid blocks are of the same size, the workload between MPI processes varies. Pthreads is used to improve load balance.

      Performance results from the MGPOM model on a one-block grid and a 29-block grid simulation for the U.S. west coast demonstrate the efficacy of both the MPI-Only and MPI-Pthreads code versions.
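      The load-imbalance problem caused by unequal block sizes can be illustrated with a simple greedy assignment of blocks to MPI processes. This is a standard longest-processing-time heuristic with invented block sizes; MGPOM's actual block-to-processor mapping may differ.

```python
import heapq

def assign_blocks(block_sizes, nprocs):
    """Assign grid blocks to processes, largest block first,
    always to the currently least-loaded process."""
    heap = [(0, rank, []) for rank in range(nprocs)]
    heapq.heapify(heap)
    for blk, size in sorted(enumerate(block_sizes), key=lambda kv: -kv[1]):
        load, rank, blocks = heapq.heappop(heap)   # least-loaded process
        blocks.append(blk)
        heapq.heappush(heap, (load + size, rank, blocks))
    return sorted(heap, key=lambda t: t[1])        # (load, rank, blocks) by rank

# 29 unequal coastal grid blocks onto 8 processes (sizes are made up).
sizes = [90, 80, 75, 70, 60, 55, 50, 45, 40, 40, 35, 30, 30, 25, 25,
         20, 20, 20, 15, 15, 15, 10, 10, 10, 10, 5, 5, 5, 5]
result = assign_blocks(sizes, 8)
loads = [load for load, _, _ in result]
```

      Even this simple heuristic brings the per-process loads close together; residual imbalance is what the Pthreads layer then absorbs within each node.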

    • Title: Terascale spectral element dynamical core for atmospheric general circulation models
    • Authors:
      Richard D. Loft (National Center for Atmospheric Research)
      Stephen J. Thomas (National Center for Atmospheric Research)
      John M. Dennis (National Center for Atmospheric Research)
      Gordon Bell Prize Finalist
    • Abstract:
      Climate modeling is a grand challenge problem where scientific progress is measured not in terms of the largest problem that can be solved but by the highest achievable integration rate. These models have been notably absent in previous Gordon Bell competitions due to their inability to scale to large processor counts. A scalable and efficient spectral element atmospheric model is presented. A new semi-implicit time stepping scheme accelerates the integration rate relative to an explicit model by a factor of two, achieving 130 years per day at T63L30 equivalent resolution. Execution rates are reported for the standard shallow water and Held-Suarez climate benchmarks on IBM SP clusters. The explicit T170 equivalent multi-layer shallow water model sustains 343 Gflops at NERSC, 206 Gflops at NPACI (SDSC) and 127 Gflops at NCAR. An explicit Held-Suarez integration sustains 369 Gflops on 128 16-way IBM nodes at NERSC.
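      The integration-rate metric can be made concrete with a little arithmetic. The time-step and cost-per-step values below are invented placeholders, not the paper's measurements; the point is only that doubling the stable time step at a fixed cost per step doubles the simulated years per wall-clock day.

```python
def years_per_day(dt_seconds, wall_seconds_per_step):
    """Simulated years per wall-clock day of computing."""
    sim_seconds_per_wall_second = dt_seconds / wall_seconds_per_step
    # 86400 wall seconds per day; 365 * 86400 simulated seconds per year.
    return sim_seconds_per_wall_second * 86400.0 / (365.0 * 86400.0)

# Hypothetical: explicit scheme vs. a semi-implicit scheme with twice the
# stable time step at the same wall-clock cost per step.
explicit      = years_per_day(dt_seconds=300.0, wall_seconds_per_step=0.04)
semi_implicit = years_per_day(dt_seconds=600.0, wall_seconds_per_step=0.04)
```

      In practice a semi-implicit step costs somewhat more than an explicit one, so the realized speedup depends on the solver cost as well as the step size.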

    • Title: High Resolution Weather Modeling for Improved Fire Management
    • Authors:
      Kevin Roe (Maui High Performance Computing Center)
      Duane Stevens (University of Hawaii)
      Carol McCord (Maui High Performance Computing Center)
    • Abstract:
      A critical element of the accurate prediction of fire/weather behavior is knowledge of near-surface weather. Weather variables such as wind, temperature, humidity, and precipitation directly impact the practice of managing prescribed burns and fighting wild fires. State-of-the-art Numerical Weather Prediction (NWP), coupled with the use of high performance computing, now enables significantly improved short-term forecasting of near-surface weather at a 1-3 km grid resolution.

      This proof of concept project integrates two complementary model types to aid federal agencies in real-time management of fire. (1) A highly complex, full-physics mesoscale weather prediction model (MM5) which is applied in order to estimate the weather fields up to 72 hours in advance. (2) A diagnostic fire behavior model (FARSITE) takes the near-surface weather fields and computes the expected spread rate of a fire driven by wind, humidity, terrain, and fuels (i.e. vegetation).

  • Reconfigurable Architectures (Wednesday 10:30AM-Noon)
    Room A108/110/112
    Chair: Steve Reinhardt, Silicon Graphics, Inc.

    • Title: Delivering Acceleration: The Potential for Increased HPC Application Performance Using Reconfigurable Logic
    • Authors:
      David Caliga (SRC Computers, Inc)
      David Peter Barker (SUPERsmith)
    • Abstract:
      SRC Computers, Inc. has integrated adaptive computing into its SRC-6 high-end server, incorporating reconfigurable processors as peers to the microprocessors. Performance improvements resulting from reconfigurable computing can provide orders of magnitude speedups for a wide variety of algorithms. Reconfigurable logic in Field Programmable Gate Arrays (FPGAs) has shown great advantage to date in special purpose applications and specialty hardware. SRC Computers is working to bring this technology into the general purpose HPC world via an advanced system interconnect and enhanced compiler technology.   

    • Title: Parallel Dedicated Hardware Devices for Heterogeneous Computations
    • Authors:
      Alessandro Marongiu (CASPUR, Roma)
      Paolo Palazzari (ENEA-HPCN, Roma)
      Vittorio Rosato (ENEA-HPCN, Roma)
    • Abstract:
      We describe a design methodology that allows fast design and prototyping of dedicated hardware devices to be used in heterogeneous computations. The platforms used in heterogeneous computations consist of a general-purpose COTS architecture that hosts a dedicated hardware device; parts of the computation are mapped onto the former and parts onto the latter, so as to improve the overall computation efficiency. We report the design and prototyping of an FPGA-based hardware board to be used in the search for low-autocorrelation binary sequences. The circuit has been designed using a recently developed Parallel Hardware Generator (PHG) package, which produces synthesizable VHDL code starting from the specific algorithm expressed as a System of Affine Recurrence Equations (SARE). The performance of the realized device has been compared to that obtained for the same numerical application on several computational platforms.
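      The target computation is easy to state in software terms: for a binary (+1/-1) sequence of length n, minimize the sidelobe energy of the aperiodic autocorrelation. A reference implementation of the objective function (not the paper's hardware) is:

```python
def sidelobe_energy(seq):
    """E(s) = sum over shifts k > 0 of C_k^2, where
    C_k = sum_i s_i * s_(i+k) is the aperiodic autocorrelation."""
    n = len(seq)
    return sum(sum(seq[i] * seq[i + k] for i in range(n - k)) ** 2
               for k in range(1, n))

def merit_factor(seq):
    """Golay merit factor F = n^2 / (2E); larger is better."""
    return len(seq) ** 2 / (2.0 * sidelobe_energy(seq))

# Length-13 Barker sequence: every off-peak correlation has magnitude <= 1.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
energy = sidelobe_energy(barker13)   # 6
```

      The search space is all 2^n sign patterns, which is why dedicated hardware evaluating this objective at high throughput is attractive.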

    • Title: Cost Effectiveness of an Adaptable Computing Cluster
    • Authors:
      Keith D. Underwood (Clemson University)
      Ron R. Sass (Clemson University)
      Walter B. Ligon, III (Clemson University)
    • Abstract:
      With a focus on commodity PC systems, Beowulf clusters traditionally lack the cutting-edge network architectures, memory subsystems, and processor technologies found in their more expensive supercomputer counterparts. What Beowulf clusters lack in technology, they more than make up for with their significant cost advantage over traditional supercomputers. This paper presents the cost implications of an architectural extension that adds reconfigurable computing to the network interface of Beowulf clusters. A quantitative measure of cost-effectiveness is formulated to evaluate computing technologies. Here, cost-effectiveness is considered in the context of two applications: the 2D Fast Fourier Transform (2D-FFT) and integer sorting.
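      A minimal sketch of a performance-per-cost comparison of the kind the abstract alludes to. Every number below is invented for illustration; the paper's actual metric and measurements are not reproduced.

```python
def cost_effectiveness(perf_gflops, cost_dollars):
    """Delivered application performance per dollar."""
    return perf_gflops / cost_dollars

# Hypothetical systems (all figures made up):
beowulf   = cost_effectiveness(perf_gflops=50.0,  cost_dollars=100_000.0)
adaptable = cost_effectiveness(perf_gflops=80.0,  cost_dollars=130_000.0)  # + FPGA NICs
superc    = cost_effectiveness(perf_gflops=500.0, cost_dollars=5_000_000.0)
```

      The interesting regime is when the reconfigurable hardware raises application performance faster than it raises system cost, as in the made-up middle row.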
  • Ground-Breaking Applications (Wednesday 1:30-3:00PM)
    Room A201/205
    Access Grid Enabled
    Chair: Jill Mesirov, Whitehead Institute

    • Title: Solution of a Three-Body Problem in Quantum Mechanics Using Sparse Linear Algebra on Parallel Computers
    • Authors:
      Mark Baertschy (University of Colorado)
      Xiaoye Li (Lawrence Berkeley National Laboratory)
    • Abstract:
      A complete description of two outgoing electrons following an ionizing collision between a single electron and an atom or molecule has long stood as one of the unsolved fundamental problems in quantum collision theory. In this paper we describe our use of distributed memory parallel computers to calculate a fully converged wave function describing the electron-impact ionization of hydrogen. Our approach hinges on a transformation of the Schrodinger equation that simplifies the boundary conditions but requires solving very ill-conditioned systems of a few million complex, sparse linear equations. We developed a two-level iterative algorithm that requires repeated solution of sets of a few hundred thousand linear equations. These are solved directly by LU factorization using a specially tuned, distributed memory parallel version of the sparse LU factorization library SuperLU. In smaller cases, where direct solution is technically possible, our iterative algorithm still gives significant savings in time and memory despite lower megaflop rates.

    • Title: Parallel implementation and performance of fastDNAml - a program for maximum likelihood phylogenetic inference
    • Authors:
      Craig A. Stewart (Indiana University)
      David Hart (Indiana University)
      Donald K. Berry (Indiana University)
      Gary J. Olsen (University of Illinois Urbana Champaign)
      Eric A. Wernert (Indiana University)
      William Fischer (Indiana University)
    • Abstract:
      This paper describes the parallel implementation of fastDNAml, a program for the maximum likelihood inference of phylogenetic trees from DNA sequence data. Mathematical means of inferring phylogenetic trees have been made possible by the wealth of DNA data now available. Maximum likelihood analysis of phylogenetic trees is extremely computationally intensive. Availability of computer resources is a key factor limiting use of such analyses. fastDNAml is implemented in serial, PVM, and MPI versions, and may be modified to use other message passing libraries in the future. We have developed a viewer for comparing phylogenies. We tested the scaling behavior of fastDNAml on an IBM RS/6000 SP up to 64 processors. The parallel version of fastDNAml is one of very few computational phylogenetics codes that scale well. fastDNAml is available for download as source code or compiled for Linux or AIX.   

    • Title: Modeling of Seismic Wave Propagation at the Scale of the Earth on a Large Beowulf
    • Authors:
      Dimitri Komatitsch (California Institute of Technology)
      Jeroen Tromp (California Institute of Technology)
    • Abstract:
      We use a parallel spectral-element method to simulate the propagation of seismic waves generated by earthquakes in the entire 3-D Earth. The method is implemented using MPI on a large PC cluster (Beowulf) with 151 processors and 76 GB of RAM. It is based upon a weak formulation of the equations of motion and combines the flexibility of a finite-element method with the accuracy of a pseudospectral method. The finite-element mesh honors all discontinuities in the Earth velocity model. To maintain a relatively constant number of grid points per seismic wavelength, the size of the elements is increased with depth in a conforming fashion, thus retaining a diagonal mass matrix. The effects of attenuation and anisotropy are incorporated. We benchmark spectral-element synthetic seismograms against a normal-mode reference solution for a spherically symmetric Earth velocity model. The two methods are in excellent agreement for all waves with periods greater than 20 seconds.
  • Information Retrieval & Transaction Processing (Wednesday 1:30-3:00PM)
    Room A102/104/106
    Chair: James Hoe, Carnegie Mellon University

    • Title: Efficient Execution of Multiple Query Workloads in Data Analysis Applications
    • Authors:
      Henrique Andrade (University of Maryland, College Park)
      Tahsin Kurc (The Ohio State University)
      Alan Sussman (University of Maryland, College Park)
      Joel Saltz (The Ohio State University)
    • Abstract:
      Applications that analyze, mine, and visualize large datasets are considered an important class of applications in many areas of science, engineering, and business. Queries commonly executed in data analysis applications often involve user-defined processing of data and application-specific data structures. If data analysis is employed in a collaborative environment, the data server should execute multiple such queries simultaneously to minimize the response time to clients. In this paper we present the design of a runtime system for executing multiple query workloads on a shared-memory machine. We describe experimental results using an application for browsing digitized microscopy images.

    • Title: Dynamic Page Placement to Improve Locality in CC-NUMA Multiprocessors for TPC-C
    • Authors:
      Kenneth M. Wilson (Apple Computer Inc.)
      Bob B. Aglietti (Advanced Micro Devices)
    • Abstract:
      The use of CC-NUMA multiprocessors complicates the placement of physical memory pages. Memory closest to a processor provides the best access time, but optimal memory page placement is a difficult problem given process movement, multiple processes requiring access to the same physical memory page, and application behavior that changes over execution time. We use dynamic page placement to move memory pages where needed for the database benchmark TPC-C executing on a four-node CC-NUMA multiprocessor. Dynamic page placement achieves local memory accesses up to 73% of the time, compared with the static page placement results of 34% locality with first touch and 25% with round robin. This can result in a 17% improvement in performance.
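      The connection between locality and access time can be sketched with a weighted-latency model using the paper's locality percentages and invented local/remote latencies:

```python
def avg_access_ns(local_fraction, t_local=200.0, t_remote=800.0):
    """Average memory latency under a given locality fraction.
    The 200 ns local / 800 ns remote latencies are round illustrative
    numbers, not measurements from the paper."""
    return local_fraction * t_local + (1.0 - local_fraction) * t_remote

round_robin = avg_access_ns(0.25)   # static round-robin placement
first_touch = avg_access_ns(0.34)   # static first-touch placement
dynamic     = avg_access_ns(0.73)   # dynamic page placement
```

      Memory-side latency drops substantially under dynamic placement in this model; the whole-application gain (17% in the paper) is smaller because memory stalls are only part of total execution time.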

    • Title: Compressing Inverted Files in Scalable Information Systems by Binary Decision Diagram Encoding
    • Authors:
      Chung-Hung Lai (National Chung Cheng University)
      Tien-Fu Chen (National Chung Cheng University)
      Best Student Paper Finalist
    • Abstract:
      One of the key challenges of managing very large volumes of data in scalable information retrieval systems is providing fast access through keyword searches. The major data structure in an information retrieval system is the inverted file, which records the positions of each term in the documents. When the information set grows substantially, the number of terms and documents increases significantly, as does the size of the inverted files.

      Approaches that reduce the inverted file without sacrificing query efficiency are important to the success of scalable information systems. In this paper, we propose a compression approach based on Binary Decision Diagram (BDD) encoding, in which ordering correlations among a large number of documents are extracted to minimize the posting representation. Another advantage of using BDDs is that BDD expressions can efficiently perform Boolean queries, which are very common in retrieval systems. Experimental results show that the compression ratios of the inverted files are improved significantly by the BDD scheme.
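      The core idea, representing a posting list as a boolean function over document-ID bits so that shared substructure is stored only once, can be sketched with a minimal reduced BDD. This is illustrative; the paper's actual encoding and variable ordering may differ.

```python
class BDD:
    """Tiny reduced ordered BDD over fixed-width document IDs."""
    def __init__(self, nbits):
        self.nbits = nbits
        self.unique = {}                    # (level, low, high) -> node id
        self.nodes = {0: None, 1: None}     # 0/1 terminals

    def _mk(self, level, low, high):
        if low == high:                     # redundant test: reduce it away
            return low
        key = (level, low, high)
        if key not in self.unique:          # hash-consing shares equal subgraphs
            nid = len(self.nodes)
            self.nodes[nid] = key
            self.unique[key] = nid
        return self.unique[key]

    def build(self, doc_ids, level=0, prefix=0):
        """Characteristic function of the set; one level per ID bit, MSB first."""
        if level == self.nbits:
            return 1 if prefix in doc_ids else 0
        low = self.build(doc_ids, level + 1, prefix << 1)
        high = self.build(doc_ids, level + 1, (prefix << 1) | 1)
        return self._mk(level, low, high)

    def contains(self, root, doc_id):
        node = root
        while node not in (0, 1):
            level, low, high = self.nodes[node]
            bit = (doc_id >> (self.nbits - 1 - level)) & 1
            node = high if bit else low
        return node == 1

bdd = BDD(nbits=10)
postings = set(range(0, 1024, 2))   # a term appearing in every even document
root = bdd.build(postings)
```

      The 512-document posting list collapses to a single decision node ("last ID bit is 0"), and membership tests double as Boolean query primitives.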

  • Performance Prediction (Wednesday 1:30-3:00PM)
    Room A110/112
    Chair: Richard Barrett, Los Alamos National Laboratory

    • Title: On Using SCALEA for Performance Analysis of Distributed and Parallel Programs
    • Authors:
      Hong-Linh Truong (University of Vienna)
      Thomas Fahringer (University of Vienna)
      Georg Madsen (Technical University of Vienna)
      Allen D. Malony (University of Oregon)
      Hans Moritsch (University of Vienna)
      Sameer Shende (University of Oregon)
    • Abstract:
      In this paper we give an overview of SCALEA, which is a new performance analysis tool for OpenMP, MPI, HPF, and mixed parallel/distributed programs. SCALEA instruments, executes, and measures programs and computes a variety of performance overheads based on a novel overhead classification. Source code and HW-profiling are combined in a single system which significantly extends the scope of possible overheads that can be measured and examined, ranging from HW-counters, such as the number of cache misses or floating point operations, to more complex performance metrics, such as control or loss of parallelism. Moreover, SCALEA uses a new representation of code regions, called the dynamic code region call graph, which enables detailed overhead analysis for arbitrary code regions. An instrumentation description file is used to relate performance information to code regions of the input program and to reduce instrumentation overhead. Several experiments with realistic codes that cover MPI, OpenMP, HPF, and mixed OpenMP/MPI codes demonstrate the usefulness of SCALEA.

    • Title: Modeling and Detecting Performance Problems for Distributed and Parallel Programs with JavaPSL
    • Authors:
      Thomas Fahringer (University of Vienna)
      Clóvis Seragiotto Júnior (University of Vienna)
    • Abstract:
      In this paper we present JavaPSL, a Performance Specification Language that can be used for a systematic and portable specification of large classes of experiment-related data and performance properties for distributed and parallel programs. Performance properties are described in a generic and normalized way, thus interpretation and comparison of performance properties is largely alleviated. Moreover, JavaPSL provides meta-properties in order to describe new properties based on existing ones and to relate properties to each other.

      JavaPSL uses Java and its powerful mechanisms, in particular, polymorphism, abstract classes, and reflection to describe experiment-related data and performance properties. JavaPSL can also be considered as a performance information interface based on which sophisticated performance tools can be built or other tools can access performance data in a portable way.

      We have implemented a prototype performance tool that uses JavaPSL to automatically detect performance bottlenecks for MPI, OpenMP, and mixed OpenMP and MPI programs. Several experiments with realistic codes demonstrate the usefulness of JavaPSL.

    • Title: Predictive Performance and Scalability Modeling of a Large-Scale Application
    • Authors:
      D. J. Kerbyson (Los Alamos National Laboratory)
      H. J. Alme (Los Alamos National Laboratory)
      A. Hoisie (Los Alamos National Laboratory)
      F. Petrini (Los Alamos National Laboratory)
      H. J. Wasserman (Los Alamos National Laboratory)
      M. Gittings (SAIC and Los Alamos National Laboratory)
    • Abstract:
      In this work we present a predictive analytical model that encompasses the performance and scaling characteristics of an important ASCI application. SAGE (SAIC's Adaptive Grid Eulerian hydrocode) is a multidimensional hydrodynamics code with adaptive mesh refinement. The model is validated against measurements on several systems including ASCI Blue Mountain, ASCI White, and a Compaq Alphaserver ES45 system showing high accuracy. It is parametric - basic machine performance numbers (latency, MFLOPS rate, bandwidth) and application characteristics (problem size, decomposition method, etc.) serve as input. The model is applied to add insight into the performance of current systems, to reveal bottlenecks, and to illustrate where tuning efforts can be effective. We also use the model to predict performance on future systems.
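      The shape of such a parametric model can be sketched generically. Every parameter value below is invented, not a measured SAGE characteristic; the point is the structure: compute time shrinks with P while per-step communication does not.

```python
def model_time_s(P, work_gflop=1000.0, gflops_per_proc=0.5,
                 msgs_per_step=8, latency_s=5e-6,
                 mbytes_per_step=1.0, bandwidth_mbytes_s=100.0):
    """T(P) = compute/P + messages * latency + volume / bandwidth.
    A generic latency/bandwidth model in the spirit of the paper."""
    compute = work_gflop / (P * gflops_per_proc)
    comm = msgs_per_step * latency_s + mbytes_per_step / bandwidth_mbytes_s
    return compute + comm

times = {P: model_time_s(P) for P in (16, 64, 256, 1024)}
```

      Evaluating the model across processor counts exposes where the communication terms become the bottleneck, which is exactly how such models reveal where tuning effort pays off.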
  • Algorithms (Wednesday 3:30-5:00PM)
    Room A102/104/106
    Chair: Olaf Lubeck, Los Alamos National Laboratory

    • Title: Stable, Globally Non-iterative, Non-overlapping Domain Decomposition Parallel Solvers for Parabolic Problems
    • Authors:
      Yu Zhuang (Texas Tech University)
      Xian-He Sun (Illinois Institute of Technology)
    • Abstract:
      In this paper, we report a class of stabilized explicit-implicit domain decomposition (SEIDD) methods for the parallel solution of parabolic problems, based on the explicit-implicit domain decomposition (EIDD) methods. EIDD methods are globally non-iterative, non-overlapping domain decomposition methods which, when compared with Schwarz alternating algorithm based parabolic solvers, are computationally and communicationally efficient for each simulation time step but suffer from time step size restrictions due to conditional stability or conditional consistency. By adding a stabilization step to the EIDD methods, the SEIDD methods are freed from time step size restrictions while retaining EIDD's computational and communicational efficiency for each time step, rendering themselves excellent candidates for large-scale parallel simulations. Three algorithms of the SEIDD type are implemented, which are experimentally tested to show excellent stability, computation and communication efficiencies, and high parallel speedup and scalability.
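      The EIDD structure the abstract starts from can be sketched for the 1-D heat equation: an explicit prediction at the interface decouples the subdomains, which can then be solved implicitly in parallel. The sketch below omits the paper's stabilization step and uses invented grid parameters.

```python
import math

def thomas(a, b, c, d):
    """Solve a tridiagonal system; a, b, c are the sub-, main, and
    super-diagonals (equal-length lists; a[0] and c[-1] are unused)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def eidd_step(u, r, m):
    """One explicit-implicit step for u_t = u_xx with r = dt/dx^2:
    explicit update at interface point m, then independent backward-Euler
    solves in each subdomain (these are the parallel pieces)."""
    n = len(u)
    u_new = u[:]
    u_new[m] = u[m] + r * (u[m - 1] - 2.0 * u[m] + u[m + 1])
    for lo, hi in ((1, m), (m + 1, n - 1)):      # interior ranges [lo, hi)
        k = hi - lo
        if k <= 0:
            continue
        a = [-r] * k
        b = [1.0 + 2.0 * r] * k
        c = [-r] * k
        d = u[lo:hi]
        d[0] += r * u_new[lo - 1]    # boundary values known at the new level
        d[-1] += r * u_new[hi]
        u_new[lo:hi] = thomas(a, b, c, d)
    return u_new

n = 21
dx = 1.0 / (n - 1)
r = 0.4                                           # dt = r * dx^2
u = [math.sin(math.pi * i * dx) for i in range(n)]
for _ in range(50):
    u = eidd_step(u, r, m=10)
```

      The explicit interface point is what makes each step globally non-iterative, and also what imposes the time-step restriction that the SEIDD stabilization step removes.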

    • Title: Stochastic Search for Signal Processing Algorithm Optimization
    • Authors:
      Bryan Singer (Carnegie Mellon University)
      Manuela Veloso (Carnegie Mellon University)
    • Abstract:
      This paper presents an evolutionary algorithm for finding optimal implementations of signal transforms and compares this approach against other search techniques. A single signal processing algorithm can be represented by a very large number of different but mathematically equivalent formulas. When these formulas are implemented in actual code, however, their running times differ significantly. Signal processing algorithm optimization aims at finding the fastest formula. We present a new approach that successfully solves this problem, using an evolutionary stochastic search algorithm, STEER, to search through the very large space of formulas. We empirically compare STEER against other search methods, showing that it can find notably faster formulas while timing only a very small portion of the search space.

    • Title: A Hypergraph-Partitioning Approach for Coarse-Grain Decomposition
    • Authors:
      Umit V. Catalyurek (The Ohio State University)
      Cevdet Aykanat (Bilkent University)
    • Abstract:
      We propose a new two-phase method for the coarse-grain decomposition of irregular computational domains. This work focuses on the 2D partitioning of sparse matrices for parallel matrix-vector multiplication. However, the proposed model can also be used to decompose the computational domains of other parallel reduction problems. This work also introduces the use of multi-constraint hypergraph partitioning for solving the decomposition problem. The proposed method explicitly models the minimization of communication volume while enforcing an upper bound of p+q-2 on the maximum number of messages handled by a single processor, for a parallel system with P = p x q processors. Experimental results on a wide range of realistic sparse matrices confirm the validity of the proposed methods, achieving up to 25 percent better partitions than the standard graph model in terms of total communication volume, and 59 percent better partitions in terms of number of messages, on overall average.
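      The p+q-2 bound has a simple communication-pattern reading: in a p x q processor grid, a processor exchanges x-vector pieces with the q-1 others in its row and partial y-results with the p-1 others in its column. A worst-case count for a dense communication pattern (a sketch of the bound, not the paper's partitioner):

```python
def simulate_messages(p, q):
    """Messages received per processor in a p x q checkerboard partition of
    y = A*x, dense worst case: x pieces are expanded along processor rows,
    partial y results are folded along processor columns."""
    recv = {(i, j): 0 for i in range(p) for j in range(q)}
    for i in range(p):
        for j in range(q):
            for jj in range(q):
                if jj != j:
                    recv[(i, jj)] += 1     # (i, j) sends its x piece row-wise
            for ii in range(p):
                if ii != i:
                    recv[(ii, j)] += 1     # (i, j) sends partial y column-wise
    return max(recv.values())

p, q = 4, 8
bound_2d = simulate_messages(p, q)   # p + q - 2
bound_1d = p * q - 1                 # 1D partition worst case for the same P
```

      For P = 32 processors arranged 4 x 8, the 2D worst case is 10 messages per processor versus 31 for a 1D partition, which is why the 2D decomposition scales better in message count.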
  • Novel Graphics & Grids (Wednesday 3:30-5:00PM)
    Room A110/112
    Chair: Maxine Brown, University of Illinois

    • Title: Fast Matrix Multiplies using Graphics Hardware
    • Authors:
      E. Scott Larsen (University of North Carolina at Chapel Hill)
      David McAllister (University of North Carolina at Chapel Hill)
      Best Student Paper Finalist
    • Abstract:
      We present a technique for large matrix-matrix multiplies using low-cost graphics hardware. The result is computed by literally visualizing the computations of a simple parallel processing algorithm. Current graphics hardware technology has limited precision and thus limits the immediate applicability of our algorithm. We include results demonstrating proof of concept, correctness, speedup, and a simple application. This is therefore forward-looking research: a technique ready for technology on the horizon.
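      The underlying algorithm is easy to state in software terms: the product accumulates one rank-1 update per rendering pass, with the framebuffer acting as the accumulator. A software stand-in follows (no GPU involved; how the paper maps each pass to texturing and blending is not reproduced here).

```python
def matmul_by_passes(A, B):
    """Multiply by accumulating one rank-1 'rendering pass' per inner index,
    mimicking additive framebuffer blending of outer products."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]         # the "framebuffer"
    for k in range(m):                        # one pass per column of A / row of B
        for i in range(n):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]      # additive blend of texel products
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_by_passes(A, B)   # [[19, 22], [43, 50]]
```

      On real graphics hardware each pass is a full-screen blend, so the limited framebuffer precision the abstract mentions accumulates across the m passes.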

    • Title: Next-Generation Visual Supercomputing using PC Clusters with Volume Graphics Hardware Devices
    • Authors:
      Shigeru Muraki (The National Institute of Advanced Industrial Science and Technology (AIST), Japan)
      Masato Ogata (Mitsubishi Precision Co., Ltd. (MPC))
      Kwan-Liu Ma (University of California, Davis)
      Kenji Koshizuka (MPC)
      Kagenori Kajihara (MPC)
      Xuezhen Liu (MPC)
      Yasutada Nagano (MPC)
      Kazuro Shimokawa (The National Institute of Advanced Industrial Science and Technology (AIST), Japan)
    • Abstract:
      To seek a low-cost, extensible solution to the large-scale data visualization problem, a visual computing system was designed as the result of a collaboration between industry and government research laboratories in Japan, with participation by researchers in the U.S. This scalable system is a commodity PC cluster equipped with VolumePro 500 volume graphics cards and specially designed image compositing hardware. Our performance study shows such a system is capable of interactively rendering 512^3 and 1024^3 volume data and is highly scalable. In particular, with such a system, simulation and visualization can be performed concurrently, which allows scientists to monitor and tune their simulations on the fly. In this paper, both the system and hardware designs are presented.

    • Title: Global Static Indexing for Real-time Exploration of Very Large Regular Grids
    • Authors:
      Valerio Pascucci (Lawrence Livermore National Laboratory)
      Randall Frank (Lawrence Livermore National Laboratory)
    • Abstract:
      In this paper we introduce a new indexing scheme for progressive traversal and visualization of large regular grids. We demonstrate the potential of our approach by providing a tool that displays at interactive rates planar slices of scalar field data with very modest computing resources. We obtain unprecedented results both in terms of absolute performance and, more importantly, in terms of scalability. On a laptop computer we provide real time interaction with a 2048^3 grid (8 Giga-nodes) using only 20MB of memory. On an SGI Onyx we slice interactively an 8192^3 grid (0.5 tera-nodes) using only 60MB of memory. The scheme relies simply on the determination of an appropriate reordering of the rectilinear grid data and a progressive construction of the output slice. The reordering minimizes the amount of I/O performed during the out-of-core computation. The progressive and asynchronous computation of the output provides flexible quality/speed tradeoffs and a time-critical and interruptible user interface.
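      A plain Z-order (Morton) index conveys the flavor of such a reordering: interleaving coordinate bits keeps samples that are close in 3-D close in the 1-D file layout, which in turn keeps slice queries I/O-coherent. The paper's scheme is a hierarchical variant; simple bit interleaving is shown here for illustration.

```python
def interleave3(i, j, k, bits=4):
    """Morton (Z-order) index of grid sample (i, j, k): the bits of the
    three coordinates are interleaved into one file offset."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (3 * b + 2)
        idx |= ((j >> b) & 1) << (3 * b + 1)
        idx |= ((k >> b) & 1) << (3 * b)
    return idx

# The eight samples of a 2x2x2 block map to eight consecutive indices,
# so a block read touches one contiguous range of the file.
block = sorted(interleave3(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1))
```

      Because the reordering is static, it can be computed once when the dataset is written, and the out-of-core traversal then reads mostly sequential ranges.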
  • Computational Grid Applications (Thursday 10:30AM-Noon)
    Room A102/104/106
    Chair: Ann Chervenak, Georgia Tech

    • Title: Applying Scheduling and Tuning to On-line Parallel Tomography
    • Authors:
      Shava Smallen (University of California, San Diego)
      Henri Casanova (University of California, San Diego)
      Francine Berman (University of California, San Diego)
      Best Student Paper Finalist
    • Abstract:
      Tomography is a popular technique to reconstruct the three-dimensional structure of an object from a series of two-dimensional projections. Tomography is resource-intensive and deployment of a parallel implementation onto Computational Grid platforms has been studied in previous work. In this work, we address on-line execution of the application where computation is performed as data is collected from an on-line instrument. The goal is to compute incremental 3-D reconstructions that provide quasi-real-time feedback to the user.

      We model on-line parallel tomography as a tunable application: trade-offs between resolution of the reconstruction and frequency of feedback can be used to accommodate various resource availabilities. We demonstrate that application scheduling/tuning can be framed as multiple constrained optimization problems and evaluate our methodology in simulation. Our results show that prediction of dynamic network performance is key to efficient scheduling and that tunability allows for production runs of on-line parallel tomography in Computational Grid environments.

    • Title: An Automatic Design Optimization Tool and its Application to Computational Fluid Dynamics
    • Authors:
      David Abramson (Monash University)
      Andrew Lewis (Griffith University)
      Tom Peachey (Monash University)
      Clive Fletcher (University of New South Wales)
    • Abstract:
      In this paper we describe the Nimrod/O design optimization tool, and its application in computational fluid dynamics. Nimrod/O facilitates the use of an arbitrary computational model to drive an automatic optimization process. This means that the user can parameterise an arbitrary problem, and then ask the tool to compute the parameter values that minimise or maximise a design objective function. The paper describes the Nimrod/O system, and then discusses a case study in the evaluation of an aerofoil problem. The problem involves computing the shape and angle of attack of the aerofoil that maximises the lift to drag ratio. The results show that our general approach is extremely flexible and delivers better results than a program that was developed specifically for the problem. Moreover, it only took us a few hours to set up the tool for the new problem and required no software development.

    • Title: Numerical Libraries And The Grid: The GrADS Experiments With ScaLAPACK
    • Authors:
      Antoine Petitet (Sun France Benchmark Center)
      Susan Blackford (University of Tennessee)
      Jack Dongarra (University of Tennessee)
      Brett Ellis (University of Tennessee)
      Graham Fagg (University of Tennessee)
      Kenneth Roche (University of Tennessee)
      Sathish Vadhiyar (University of Tennessee)
    • Abstract:
      This paper describes an overall framework for the design of numerical libraries on a computational Grid of processors, where the processors may be geographically distributed and under the control of a Grid-based scheduling system. A set of experiments is presented in the context of solving systems of linear equations using routines from the ScaLAPACK software collection along with various grid service components, such as Globus, NWS, and Autopilot.
  • Networking (Thursday 10:30AM-Noon)
    Room A110/112
    Chair: Steve Lumetta, University of Illinois

    • Title: EMP: Zero-copy OS-bypass NIC-drive