Poster Details


Multiple Paths for End-to-End Delay Minimization for Distributed Computing Over the Internet,
Nageswara S. V. Rao, Oak Ridge National Laboratory

The end-to-end delays experienced by messages exchanged between processes distributed over the Internet contain significant "random" components due to the complicated nature of traffic, in addition to bandwidth limits. We propose a measurement-based method for achieving low end-to-end delays over the Internet by using user-level daemons. These daemons handle the network tasks and also perform transport-level routing using delay regressions of the network links. They explicitly realize multiple paths via themselves without router support, and achieve physical diversity of the transmission paths as well as higher aggregated bandwidth compared to the usual and parallel TCP methods. Our experimental results indicate that this method is a viable and practical means for achieving low end-to-end delays for distributed computing applications over the Internet.

MPI I/O Performance of the Gigabyte System Network
Sunlung Suen and Parks M. Fields, Los Alamos National Laboratory

The fastest supercomputers harness thousands of processing nodes. However, with such large component counts, mean time to failure (MTTF) is often just a few hours. High-bandwidth checkpoints are crucial to enabling practical, large-scale utilization of these machines. MPI 2.0 adds parallel I/O functions for checkpoints. Unfortunately, adoption of the standard has been slowed by lackluster performance achieved against proprietary I/O methods. We demonstrate that high-bandwidth MPI I/O is sustainable with the new Gigabyte System Network (GSN).

Market-Based Architecture for Supercomputer Resource Management
Stephen D. Kleban and Scott Clearwater, Sandia National Labs

The First In Dutch Auction (FIDA) is a novel market-based resource management system designed for Supercomputer centers. FIDA offers the predictability of a FIFO queue along with the flexibility of a market-based auction to automatically handle high-priority jobs. The ability for users to select their own priority on a job-by-job basis is the key feature that distinguishes FIDA from other Supercomputer resource management systems. Users thus have a mechanism for determining their jobs' priority and the running order, as opposed to an algorithmic assignment of priorities. FIDA also permits backfilling and checkpoint restarts, all within its auction and predictability paradigm.

Fault-Tolerant Libraries for Data Processing Applications
Daniel S. Katz, Jet Propulsion Laboratory / California Institute of Technology

Many applications, such as the science data-processing applications developed as part of NASA's Remote Exploration and Experimentation (REE) Project, are mainly composed of linear operations, using linear algebra and signal processing libraries such as BLAS, ScaLAPACK, PLAPACK, and FFTW. REE's goal is to move ground-based supercomputing to space. Using COTS processors in the galactic cosmic-ray environment means that these processors will be subjected to transient faults. This paper discusses low-overhead methods that have been developed, implemented, and tested by REE to allow linear routines to detect transient faults, while still maintaining the original APIs.
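
The abstract does not say which detection methods REE implemented. As a hedged illustration of the general idea, the sketch below uses a checksum test, one well-known way to exploit linearity: for y = A x, the sum of the entries of y must equal the dot product of the column sums of A with x. The function names and the `kernel` parameter are hypothetical and are not taken from BLAS, ScaLAPACK, PLAPACK, or FFTW.

```python
def matvec(a, x):
    """Plain matrix-vector product y = A x on nested lists of floats."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def checked_matvec(a, x, rel_tol=1e-9, kernel=matvec):
    """Call `kernel` like matvec, then verify the result with a checksum.

    Linearity gives sum(A x) == (column sums of A) . x, so a single extra
    dot product tests the whole result vector. `kernel` is a hypothetical
    hook that lets a test substitute a deliberately faulty routine.
    """
    y = kernel(a, x)
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * x_j for c, x_j in zip(col_sums, x))
    actual = sum(y)
    # Compare relative to the magnitudes involved to tolerate rounding.
    scale = max(1.0, abs(expected), abs(actual))
    if abs(actual - expected) > rel_tol * scale:
        raise ArithmeticError("checksum mismatch: possible transient fault")
    return y
```

The caller sees the same arguments and return value as the unchecked routine, which mirrors the goal of keeping the original API. In this sketch the column sums are recomputed on every call; a real low-overhead scheme would presumably reuse them whenever the same matrix is applied repeatedly.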

Performance Studies on Compaq Clusters, Including Multi-Rail Performance On Standard Benchmarks
Henry J. Alme, Darren Kerbyson, Adolfy Hoisie, and Fabrizio Petrini, Los Alamos National Laboratory; David M. Race, Lori A. Pritchett, and John Daly, Raytheon

This poster presents serial and parallel performance characteristics of five Compaq computer systems: the AlphaServer ES40, ES45, and GS320, and the AlphaServer SC using ES40 and ES45 nodes. We study performance on three levels: the hardware, the interconnect, and the application. The variables considered include processor speed, interconnect topology (i.e., the ccNUMA of the GS320 vs. the QsNet of the SC computers), number of interconnect rails, amount of user memory, and number of processors. The results show that the SC system has very good performance characteristics in some areas (for example, an MPI "ping" latency of 4-6 microseconds) and very good scalability on some applications (such as Linpack), but also has areas that require significant performance improvements.

Distributed Production Control System
Morris Jette, Robert Wood, Philip Eckert, and Gregg Hommes, Lawrence Livermore National Laboratory

The Distributed Production Control System (DPCS) provides sophisticated scheduling, near real-time resource allocation, and resource accounting across a heterogeneous collection of computers. Of particular interest are DPCS's tight integration with gang schedulers providing job preemption, its optimized backfill scheduling for heterogeneous clusters, and its forecasting of the memory demands of pending jobs by combining customer input with historical information. Resource consumption rates of jobs initiated through any means (batch, interactive, cron, etc.) are both monitored and controlled. Resource accounting and workload characterization information is stored in a database with Kerberos-authenticated web access for graphical and tabular reports.

Hybrid High Performance Networking and High Performance Computing with Java Client/Server Enhanced Technology
Jun Ni, Research Technology, University of Iowa ITS

This poster presents our work in developing a hybrid HPN-HPC system that combines Java client/server technology with parallel computing. The system is composed of various client/server models. Each server cluster contains multiple servers, which perform specific tasks with Java/C/C++ parallel computing. Each client cluster contains multiple clients, which not only act as a communication interface to the servers but also perform customized subtasks such as interactive system design, problem initialization, distributed computing, and visualization. The system is built upon large-scale HPN and internal communication of clusters with interactive neuro-design.

Armada: A Framework for Parallel I/O on Computational Grids
Ron Oldfield and David Kotz, Dartmouth College

High-performance computing increasingly occurs on "computational grids" composed of heterogeneous and geographically distributed systems of computers, networks, and storage devices. A great challenge for grid systems is to provide efficient access to distributed datasets. Our solution is a framework, called Armada, that allows applications and dataset providers to flexibly compose graphs of processing modules that describe the distribution, application interfaces, and processing required of the dataset before computation. The Armada runtime system restructures each graph, and places the processing modules at appropriate hosts to reduce network traffic. We also present results demonstrating the effectiveness of our approach.

DHARMA: Domain-Specific Metaware for Hydrologic Applications
Daniel Andresen, Mitchell Neilsen, and Gurdip Singh, Dept. of Computing and Information Sciences, Kansas State University; Prasanta Kalita and Michael C. Hirschi, Dept. of Agricultural Engineering, University of Illinois, Urbana-Champaign

The DHARMA domain-specific middleware system is intended to allow hydrologic field engineers to tackle water-management problems on a scale previously impossible without sophisticated computational management systems. DHARMA provides automatic data acquisition via the Internet; data fusion from online, local, and cached resources; smart caching of intermediate results; and smart scheduling for metacomputing systems. Our target watershed model, WEPP, is limited to very small watersheds with current computer technology. Applying WEPP to the 925-square-mile Lake Decatur watershed will bring about a revolutionary change in hydrologic modeling on the watershed scale.

The Scalable Simulation Framework
David M. Nicol and Jason Liu, Dartmouth College

SSF is a widely used high performance simulation system designed for the modeling of large scale computer and communication networks, in both C++ and Java. SSF defines base classes which enable mixed process-oriented and event-oriented modeling, using native Java or C++. It also includes the Domain Modeling Language, which enables the user to build a domain-specific database of model objects, from which large scale models are easily described and built. We illustrate the approach, showing how to do mixed level direct execution driven studies of signal processing algorithms running on hypothetical high performance architectures.

End-to-End Bandwidth Measurement Using Iperf
Ajay Tirumala, National Laboratory for Applied Network Research (NLANR)

Iperf is a tool for end users that measures the maximal network bandwidth for TCP streams and the jitter and loss for UDP streams, while also allowing advanced users to modify various OS parameters and measure the achievable bandwidth. The tool uses a custom-developed "binary exponential increase/backoff" algorithm for determining the optimal maximum TCP window size. Since the bandwidth achieved also depends on the type of stream (e.g., multimedia, compressed, or uncompressed), users can create representative streams and measure the bandwidth achievable. Multicast bandwidths achievable within the infrastructure limits can also be measured using Iperf.

An Approach to Extreme-Scale Simulation of Novel Architectures
Kei Davis, Francis Alexander, Kathryn Berkbigler, Graham Booker, Brian Bush, Adolfy Hoisie, Donner Holten, and Steve Smith, Los Alamos National Laboratory; Thomas Caudell, Tim Eyring, and Kenneth Summers, University of New Mexico

Better hardware design and lower development costs involve performance evaluation, analysis, and modeling of parallel applications and architectures, and in particular predictive capability. We outline an approach to simulating computing architectures applicable to extreme-scale systems (thousands of processors) and to advanced, novel architectural configurations. Our component-based design allows for the seamless assembly of architectures from representations of workload, processor, network interface, switches, etc., with disparate and variable resolutions into an integrated simulation model. Our initial prototype, comprising low-fidelity models of workload and network, easily scales to many thousands of computational nodes in a fat-tree network.

Windows Performance Monitoring and Data Reduction using WatchTower and Argus
Michael W. Knop, Praveen K. Paritosh, and Peter A. Dinda, Northwestern University; Jennifer M. Schopf, Argonne National Laboratory

WatchTower is a system that simplifies the collection of Windows performance data, and Argus is a statistical methodology for evaluating this data and reducing its sheer volume. WatchTower's overheads are comparable to those of Microsoft's Perfmon tool, while it supports higher sampling rates and greater embeddability into other software. Argus can reduce the behavior of a large number of performance counters into a few composite counters, or it can select a subset of counters that are statistically interesting. We are currently integrating WatchTower and Argus into a single system, which we believe will be useful in a number of scenarios.

Multicasts for Faster Scientific Applications on Beowulf Clusters
Peter Tamblyn, Hal Levison, and Erik Asphaug, Southwest Research Institute & Binary Astronomy, LLC

We describe the use of reliable multicasts in existing message passing (MPI) programs to reduce collective communication bottlenecks. By broadcasting information through one point-to-multipoint message instead of a series of (tree-structured) point-to-point messages, communication delays can be reduced by factors up to log_2(Nodes), e.g., 5x for 32-node clusters. This reduction in the cost of collective communications broadens the class of problems appropriate for Beowulf-style clusters. We present raw communication and application speed-up results, and discuss applicability to Internet2. Further information and free software are available online.
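
The log_2(Nodes) bound can be checked with a short calculation. This is a minimal sketch under idealized assumptions that are ours, not the authors': a tree-structured broadcast needs ceil(log_2(Nodes)) sequential message steps, a multicast costs the same as one point-to-point step, and the overhead of making the multicast reliable is ignored. The function names are hypothetical.

```python
def tree_broadcast_steps(nodes):
    """Sequential message steps for a tree-structured broadcast.

    The number of informed nodes at most doubles each step, so reaching
    `nodes` nodes takes ceil(log2(nodes)) steps. For nodes >= 1 the
    expression (nodes - 1).bit_length() computes exactly that integer.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    return (nodes - 1).bit_length()


def ideal_multicast_speedup(nodes):
    """Upper bound on speedup when one multicast replaces the whole tree."""
    multicast_steps = 1  # assumption: one multicast costs one message step
    return tree_broadcast_steps(nodes) / multicast_steps


print(ideal_multicast_speedup(32))  # prints 5.0
```

Measured speedups would be expected to fall below this bound once acknowledgment and retransmission traffic is counted.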

General Framework for Wireless Smart Distributed Sensors
Rob Armstrong, Nina Berry, Ron Kyker, Carmen Pancerella, and Christine Yang, Sandia National Laboratories, Livermore; Katie Moor, University of Notre Dame; Alicia (Pippin) Wolfe, University of Massachusetts, Amherst; Eric Burns, Rutgers University; Brian Lambert, University of North Carolina at Charlotte; Stephen Elliot, Yale University; Tony Fan, RPI; Chris Kershaw, University of California, Santa Cruz; Hillary Davis, Sierra High School (Manteca, CA)

Many situations call for the use of sensors monitoring physiological and environmental data. In these situations, it is beneficial to have intelligent agents analyze large amounts of sensor data and a distributed wireless network to disseminate information from the sensors. Rather than write a custom software and hardware platform for each such situation, we create a generic framework that can be configured for specific applications. Our system interfaces sensors with small, lightweight personal processors that communicate wirelessly with other computers. Applications include monitoring of individuals' health, monitoring of the chemical composition of air and water, and support for teams working in emergency situations.

Scientific Computing with FPGAs
Allan E. Snavely, Matt Lyons, and Xin Zhao, UCSD

FPGA systems have traditionally been used for hardware design but not for production computing because FPGAs, while ideal for rapid hardware prototyping, are slower than specialized hardware. However, at a time when "specialized" RISC microprocessors often run at 5% of peak performance on scientific codes, and given the low cost of FPGAs (and the high cost of fabrication), it makes sense to examine the cost-performance benefits of scientific calculations using rapidly reconfigurable FPGAs. We do that here, investigating the feasibility, cost, and performance of the Star Bridge system on the NAS Parallel Benchmarks.

Performance of Benchmarks for Scientific Computing on Intel's Flagship Processors
David Lifka, Gerd Heber, and Veaceslav Zaloj, Cornell Theory Center

Which Intel architecture should I choose to build my next cluster? Our presentation attempts to answer this question for Intel's current generation of IA32- and IA64-based processors. The answer depends on the characteristics of the targeted applications and on which industry-standard benchmarks most resemble them. We will present the results of a variety of sequential and parallel benchmark suites run on new cluster configurations. We discuss where the currently available development and tuning tools help, and where they fall short, in optimizing research codes on these processors.

XPARE - eXPeriment Alerting and REporting
Alan K. Morris, J. Davison de St. Germain, and Steven G. Parker, University of Utah; Allen D. Malony and Sameer S. Shende, University of Oregon

The XPARE (eXPeriment Alerting and REporting) tools allow software teams developing large-scale parallel applications to accomplish two important goals. First, they enable a team to specify regression testing benchmarks for a given set of performance measures. These benchmarks are evaluated with each periodically scheduled testing trial. Second, throughout the course of development, XPARE provides a historical panorama of the evolution of performance as it tracks software versions. This includes not only changes in the code, but also changes in platform, choice of compiler, optimizations, and other performance factors.

Using Wavelet Analysis Feature Extraction as a Tool for Load Balancing
John R. Johnson and Leland M. Jameson, Lawrence Livermore National Laboratory

One of the challenges in constructing numerical schemes that are both adaptive and suitable for parallel architectures is maintaining a balanced load across the processing elements while using a method that is efficient and scalable. In this paper we use the feature extraction properties of wavelet analysis to construct an adaptive method that is efficient, scalable, and load balanced. This method offers significant speedup over lower-order adaptive schemes while maintaining uniform error across the domain.

Dynamically Replicated Storage for Genomic Alignment
Preethy Vaidyanathan and Tara Madhyastha, University of California, Santa Cruz; Terry Jones, Lawrence Livermore National Laboratory

We characterized the I/O behavior of a computational biology application on Linux clusters with different file systems and an IBM SP/2. This application played a vital role in the Human Genome Project. This study shows that locality is a very important factor affecting the performance of this application. We present the design of a user-level library for a new model of location-transparent storage to automatically redirect read accesses to the most appropriate location.

The Parallel Implementation of FCI Program in GAMESS
Zhengting Gan (a), Yuri Alexeev (a,b), Joe Ivanic (c), Mark S. Gordon (a,b), and Ricky A. Kendall (a,d); (a) Scalable Computing Laboratory, USDOE Ames Laboratory; (b) Department of Chemistry, Iowa State University; (c) Fundamental Interactions Program, USDOE Ames Laboratory; (d) Department of Computer Science, Iowa State University

In this poster we present our work on implementing a parallel full configuration-interaction (FCI) module for GAMESS on PC clusters. Both the distributed and the replicated data methodologies are used in the implementation. Parallelism is mainly achieved by splitting the work into parallel tasks and load balancing them. The effectiveness of the load balancing scheme in our algorithm is demonstrated by examples. Calculations also show that the collective operation associated with the replicated data approach is the major bottleneck on PC clusters for applications like FCI. A fully distributed data parallel FCI code is being developed.

Dynamic Mesh-Particle Partitioning of Parallel In-Element Particle Tracking Methods
Jing-Ru C. Cheng and Paul E. Plassmann, The Pennsylvania State University

In this poster we present new parallel algorithms and experimental results for computations based on particle tracking methods. Particle tracking methods are a versatile computational technique central to the simulation of a wide range of scientific applications, including visualization, molecular dynamics, direct simulation Monte Carlo methods, and Eulerian-Lagrangian methods. We introduce a common framework, the "in-element" particle tracking method, based on the assumption that particle trajectories are computed from problem data localized to individual elements. We present a dynamic load-balancing scheme to handle the dynamic nature of these calculations, along with experimental results detailing the performance of these methods.

SMiLE: An Integrated, Multi-Paradigm Infrastructure for High Performance Computing on SCI-Based Clusters
Martin Schulz, Carsten Trinitis, Jie Tao, and Wolfgang Karl, Technische Universität München

The SMiLE (Shared Memory in a LAN-like Environment) project at LRR-TUM offers a comprehensive multi-paradigm software infrastructure for SCI (Scalable Coherent Interface) based PC clusters. It directly exploits SCI's high-performance user-level communication features to efficiently implement a large number of both message passing and shared memory APIs suited for virtually all application domains. In addition, SMiLE provides a hybrid hardware/software performance tool infrastructure that allows users to optimize their applications. In summary, SMiLE provides a highly flexible and integrated environment for the efficient exploitation of this promising cluster architecture.

MP_Lite for M-VIA on Linux Clusters
Weiyi Chen, Dave Turner, and Ricky Kendall, Ames Laboratory/Iowa State University

MP_Lite is a lightweight message-passing library designed to deliver the maximum performance to applications in a portable and user-friendly manner. MP_Lite M-VIA combines the high efficiency of MP_Lite and the high performance of the Virtual Interface Architecture to provide a low-latency, high-throughput message-passing system for both Fast Ethernet and Gigabit Ethernet networks. The library can also channel-bond multiple network interface cards to increase the communication rate between nodes. Using 2-3 Fast Ethernet cards per machine can double or triple the maximum throughput without greatly increasing the cost of a PC cluster.

How to Achieve 1 GByte/sec I/O Throughput With Commodity IDE Disks
Jens Mache, Joshua Bower-Cooley, Jason Guchereau, Paul Thomas, and Matthew Wilkinson, Lewis & Clark College

Parallel I/O throughput of 1 Gigabyte/sec was first achieved on ASCI Red (with 18 hardware RAIDs costing $1,000,000). Our goal was to sustain similar performance on PC clusters with commodity IDE disks. We succeeded in improving I/O price/performance by over a factor of 100 by configuring the Parallel Virtual File System (PVFS) with 32 overlapped compute and I/O nodes, each having two IDE disks in a software RAID configuration (for $224). With appropriate file view and stripe size such that most disk accesses were local, we even measured up to 2007.2 MBytes/sec read throughput and 1698.9 MBytes/sec write throughput.
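
The "factor of 100" claim can be sanity-checked from the figures quoted in the abstract. The sketch below rests on two assumptions of ours that the abstract does not state: that $224 is the storage cost per node, and that only storage hardware is counted on both sides. The function name is hypothetical.

```python
ASCI_RED_STORAGE_USD = 1_000_000   # 18 hardware RAIDs, from the abstract
NODE_COUNT = 32                    # overlapped compute and I/O nodes
NODE_STORAGE_USD = 224             # assumption: cost of one node's two IDE disks


def price_performance_gain(reference_usd, cluster_usd,
                           reference_mb_s=1000.0, cluster_mb_s=1000.0):
    """Ratio of dollars per MByte/sec: reference system over cluster."""
    reference_cost_per_mb_s = reference_usd / reference_mb_s
    cluster_cost_per_mb_s = cluster_usd / cluster_mb_s
    return reference_cost_per_mb_s / cluster_cost_per_mb_s


cluster_storage_usd = NODE_COUNT * NODE_STORAGE_USD
gain_at_equal_throughput = price_performance_gain(
    ASCI_RED_STORAGE_USD, cluster_storage_usd)
print(cluster_storage_usd, round(gain_at_equal_throughput, 1))
```

Under these assumptions the cluster's disks cost $7,168 in total, and at equal 1 GByte/sec throughput the price/performance ratio comes to roughly 140, which is consistent with "over a factor of 100".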

Global Synchronization of Computing Systems
James Harden, Donna Reese, and Richard Barnes, Mississippi State University

This research investigates precision timestamping of events on a physically distributed computing system. Precise timestamps are necessary for database systems, communications, system performance analysis, and parallel application debugging. This research uses a re-radiation antenna system and the Global Positioning System (GPS) for global time synchronization. The GPS approach has better precision, by several orders of magnitude, than traditional software methods. The re-radiation system allows the GPS signals to be received even when line-of-sight to a satellite is unavailable. The approach taken in this research is to use a combination of low-cost hardware and software to correlate distributed events to a universal time.

Performance Analysis of a CFD Code in the TeraCluster System
Kum Won Cho, Jungwoo Hong, and Sangsan Lee, KISTI Supercomputing Center

The TeraCluster Project at the KISTI Supercomputing Center was initiated to explore the possibility of using PC clusters as a scientific computing platform to replace the Cray T3E system at KISTI by 2002. Since the actual performance of a computing system varies significantly across architectures, representative in-house codes from major application fields are tested to evaluate the actual performance of systems with different combinations of CPUs, networks, and topologies. Several CFD problems are simulated on a set of Linux clusters and evaluated against the Cray T3E.

A 4Gbps Long Distance File Sharing Facility for Scientific Data Processing
Ryutaro Kurusu, Mary Inaba, Junji Tamatsukuri, and Kei Hiraki, University of Tokyo; Hisashi Koga, Akira Jinzaki, and Yukichi Ikuta, Fujitsu Program Laboratories Ltd.

Data-Reservoir is a global file caching system that realizes very high-speed point-to-point transfer of huge data files to support data-intensive scientific research projects. The main features of the system are that (1) a single huge data transfer can run at about 90% of the network bandwidth, (2) users on the local network see no difference from ordinary file servers, and (3) the system scales to networks of at least 100 Gbps. The key ideas are the use of the low-level iSCSI protocol and a hierarchical disk striping technique. Our prototype consists of an NFS server, four disk servers (DELL Power Edge 1550, dual Pentium III 1 GHz CPUs, 1 GB memory, Linux kernel 2.4.4, 36 GB 10,000 rpm disks, and a Netgear gigabit Ethernet PCI card), and a gigabit Ethernet switch (Extreme 5i). Our preliminary performance measurements show scalability up to 4 Gbps.

ILIB_GMRES: An Auto-Tuning Parallel Iterative Solver for Linear Equations
Hisayasu Kuroda, Takahiro Katagiri, Makoto Kudoh, and Yasumasa Kanada, The University of Tokyo

High-performance parallel numerical libraries are in strong demand as parallel computers become more widespread. Many numerical libraries force users to set a large number of library parameters and problem-specific parameters. Our auto-tuning parallel iterative solver, ILIB_GMRES, optimizes the communication methods among processors in addition to the computation kernels. This auto-tuning is performed both at installation time and at runtime. We present the performance of this solver on the HITACHI SR2201, HITACHI SR8000, FUJITSU VPP800, NEC SX-5, Pentium III Cluster, SUN-e3500, SGI-2100, and COMPAQ-GS80. In one case, our solver was approximately four times as fast as the public-domain library PETSc.

Comparison of Subsampling Approaches for Remote, Interactive Exploration of Large Data Sets
Raymond Loy, University of Chicago; Lori Freitag, Argonne National Laboratory; Mukta Nandwani, Virginia Tech

Interactively exploring terabyte data sets is an extremely challenging task, particularly for scientists whose primary access to visualization resources is a desktop graphics workstation. One approach to solving this problem is data set subsampling. We present a parallel algorithm for multiresolution subsampling, and give performance models that compare this approach to uniform grid subsampling. We show the validation and use of the models to predict performance as parameters such as problem size and network bandwidth change. We present results comparing the two methods and include images from Rayleigh-Taylor, neutron star X-ray burst, and hairpin vortex simulations.

High-Performance MIMD Computation for Out-of-Core Volume Visualization
Huijuan Zhang and Timothy S. Newman, Department of Computer Science, University of Alabama in Huntsville

One popular volume data visualization technique is isosurface extraction and rendering. Previously, some parallel isosurface extraction approaches have been proposed, but few methods have focused on achieving high performance when data cannot reside in-core. This poster introduces a new approach to achieve well-balanced MIMD execution for out-of-core isosurface extraction. The approach includes a partitioning stage that improves memory performance by eliminating non-active data elements. A highly accurate work prediction stage keeps the load well-balanced. Experimental results on a cluster computer demonstrate that the approach balances the load more evenly than other existing approaches and the speedup appears linear.
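
The poster's own partitioning stage is not detailed in the abstract. As a hedged illustration of what "eliminating non-active data elements" usually means in isosurface extraction, the sketch below applies the standard criterion: a cell can contain part of the isosurface only if the isovalue lies between the minimum and maximum of its corner samples. The function names and the flat list-of-cells layout are assumptions made for illustration.

```python
def is_active(corner_values, isovalue):
    """True if the isosurface can pass through a cell with these corners."""
    return min(corner_values) <= isovalue <= max(corner_values)


def active_cell_indices(cells, isovalue):
    """Indices of the cells worth loading and processing for `isovalue`.

    `cells` is a sequence of per-cell corner-sample tuples; everything
    filtered out here never needs to be brought into core memory.
    """
    return [i for i, corners in enumerate(cells)
            if is_active(corners, isovalue)]
```

Counting active cells per partition is also one plausible input to a work predictor, since extraction cost grows with the number of active cells; whether the poster's prediction stage works this way is not stated.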

Managing Tile Size Variance in Serial Sparse Tiling
Michelle Mills Strout, Jeanne Ferrante, and Larry Carter, UC San Diego

In modern computer architectures with memory hierarchies, a program's data locality significantly affects performance. Serial sparse tiling improves the data locality of iterative computations on an irregular mesh M. It divides the iterative computation to be performed on M into roughly equal sized "tiles" that can be executed atomically. The tiles are grown from a seed partitioning of the mesh M, which logically resides at one of the iterations. We present an improved technique that, by growing tiles from a middle iteration, reduces the variance of the tiles' sizes.

Dynamic Right-Sizing: TCP Flow-Control Adaptation
Mike Fisk, Los Alamos National Laboratory and UCSD; Wu-chun Feng, Los Alamos National Laboratory and Ohio State University

Network bandwidth has kept pace with the widespread arrival of bandwidth-intensive applications such as streaming media and grid computing, but the TCP flow-control implementations in most operating systems make it difficult or impossible for applications to take advantage of high-bandwidth WANs. Dynamic Right-Sizing is an operating system technique for automatically tuning TCP to solve this problem. Compared to previous work, Dynamic Right-Sizing is more efficient and transparent and applies to a wider set of scenarios by simultaneously supporting network-bound senders, application-bound receivers, and both high- and low-bandwidth links.
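
The reason a fixed flow-control window limits throughput on a high-bandwidth WAN follows from the standard bandwidth-delay relationship: a sender can have at most one window of unacknowledged data in flight per round trip. The sketch below shows only that background arithmetic. It is not the Dynamic Right-Sizing algorithm, and the function names and example figures are illustrative assumptions.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput for a given window and round-trip time."""
    return window_bytes * 8 / rtt_seconds


def required_window_bytes(bandwidth_bps, rtt_seconds):
    """Window needed to keep a path full: the bandwidth-delay product."""
    return bandwidth_bps * rtt_seconds / 8


# Illustrative numbers only: a 64 KiB window over a 100 ms round trip.
limit = max_throughput_bps(64 * 1024, 0.100)
needed = required_window_bytes(1e9, 0.100)
print(limit, needed)
```

With a 64 KiB window and a 100 ms round trip the bound is roughly 5.2 Mbit/s regardless of link speed, while filling a 1 Gbit/s path at the same round-trip time needs about 12.5 MBytes of window. Closing that gap automatically is the kind of tuning the abstract describes.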

Impulse: A Smarter Main Memory Controller
John Carter, Wilson Hsieh, Sally McKee, Michael Parker, and Lixin Zhang, University of Utah

Memory bottlenecks keep modern processors from achieving near peak performance for many compute-intensive codes (e.g., data mining, image processing, and sparse matrix operations). The Impulse smart main memory controller (MMC) exports scatter-gather operations to software running on conventional processors, thereby giving them control over what, when, and where data is loaded into the processor cache(s). Detailed simulation indicates that the Impulse MMC can improve the performance of applications with poor memory locality by 2-5X. More information on Impulse can be found online.

Transport Level Protocols: Performance Evaluation for Bulk Data Transfers
Matei Ripeanu, The University of Chicago

Before developing new protocols targeted at bulk transfer, the achievable performance and limitations of the broadly used TCP protocol should be carefully investigated. Our first goal is to explore TCP's bulk transfer throughput as a function of network path properties, number of concurrent flows, loss rates, competing traffic, etc. We use analytical models, simulations, and real-world experiments. Our second goal is to repeat this evaluation for some of TCP's replacement candidates: NETBLT and SCTP. This should allow an informed decision on whether to invest effort in developing or using new protocols specialized for bulk transfers.

Parallelization of the Effective Fragment Method for Solvation
Heather M. Netzloff and Mark S. Gordon, Ames Laboratory/Iowa State University

In order to accurately model the condensed phase, a large number of molecules are required. Since pure ab initio quantum chemistry calculations quickly become too computationally expensive, the Effective Fragment Potential (EFP) method for solvation has been developed. In the method, the system is divided into an ab initio region containing the "solute" and an "effective fragment" region containing "solvent" molecules. This research considers the fragment-fragment interaction energy calculation and its parallelization within the GAMESS program. Results show that reasonable speedup is achieved for electrostatic and exchange repulsion energy routines with a variety of sizes of water clusters and number of processors.

Architecture of a Real-Time Trigger Compute Farm Running at 1.17 MHz
A. Walsch, V. Lindenstruth, and M.W. Schulz, University of Heidelberg

We present the architecture of the Level-1 trigger of the future LHCb experiment at CERN. The system, a network farm of about 400 CPUs, has an input rate of 1.17 MHz and performs pattern recognition on an input data stream of 4 GByte/sec. We show the performance results of our prototypes, which are based on a two-dimensional SCI network. We demonstrate that we are capable of sending data blocks of less than 200 bytes at a rate of more than 1 MHz. Additionally, we present a scheduling network to guarantee flow control in our system.
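
The figures in the abstract imply a per-event data size and a per-CPU time budget that are easy to work out. This is a back-of-the-envelope sketch under assumptions of ours: a GByte is taken as 10**9 bytes, "about 400" CPUs is taken as exactly 400, and events are assumed to be spread evenly with one CPU handling each event. The function names are hypothetical.

```python
EVENT_RATE_HZ = 1.17e6       # input rate from the abstract
INPUT_BYTES_PER_S = 4e9      # 4 GByte/sec, assuming 10**9 bytes per GByte
CPU_COUNT = 400              # "about 400" in the abstract


def average_event_bytes(bytes_per_s=INPUT_BYTES_PER_S,
                        event_rate_hz=EVENT_RATE_HZ):
    """Mean amount of input data belonging to one event."""
    return bytes_per_s / event_rate_hz


def per_cpu_time_budget_s(cpu_count=CPU_COUNT, event_rate_hz=EVENT_RATE_HZ):
    """Mean processing time available per event on one CPU.

    If events are spread evenly over the farm, each CPU receives a new
    event every cpu_count / event_rate_hz seconds.
    """
    return cpu_count / event_rate_hz


print(average_event_bytes(), per_cpu_time_budget_s())
```

Under these assumptions an average event carries about 3.4 KBytes and each CPU has roughly 340 microseconds per event.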

Allreduce Performance and Performance Characteristics on Compaq and IBM Cluster Architectures
Patrick H. Worley, Oak Ridge National Laboratory

The allreduce collective operation is an important component of many parallel scientific application codes. We examine the performance of a number of different implementations of allreduce on the Compaq AlphaServer SC and the IBM SP, including the MPI_Allreduce supplied with the native MPI library. We describe how performance varies as a function of vector length and processor count, how important it is to choose the optimal allreduce implementation, and how the optimal implementation varies with how the allreduce is used. We also describe differences between average and best observed performance, and comment on practical implications of this difference.
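
The abstract does not list which allreduce implementations were compared. As a hedged illustration of one well-known variant, the sketch below simulates recursive doubling in a single process: in each of log_2(P) rounds, rank r combines its partial result with that of rank r XOR distance. Scalars stand in for the vectors a real code would reduce, the rank count must be a power of two, and the operator is assumed to be commutative and associative. The function name is hypothetical and no MPI library is used.

```python
import operator


def allreduce_recursive_doubling(values, op=operator.add):
    """Simulate an allreduce across len(values) ranks.

    values[r] is rank r's contribution. The returned list holds the
    result seen by every rank, so all entries end up equal.
    """
    ranks = len(values)
    if ranks == 0 or ranks & (ranks - 1):
        raise ValueError("rank count must be a power of two")
    current = list(values)
    distance = 1
    while distance < ranks:
        # Every rank exchanges with its partner and combines both partials.
        current = [op(current[r], current[r ^ distance])
                   for r in range(ranks)]
        distance *= 2
    return current


print(allreduce_recursive_doubling([1, 2, 3, 4]))  # prints [10, 10, 10, 10]
```

Recursive doubling sends the full vector in every round, which tends to suit short vectors; that trade-off is one reason the best implementation can depend on vector length and processor count, as the abstract reports.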

A New Systolic Array for Symmetrizing a Hessenberg Matrix
B.S. Satish Kumar and G. Harish, Dr. Ambedkar Institute of Technology, Bangalore University

Nonsymmetric matrix problems, which are computationally more expensive than symmetric matrix problems, arise in many signal-processing applications. A symmetrizer is useful for converting a nonsymmetric eigenvalue problem into an equivalent symmetric one, which has the same eigenvalues and is relatively easy to solve. Real-time computation of such problems performs poorly on today's von Neumann machines. In this poster we present a systolic array to compute a symmetrizer of a lower Hessenberg matrix that decreases time complexity and uses fewer processing elements than previous methods, viz. the Leiserson, Double Pipe, and Fitted Diagonal methods.

V2001: High-Resolution Interactive Remote Visualization Hardware
Lyndon G. Pierson, Perry J. Robertson, Ron R. Olsberg, and Karl Gass, Sandia National Laboratories

High-resolution visualizations rendered on supercomputers are difficult to view at remote locations while maintaining high framerate and the low delay required for interactivity. Since video cables are short, this transfer of visual data is hard to accomplish even over local area distances. The V2001 solves this problem with high-speed, low latency compression hardware that interfaces to generic network interfaces and RGB/DVI video adapters. An adaptive compression algorithm combined with framerate reduction hardware provides interactivity of sufficient spatial and temporal resolution to allow for design and simulation tasks to utilize supercomputing resources while being performed remotely.

Distributed Shared Memory with Home Proxy Cache on RHiNET
Masaaki Ishii and Hironori Nakajo, Tokyo University of Agriculture and Technology; Toshiyuki Shimizu, Junji Yamamoto, and Tomohiro Kudoh, Real World Computing Partnership; Jun'ichiro Tsuchiya and Hideharu Amano, Keio University

Since distributed shared memory is allocated in the main memory of a workstation or PC, a page of shared memory is transferred to the cache of another node through the I/O bus that lies between the main memory and the network interface. We have proposed a Home Proxy Cache, which keeps copies of pages in home memory in order to avoid page transfer latency when other nodes access the home memory. In this poster, the distributed shared-memory system with Home Proxy Cache, which has been implemented in the RHiNET network interface, will be presented.

FPMPI: A Fine-tuning Performance Profiling Library For MPI
William Gropp and David Gunter, Argonne National Laboratory; Valerie Taylor, Northwestern University

FPMPI is a wrapper library for the standard set of MPI functions that has been instrumented to gather performance information about the execution behavior of MPI programs. Its purpose is to aid systems managers and applications developers alike in identifying performance bottlenecks and to provide clues for optimizing an application or hardware configuration. It is simple to use, requiring only a relinking of existing MPI code for basic data gathering capability. The level of detail is controllable by directives specified at compile time. A companion visualization tool, FPMPIview, is in development at NCSA to analyze the data produced by the profiling library.
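
The wrapper idea can be illustrated with a minimal Python sketch. It is only an analogy and does not show FPMPI's actual code: the `send` function is a hypothetical stand-in for a communication call, and the `profile` table and `profiled` decorator are invented for illustration. In real MPI programs, the standard's profiling interface (the `PMPI_` entry points) allows this kind of interposition at link time.

```python
import functools
import time
from collections import defaultdict

# Per-function call counts and accumulated wall-clock time.
profile = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def profiled(func):
    """Wrap func so each call is counted and timed, then forwarded unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = profile[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@profiled
def send(buffer, dest):
    """Hypothetical stand-in for a communication call; returns the byte count."""
    return len(buffer)


for dest in range(3):
    send(b"payload", dest)

print(profile["send"]["calls"], "calls recorded for send")
```

The caller's code is unchanged apart from the wrapping step, which mirrors the abstract's point that basic data gathering needs only a relink.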

Using Machine Descriptors to Select Parallelization Models and Strategies on Hierarchical Systems
Mark Yankelevsky, Walden Ko, Dimitrios S. Nikolopoulos, and Constantine Polychronopoulos, CSRD/University of Illinois

Clusters present the programmer with a complex hierarchy of hardware components, exploiting different levels of parallelism. The optimal parallelization strategy depends on several parameters, such as the number of nodes, processors per node, memory and communication bandwidth, and the overhead of orchestrating parallelism. A compiler using a detailed machine descriptor and static performance analysis can automate the selection of the best strategy. Experiments with the NAS benchmarks (parallelized using a combination of MPI and OpenMP) revealed performance patterns that drive the selection. Results and derived algorithms are presented in the poster and incorporated into the machine description of the PROMIS (HTTP:// compiler.

A Case For Proactive Directory Services
Fabian E. Bustamante, Patrick Widener and Karsten Schwan, College of Computing, Georgia Institute of Technology

Common to computational grids and pervasive computing is the need for efficient and scalable directory services that provide information about objects in the environment. We argue that an active interface to directory services can improve scalability and provide useful service enhancements to potential applications. Specifically, the Proactive Directory Service (PDS) developed in our work supports a customizable interface through which clients can register for notification about changes to objects of interest to them. Moreover, the level of detail and granularity of these notifications can be dynamically tuned by clients through filter functions instantiated at the server or at object owners.
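
The registration-plus-filter pattern described above might look like the following Python sketch. It is not the PDS interface: the `Directory` class, its `register` and `update` methods, and the `node42.load` object name are all hypothetical, chosen only to show a client-supplied filter being evaluated on the directory side.

```python
class Directory:
    """Toy directory that pushes change notifications to registered clients."""

    def __init__(self):
        self._objects = {}
        self._subscriptions = []  # (object name, filter function, callback)

    def register(self, name, callback, accept=lambda old, new: True):
        """Request notification when `name` changes and accept(old, new) is true."""
        self._subscriptions.append((name, accept, callback))

    def update(self, name, value):
        """Store a new value and notify clients whose filter accepts the change."""
        old = self._objects.get(name)
        self._objects[name] = value
        for sub_name, accept, callback in self._subscriptions:
            if sub_name == name and accept(old, value):
                callback(name, old, value)


directory = Directory()
events = []

# Report only load changes of at least 0.5; the filter runs inside the directory.
directory.register(
    "node42.load",
    lambda name, old, new: events.append((name, new)),
    accept=lambda old, new: old is None or abs(new - old) >= 0.5,
)

directory.update("node42.load", 1.0)  # first value: notified
directory.update("node42.load", 1.2)  # small change: filtered out
directory.update("node42.load", 2.0)  # large change: notified
print(events)
```

Evaluating the filter where the data lives means uninteresting changes never reach the client, which is one way an active interface could reduce traffic compared with polling.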

A Large-Scale Biologically-Realistic Cortical Simulator
E. Courtenay Wilson and Frederick C. Harris, Jr., CS Dept., Univ. of Nevada, and Philip H. Goodman, School of Medicine, Univ. of Nevada

The object-oriented design of this simulator provides flexibility in the scale and modularity of the system, as well as in modeling the relationships between neurons in a given network. We incorporate laboratory-determined synaptic and membrane parameters into a large-scale simulation, thus modeling realistic cortical modules. Currently we have constructed a cluster with 60 processors, 120 Gigabytes of RAM, and a Myrinet-2 interconnect network to support the parallel implementation of this simulator. Results show biological accuracy in synaptic and membrane dynamics, and suggest that computational models of this scope can produce realistic spike encoding of human speech.

An Experimental Study of Adaptive Application Sensitive Partitioning Strategies for SAMR Applications
Sumir Chandra, Johan Steensland, and Manish Parashar, Rutgers, The State University of New Jersey

While parallel/distributed implementations of structured adaptive mesh refinement (SAMR) techniques offer the potential for realistic simulations of complex phenomena, these implementations also present significant challenges in dynamic data distribution and load balancing. This is because the choice of the "best" partitioning strategy depends on the nature of the application and its run-time state. This poster presents an experimental study of an adaptive application-sensitive meta-partitioner for SAMR applications that dynamically selects and configures partitioning strategies at run-time based on system parameters and current application state. The experimental results show that adaptive partitioning can significantly improve application performance.

The NCAR Spectral Toolkit: A High Performance Spectral Transform Framework
Rodney James, NCAR

The first release of the NCAR Spectral Toolkit is presented, which includes new real and complex FFTs and generic functions to support the development of multithreaded and distributed spectral transform applications. New FFT algorithms are described that maximize data locality and register reuse to obtain high performance on superscalar RISC and EPIC microprocessors. Performance results from a fluid turbulence application built with the Spectral Toolkit framework are discussed.