SOURCE: International Science Grid This Week
At an informal SC08 discussion titled "Traditional & Distributed HPC–What has changed? What remains the same?" participants — both users and developers — shared experiences and insights on the advancement of grid-based High Performance Computing. Led by Gabrielle Allen and Daniel S. Katz of the LSU Center for Computation & Technology, and Gary Crane, SURA Director of IT Initiatives (including SURAgrid), the group identified several areas in which it believes grid technology needs to advance in order to deliver cost-effective service.
Management of distributed data
Many research domains require data from diverse sources, and the output of one application is frequently used as input to another. Data definitions that cross application boundaries are rare, making sharing between applications difficult or impossible. Participants noted the progress in relevant standards development by the Common Component Architecture group. They also cited caBIG as an exemplary data-driven infrastructure and recognized optical networks as a positive underpinning for the effective management and use of distributed data.
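To make the idea concrete, here is a minimal sketch in Python of what a cross-application data definition might look like: a single record format that, say, a storm-surge model could write and a visualization tool could read without custom translation on either side. The field names, units and example values are invented for illustration; they are not part of any standard discussed at the session.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical shared record: because field names and units are fixed by
# the definition, one application can write it and another can read it
# without per-pair translation code.
@dataclass
class SurgeForecastPoint:
    station_id: str        # observing-station identifier
    timestamp_utc: str     # ISO 8601, e.g. "2008-11-20T06:00:00Z"
    water_level_m: float   # forecast water level in metres
    model_name: str        # which model produced the value

def to_wire(point: SurgeForecastPoint) -> str:
    """Serialize to JSON, the common format both applications agree on."""
    return json.dumps(asdict(point))

def from_wire(payload: str) -> SurgeForecastPoint:
    """Parse JSON produced by any application that follows the definition."""
    return SurgeForecastPoint(**json.loads(payload))

if __name__ == "__main__":
    record = SurgeForecastPoint("8761724", "2008-11-20T06:00:00Z", 1.42, "ADCIRC")
    print(from_wire(to_wire(record)))
```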
Standards for abstraction of middleware layers
Abstraction refers to hiding the workings of the middleware layers from users so that the grid is easy to use. Standards are needed to support both automatic discovery of resources for job execution and intelligent data transport. Along with easier, even automatic, job submission, these standards are critical for wider adoption. The group believes that the grid needs coordinated direction for the development of sophisticated distributed file systems and schedulers that can manage the dynamism of multiple jobs, users, resources and administrative domains. Virtual machine technology can provide some of this through dynamic, load-based resource sharing.
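As a purely hypothetical sketch of what such an abstraction layer might offer, the Python fragment below hides resource discovery and scheduler choice behind a single submit call. The resource names, queue-wait estimates and interface are invented for illustration and are not drawn from any existing grid middleware.

```python
import random
from dataclasses import dataclass, field
from typing import List

# Hypothetical abstraction layer: the user describes a job once; the
# layer discovers suitable resources and picks one, so the middleware
# behind each site (batch system, data mover) stays invisible.

@dataclass
class Resource:
    name: str
    free_cores: int
    queue_wait_min: float   # estimated queue wait in minutes

@dataclass
class JobSpec:
    executable: str
    cores: int
    input_files: List[str] = field(default_factory=list)

def discover_resources() -> List[Resource]:
    """Stand-in for an information-service query (e.g. a registry lookup)."""
    return [
        Resource("site-a.example.org", free_cores=128, queue_wait_min=5.0),
        Resource("site-b.example.org", free_cores=32, queue_wait_min=0.5),
    ]

def submit(job: JobSpec) -> str:
    """Pick a resource with enough cores and the shortest estimated wait."""
    candidates = [r for r in discover_resources() if r.free_cores >= job.cores]
    if not candidates:
        raise RuntimeError("no resource can run this job right now")
    best = min(candidates, key=lambda r: r.queue_wait_min)
    # A real layer would stage job.input_files and hand the job to the
    # chosen site's local scheduler here.
    return f"{best.name}:{random.randint(1000, 9999)}"

if __name__ == "__main__":
    print(submit(JobSpec(executable="surge_model", cores=64)))
```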
[Image] The SURA Coastal Ocean Observing and Prediction (SCOOP) Program is integrating distributed data and computation to improve storm surge forecasting and model visualizations on www.OpenIOOS.org. Image courtesy of Joanne Bintz, SURA
Measuring cost-effectiveness
Researchers weigh the costs of using distributed versus local resources. They'll tailor their application to use remote resources when the lower wait and execution times compensate for the extra work. However, the decreasing cost of HPC hardware is making local resources easier to acquire (more local resources mean less wait time), and MPI jobs (jobs whose processes must communicate with each other during execution) still work best locally.
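A back-of-the-envelope comparison makes the trade-off concrete; all of the numbers below are invented for illustration, and porting effort is simplified to an hour figure.

```python
# Illustrative break-even check: remote resources win only if the shorter
# queue wait and comparable execution time outweigh the extra cost of
# porting the code and moving data.

def turnaround_hours(queue_wait, execution, data_transfer=0.0, porting=0.0):
    """Total time from 'ready to run' to 'results in hand'."""
    return queue_wait + execution + data_transfer + porting

local = turnaround_hours(queue_wait=12.0, execution=6.0)
remote = turnaround_hours(queue_wait=0.5, execution=7.0,
                          data_transfer=1.5, porting=4.0)

print(f"local:  {local:.1f} h")   # 18.0 h
print(f"remote: {remote:.1f} h")  # 13.0 h
print("remote pays off" if remote < local else "stay local")
```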
In traditional (non-distributed) HPC, it's easy to measure cost-effectiveness using system performance metrics. Comparable metrics for distributed computing don't exist yet, and funding is needed to develop them.
Industry still hasn't found the economic driver for distributed computing, and is not investing in it. This may be changing, given recent developments from Google, Amazon, IBM and Microsoft. However, other funding is needed until there is more widespread support, particularly for scientific computing.
—M.F. Yafchak, SURA