Workshop on Software Tools for HPC Systems
Technical Infrastructure for a National Software Tools Center
Working Group Members
Introduction
Objectives and Center Inputs
Products of the Center
Project Selection
Center Characteristics
Center Resources
Center Staff Skills Mix


Working Group Members
  Don Austin     NCO
  Rod Oldehoeft   DOE
  Dan Reed   Illinois
  Thomas Sterling (Chair)     Caltech
  John Toole   NCO
  Bob Voigt   NSF



Introduction

This working group addressed the technical infrastructure issues associated with forming a National Software Tools Center. Several working assumptions underlie the observations and recommendations of this report:

Advances in parallel architectures have not been matched by the needed improvements in software capabilities. As a result, applications development usually proceeds via "heroic" efforts instead of normal development practices generally followed for software on sequential, vector, or smaller-scale parallel machines.
In spite of the small niche occupied by truly high-performance computers, the applications uniquely possible on these systems are extremely important for advancing science, developing industrial competitiveness, and supporting defense needs. Hence the software difficulties in this niche must be addressed as a national priority.
New approaches to high-performance system and tool software are being developed in the research communities at universities and national laboratories. Often these proofs-of-concept show exciting potential for improving the current situation. However, these communities have primary responsibilities to their applications, or to software research as opposed to production. The resulting software is often characterized as rudimentary, brittle, poorly documented, and isolated from other software. As a result, these tools are not widely adopted and, worse, are often re-invented at other sites and for new projects.
A National Software Tools Center will be useful in providing the means for selected experimental codes to be transformed into usable, robust software tools. The initial codes will be derived from research groups around the country. Multiple stages of maturity will be specified and target codes identified for transition across successive stages. As understanding of the potential value of each maturing tool grows through use of early releases, additional resources will be applied to continue the advance, possibly to the point of full commercialization. The nature and structure of such a center have yet to be established.

This report documents the findings and recommendations from working group deliberations. The following sections provide discussions about center objectives, the products from the center, criteria for how projects may be selected, characteristics of the center, resources required, and the recommended skills mix for center staff.



Objectives and Center Inputs

The purpose of the proposed National Software Tools Center (referred to from here on as "the center") is to dramatically enhance the state and utility of high-performance computing through increased availability of essential and advanced software tools. The term "tool" is applied here in the broadest sense and refers to any software that extends the capability, usability, and efficiency of high-performance computing systems in the development and performance of end-user applications. Such tools include but are not limited to compilers, run time systems, operating system components, debuggers, performance profilers, and tools for data management, scientific visualization, communication, fault management, and software integration. The global vision behind the center is to provide the necessary infrastructure to select appropriate software outputs from research projects and carry them forward to a form suitable for use by the general HPC community. This includes the possibility of creating software products not being addressed by any research groups. The center is not necessarily to be a single location, but may engage talents at diverse geographical locations and across administrative domains.
 
Activities at the proposed center will be driven by several inputs:

A major driver will be the early proof-of-concept codes from experimental projects in software tools research. These may come from any cooperating research organization including universities, national laboratories, other not-for-profit research institutes, and even computer vendors. However, no source may impose proprietary considerations that would limit availability of any center results to the general HPC community.
A second, somewhat different source is prototypes of tools assembled by users whose primary goal was the application, but whose ad hoc efforts produced inchoate tools. This is an example of a broader input source: requests, explicit or implicit, from the user community for tools that satisfy recognized needs. This class of input may be source code or precise functional specifications.
Finally, the center staff will also identify needs in-house that require development of new tools. Here, only a specification instead of an initial test code is available at the outset of the resulting project.



Products of the Center

The end products of the center will undoubtedly take on many forms, resulting from the varieties of inputs, and the functionality desired. Therefore, a scale of intermediate products will be supported by the center. While there is no intent to constrain the product types or the degrees of robustness the center's products may have, there are some major identifiable categories that typify the kinds of tools likely to be produced. These can be distinguished by completeness or assumed reliability.
 
1. Early Evaluations
 
An initial but important product of the center is the early evaluation. The result is not usable code, but instead a detailed critique of the merits of the concept, approach, implications, and implementation methods exhibited in an early design or prototype. This is a formal process that makes available to another group an objective and detailed assessment of a new project, and its potential for addressing key problems within the HPC software tools arena. These evaluations will be made available to the research groups involved to assist in guiding and influencing direction at an early stage. Among other contributions, the center can alert researchers to other work in the specific field they are pursuing and compare the intended contribution with other efforts. This particular class of product from the center enables the center to assess the potential merits of some future collaboration with the target research group and project.
 
2. Improved Prototypes
 
Perhaps the single most important product of the center will be the improved prototypes of research codes and tools. Indeed, this was the original idea that sparked the genesis of the center and would alone justify its creation. The intent of the improved prototypes is to bring the potential functionality of experimental research codes to a high enough level of reliability that a select "friendly" community can use and evaluate them. The improved prototypes, while not bullet-proof, will be expected to work as intended under most operational circumstances. The center will modify and augment the original research tool until it reaches center standards of quality for prototype release. Documentation for installation, interface, and use will be an important element of the improved prototype. In addition, test cases will be provided to end users to determine correct operation after installation. The center will provide support, tracking and fixing bugs as reported in the field. The center will also establish, for each improved prototype, a user evaluation and reporting database to collect assessments from users. This information will determine future efforts towards further improvement, as well as continued support. Many levels of "improvement" will be possible by the center, thus providing flexibility in taking on new projects and meeting the needs of the community. This will permit better use of center resources, allow more experimental tools to be engaged, and reduce the level of effort needed to move a given prototype tool to the next stage of development. It will be the responsibility of the center to establish a framework for defining these levels of improvement and procedures for managing tool development through these successive stages.
 
3. Reference Implementations
 
A few research software tools will prove to be of high enough quality and value that their widest possible distribution will be imperative. Commercial implementation by major system vendors or independent software vendors might be appropriate and the ultimate goal. To support the commercialization as well as the early availability of such codes, a high level of robustness, specification, and documentation will be achieved in a center product to provide a reference implementation. A reference implementation is a self-defining specification of functionality and interface as well as a fully operational tool. Users of reference implementations of tools from the center can expect them to be of high enough quality to be used on a production basis and can install them among their main software tools. Once a tool has reached the level of reference implementation, additional changes to functionality will be rare and will be reflected by controlled version numbers; this retains uniformity across different vendor implementations and manages user expectations.
 
4. Conventions and Standards
 
A major challenge for the software tools community has been establishing the common characteristics that enable portability, interoperability, generality, and functional uniformity, including the user interface. The unfortunate alternative is a collection of isolated and unrelated tools unable to exploit the capabilities of others or, in ensemble, function as a higher-level complex system. To support better code reuse and to enable tools that exploit other tools' capabilities, a set of interface standards will be devised by the center to specify conventions for interplay of tools. Tools crafted to comply with such standards will be more easily integrated into a powerful and evolving ensemble. New tools will be fabricated more quickly because developers may reuse existing and accessible functionality. Developers will realize a larger immediate user base as the community more readily adopts compliant tools. Such a set of conventions will expose gaps and opportunities for future advancement of capabilities in much the same way that the Periodic Table, once incomplete, exposed plausible but undiscovered chemical elements. The center, out of necessity, will develop conventions for interoperability of the tools built in-house from research codes. De facto standards will sometimes be established as guidelines to future development. These will be shared with the HPC community and, where appropriate, used by the community in general.
 
5. Education
 
Even without specifying its characteristics, the center will clearly be well positioned to play a role in education. It will surely provide instructional mechanisms and materials related to the software with which it is involved. Beyond that, it may contribute material related to the education of future computational scientists and users of high-performance computing systems and tools. Possible forms are varied: tutorials for the use of HPC tools; curricular elements developed with educational institutions for preparing future computational scientists; books and other documents focused on this narrow but important field. Defining the exact role for the center to play in education is a task for future inquiry.



Project Selection

An important aspect of center operations is the actual selection of specific projects to be undertaken, driven by the many opportunities provided by the research communities, and the needs of the HPC user community. If successful, the center will have positive influence over the evolution of HPC software tools, based on which projects it selects to advance by applying its resources. This significant responsibility demands selection processes and criteria that both represent and support the research and user communities. Criteria will unavoidably conflict, and difficult choices for expending limited center resources are inevitable.
 
Among many possibilities, several criteria are identified for discussion here.
 
1. Strategic fit with center objectives
 
The objective of the center is to put the best tools of greatest importance into the hands of users as rapidly as possible. Underlying any possible selection process is a basic center strategy or model of what composes an effective software tool set. This is in turn driven by a conception of the requirements and state of the art. Proposed projects which most closely fit this model are more likely to be sponsored and actively pursued.
 
2. Potential impact
 
Within this conceptual framework, the potential short term and long term impact will be assessed. A major driver of the selection process will be those factors that are expected to deliver the greatest ultimate value to the research and user communities.
 
3. Innovation
 
Innovative concepts and approaches are critical for rapid advance in this emerging field. The more radical or advanced the approach, the more likely it is to contribute to establishing new paradigms for managing HPC system resources. However, innovation must be tempered with practical considerations of utility and compatibility. Nevertheless, the benefits of novel tools may outweigh the inconvenience and disruption to conventional but less productive user practices. The most valuable tools will be those which provide new functionality that fills recognized needs, but which complement and interoperate with the community's existing base of tools.
 
4. High quality of incoming software
 
The level of effort required will be a strong function of the quality of the original research code to be enhanced. Where good software engineering practices were used in the development of the initial experimental tool software, the likelihood of success, with lower center investment, is enhanced. Conversely, a large morass of undocumented spaghetti code would require much more center effort, and so is less likely to be selected. Level-of-effort concerns also favor small, modular projects instead of grandiose "we've solved the whole problem" projects. In part, the conciseness of the project objective is likely to be reflected in well-crafted codes instead of huge software dreadnoughts.
 
5. Maturity of prototypes
 
Of course, level of effort will also be sensitive to the degree of the work already achieved by the originating research group. The more advanced the effort, the easier it will be for the center to transform it into a robust prototype. Further, the confidence in capabilities and potential impact for an advanced project is enhanced, which can make selection more likely. More mature codes will have had more extensive use and evaluation, the results of which will influence the review and selection process. Codes capable of strong and rich demonstrations will be favored over early breadboard codes that have been exercised in limited ways.
 
Naturally, there must be a balance here. The selection process must not have the unintended effect of forcing research groups to do the job that the center was established to do in the first place. But where choices must be made, those projects most likely to lead to success for the HPC user community can be expected to be favored. That includes the confidence in the project and the level of effort it requires, both of which will be influenced by the degree of completeness of the original research tool code.
 
6. Potential for fruitful interaction with producer groups
 
It will not be practical for center staff to be fully versed in every detail of the initial research code. Its quality must in part be surmised from knowledge of the originating research group and its past accomplishments and products. This may appear to favor the well-established and better-funded research groups, which is certainly not the desired outcome. It is, however, reasonable to favor projects from groups with strong and productive track records. As graduate students and postdoctoral researchers from successful groups diffuse throughout the broader community, their reputation will follow them to raise the standards across the domain. A long-term consequence of this necessary bias is that the standards of code quality will rise, which is not a bad thing.
 
These criteria have differing weights when applied to the several classes of center products.
 
The objective of Early Evaluation projects is to accelerate the advance of innovative ideas in constructive directions. The focus will be on inchoate projects which may have less experimental code to demonstrate but novel yet solid concepts to present. The low level of effort required by the center to assess the merits and provide constructive recommendations means that the selection criteria will be more heavily weighted toward potential importance as well as quality of documentation than toward other criteria. It also provides an early look at a particular project at a time when direction may be strongly influenced. Any project that has gone through this process is more likely to be selected at a future time as a target to improve the early prototype by the center, as it will more likely reflect the center's basic model.
 
Projects selected for developing Improved Prototypes will be subject to additional considerations, including the continued participation of the producers. A close relationship between the developers and the center is essential for successful technology transfer to the center. The complexities of incompletely documented, highly experimental code will make code understanding dependent on tight collaboration with the producing group. If such a relationship is not feasible, the project will be less attractive as a target for prototype improvement.
 
Another factor in the decision will be the kind and degree of improvement necessary to bring the code to the next stage of utility. This, combined with its potential impact related to functionality, will determine how quickly a new useful tool can be brought to the community.
 
Reference Implementation projects will be selected according to the criteria presented above, but include additional factors associated with the likelihood of vendor participation. The value of a reference implementation is in a de facto standard of functionality, so that vendor implementations will achieve identical interface, interoperability, and equivalent behavior. Robustness, completeness, internal consistency and specification documentation are critical characteristics of a center project intended to serve this role. The significance of the responsibility implied by these requirements dictates that a high level of effort will be required by the center. Few such projects are anticipated, and certainly no more than two of them are expected to be engaged concurrently by the center. Therefore, selection will be stringent and depend on a high probability of success. A key component of that will be the participation of the vendor community in its evaluation and endorsement of the end product. One or more vendors will be required, a priori, to show strong willingness to consider internal advanced development and product distribution if the project is to be undertaken by the center. This will mean that the contribution to be made by releasing the reference implementation is clear and compelling. Such evidence will come from use by parts of the community of earlier advanced prototypes of the research code previously developed and released by the center. Additional issues of ownership and reference version control must also be resolved before project initiation.
 
Standards and Conventions are of a different nature than the other center product types. These are frameworks or conceptual infrastructure that enable software tools communities and their products to work together and to provide a necessary level of stability to the end user community. Selection of efforts to establish such standards or conventions will be derived from perceived need both within the center and by the community. They will emerge, sometimes unintentionally, as a natural consequence of just trying to get the center's jobs done.
 
Selection will involve a mix of contributors. The primary and final decisions will be made by center management. An external advisory board will be established from representatives of the sponsoring agencies and the HPC community. This continuing body will advise on all selections made, especially in establishing priorities and tradeoff criteria. Vendor representatives will be consulted for selection of projects to develop reference implementations or standards. The selection decision process will involve two stages: merit for selection, and priority for resources. The first stage judges the candidate project on its own intellectual and functional merits. The second stage determines its competitiveness relative to other potential projects and finite center resources.
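
The two-stage decision process described above can be illustrated with a small sketch. The report does not define any scoring scheme, so everything here is an assumption made for illustration: the `Candidate` fields, the numeric merit and priority scores, the merit threshold, and the staff-year budget are all hypothetical. Stage one filters candidates on their own merit; stage two ranks the survivors against one another and funds them in priority order until the finite resources run out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    merit: float     # hypothetical 0-1 score from the stage-one merit review
    priority: float  # hypothetical 0-1 score from the stage-two ranking
    effort: float    # hypothetical estimated cost in staff-years


def select_projects(candidates, merit_threshold, budget):
    """Two-stage selection: a merit gate, then priority ranking under a budget."""
    # Stage 1: judge each candidate on its own merits, independent of the others.
    meritorious = [c for c in candidates if c.merit >= merit_threshold]

    # Stage 2: compare the survivors with each other and with finite resources,
    # funding in descending priority order while effort fits the remaining budget.
    selected = []
    remaining = budget
    for c in sorted(meritorious, key=lambda c: c.priority, reverse=True):
        if c.effort <= remaining:
            selected.append(c)
            remaining -= c.effort
    return selected


# Illustrative, made-up candidates.
candidates = [
    Candidate("debugger", merit=0.9, priority=0.8, effort=3.0),
    Candidate("profiler", merit=0.7, priority=0.9, effort=2.0),
    Candidate("runtime", merit=0.4, priority=1.0, effort=1.0),
    Candidate("visualizer", merit=0.8, priority=0.5, effort=2.0),
]
chosen = [c.name for c in select_projects(candidates, merit_threshold=0.6, budget=5.0)]
print(chosen)  # ['profiler', 'debugger']
```

In this made-up example the "runtime" candidate has the highest priority but fails the merit gate, and the "visualizer" passes on merit but loses out because the budget is exhausted. In practice the center management and advisory board would apply judgment rather than fixed numeric scores.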



Center Characteristics

A close relationship with the HPC tool developing and using communities is crucial to the center's success in advancing the state of software tools and high-performance computing. Access to external talents in both domains is critical to extending the effective capabilities of the center beyond those encompassed by the immediate center R&D staff. Hence the center should be co-located at a national HPC host site such as a DOE national lab, an NSF supercomputer center, or a NASA center with strong computational programs. Another reason for establishing the center in such a context is the availability of several large computing systems, and the presence of an independently maintained infrastructure. Thus immediate and easy access to end users, expertise, and resources strongly supports hosting the center at an established HPC institution.
 
While the center will be co-located at an HPC host site, it must be independent of the hosting institution with regard to management authority and mission direction. The center must not be perceived to be owned or unduly influenced by the host. Rather, the relationship can exploit the potential synergism through mutual exchange of ideas, talents, and resources. The center will reimburse the host site for use of its facilities. Participation in center activities by host site personnel will be arranged on a case-by-case basis and will most likely be unfunded collaborations.
 
While a single centrally located monolithic organization is one possible form of the center, several considerations lead to an alternate form being adopted. Access to a diversity of HPC platforms is less likely to be achieved at a single site than at a collection of separate sites. For example, the NSF Supercomputer Centers collectively provide access to several types of machines. Computational centers often tend to focus on the specific classes of application most closely associated with the missions of their sponsoring agencies. For example, the DOE national labs and the NASA centers involved in high-performance computing are engaged in distinct applications, although many of the underlying algorithmic principles overlap. Finally, the best talents in system software are found among several organizations, not in one place. For these reasons, the center will take on less of the form of a "Center of Excellence" in favor of a structure more like a "Circle of Excellence" by distributing the center organization across several geographical sites. It is proposed that the center comprise three or four distinct but strongly coordinated sub-centers, each located at a separate location and host site. This will give access to a diversity of resources, talents, and user requirements, as well as help better focus on the distinct missions of the multiple sponsoring federal agencies.
 
Of the three or four sub-centers, one will take on the additional responsibilities of administrative and coordinating duties as well as operating as interface to the external sponsoring agencies. However, all sub-centers will engage in the technical process of selection and execution of center projects. Each sub-center will be managed by a center deputy director with a center director having general responsibility for the ensemble. Projects will usually be allocated to a specific sub-center instead of attempting to distribute the workload across sub-centers at a finer granularity. Proximity of co-workers leads to rapid progress and serendipitous discovery. Matching of project to sub-center will be determined by several factors including workload, relevant talents and resources, and possible proximity to the producing research group. Other issues may come into play as well on a case-by-case basis.



Center Resources

The focus of the work of the center is the development, enhancement, and testing of innovative software tools for high-performance computing, so this perforce involves use of high-performance computers. Because the tools under development will often be dangerously shaky or interface with low level mechanisms buried deep within operating system kernels or device drivers, direct and full access to, as well as control of, HPC systems will be required. At the same time, access to large systems will be essential to verify correct operation at scale and to determine scaling properties. While generous funding by sponsoring agencies is anticipated for center functions, it would be impossible and inappropriate for funds to be expended on one of every kind of HPC machine in its largest possible configuration.
 
This conflict of needs and realities will be resolved by acquiring a mix of small program development systems and placing them at the sub-center sites for exclusive use by center technical personnel. These development systems will not be expected to provide a robust and uninterrupted user environment, but can be the target of disruptive low level system software and tools development efforts. It is expected that each of the major vendor platforms will be represented by one of the sub-center development systems. No two sub-centers are expected to have the same class of development system, and therefore a means of distributed sharing across the center is essential. The mechanisms and system administration arrangements that make this both possible and easy to use must be established by center management.
 
The need for access to large configurations of HPC platforms will be satisfied by the host sites. Each such site will be selected in part for its large high-performance computing facilities. The locations of the sub-centers will be chosen to maximize the diversity and size of the host systems available. Although the small development systems will incur most of the down time resulting from the experimental development and testing cycle, there will be periodic requirements for the entire host site system to be made available to center project teams. Support by the host site of such intrusive activities will be an important criterion in site selection for sub-centers.
 
In addition to the small development HPC platforms, the center and its constituent sub-centers will own several other computing resources to enable their missions. A heterogeneous collection of workstations will be procured and updated over time. These workstations are necessary both to support the day-to-day computing requirements of the personnel and to provide test platforms for code under development. Many software tools engage both HPC systems and user workstations, sometimes in complex ways. Occasionally, software tools may interact with proprietary products of specific workstation vendors. Graphical user interfaces (GUI) to HPC software tools are usually executed on user workstations. Also, an important class of high-performance computing systems is "clustered computing" using ensembles of loosely coupled workstation-class systems. Thus an important and integral element of the sub-center environment will be its rich collection of workstations.
 
Other important support resources include a high-bandwidth network, file servers, backing store (tertiary storage), printers, and an Internet connection. To some degree, the sub-centers may be customers of the host site resources to partly satisfy these requirements. In other cases, the sub-center is likely to own the resources it uses. These decisions will be made on a case-by-case basis. For example, it is anticipated that the sub-centers will have large data storage requirements for handling data sets resulting from software experiments and other aspects of center operation. This requirement may be best satisfied by center ownership.
 
Besides substantial and diverse hardware resources, center objectives require a substantial base of software resources as well. Commercially available software development tools will be an essential component of the software base both for direct use by developers and as targets for interoperability of experimental tools under development. Source code for target machine operating systems and compilers will be critical for providing direct access to low level functionality in the support environment. Mechanisms for accessing protected resources, such as hardware counters, are essential for development of certain types of tools.
 
Finally, each sub-center must be independent in meeting its daily operational requirements. This implies the need for full environmental support for managing paper work, organizing meetings, presenting material, and providing the usual personnel support functions. Administrative and secretarial resources must be provided in sufficient quality and quantity that technical staff are not distracted from their principal occupations.



Center Staff Skills Mix

Management of the center will be limited to essential functions for directing the processes of the center, administrative and logistical support, and interfacing with the HPC community and host sites. Overall center management will be provided by the Center Director, who will be primarily responsible for coordinating with the Center Advisory/Steering Committee, establishing direction and procedures, and maintaining relations with sponsoring agencies. The Center Director will also make all final decisions about new project starts and continuation of on-going projects. However, these decisions will be made in consultation with the review process of the Center Advisory/Steering Committee and based on recommendations of the center technical staff. Every sub-center will be under the direction of its Deputy Director, who will be responsible for the operation of the sub-center, relationship with the host site, and the progress and quality of the projects. The Deputy Director will be supported by the Chief Administrator and the Chief Scientist of the sub-center. The Chief Administrator will manage the administrative support staff and all budgets. The Chief Scientist will oversee all technical projects of the sub-center as well as conduct specific projects.
 
The principal function of the sub-center will be the development and enhancement of HPC software tools; this will be carried out by the sub-center technical staff. At least 50% of all personnel will be technical staff. A mix of expertise and capabilities will be represented by the permanent technical staff. Such backgrounds will include operating systems, compilers, GUIs, scientific visualization, evaluation and instrumentation, computational science, parallel application programming, and modern software development practices. Projects will be conducted by teams of members of the technical staff. Each project will be directed by a team leader who will be dedicated to that task. However, while most members of the team will be focused on the single task as well, individuals with specific talents critical to more than one project may be shared, workload permitting. It is paramount that all members of technical staff be well versed in modern software practices. Training in this area may be required and provided for new hires.
 
Staff will be necessary to provide important support services. This goes well beyond the typical secretarial support ordinarily found in any organization. Because of the importance of the complex computing facilities accessible from the sub-center, substantial systems administration personnel will be located at every sub-center. These people will be challenged by the conflicting needs of providing robust capabilities while making systems available for risky experiments likely to cause individual systems to fail while under test. The center will be responsible to the user community for the software tools it releases. A permanent and well-staffed user help desk will be supported by each sub-center to maintain the software tools and provide advice to users. The user help desk will interface to the user community for managing all bug reports and providing rapid response when possible. These staff will constantly be learning new tools as they are developed and will work with the technical staff as software tools are being readied for release as advanced prototypes. Documentation is essential for conveying functionality, interface requirements, and means of use of new software tools. The sub-centers will include permanent technical writers on staff who will work closely with code developers to provide the necessary documentation to the user community. The center, although not every sub-center, will engage the services of legal counsel on a continuing basis to deal with issues of ownership and liability related to experimental software tools.
 
A significant number of people at a sub-center at any particular time will be visitors, for several reasons related to the objectives of the center. Users from other institutions will visit to convey needs and to assess the merits of emerging software tools. Original developers from other groups whose codes have been selected for center projects will be on-site to help in technology transfer both for explaining the details of their code and in receiving critique related to the merits of their codes. Consultants with specific expertise necessary for a given project will be housed at the sub-center. Representatives from industry and vendors will be on-site to work with tools developers, especially in the case of reference implementations. Students, postdoctoral researchers, and faculty on sabbatical will be important visitors to enrich and diversify the interests and capabilities of the center community.
