"Practical Foundations for Transparent Interoperation"
Prof. Alan Kaplan
Clemson University
Department of Computer Science
Clemson, SC 29634-1906 USA
E-mail: kaplan@cs.clemson.edu
URL: http://www.cs.clemson.edu/~kapla
Prof. Jack C. Wileden
University of Massachusetts
Department of Computer Science
Amherst, MA 01003-4610 USA
E-mail: wileden@cs.umass.edu
URL: http://www-ccsl.cs.umass.edu/~jack
Scientific software systems, such as computer-aided design/computer-aided modeling applications, geographic information systems, earth observation systems, and bioinformatic applications, are increasingly focused on the exchange and integration of distributed information. Bioengineers, geneticists, biologists and other scientists are creating, using and managing ever larger amounts of ever more complex data. Moreover, rather than relying on the data produced by an individual application, scientists are becoming increasingly dependent on data originating from multiple sources. Such data may be produced by applications developed by a single scientist or by various scientists in a particular laboratory or, more likely given the proliferation of the World Wide Web, acquired from other scientists. Similarly, independently developed tools may produce individual data sets that need to be integrated in order to perform a required analysis. Legacy data, i.e., data produced by applications that are no longer accessible or maintainable, is another source of information often required by scientists.
Developers and users of such data, however, are faced with a difficult tradeoff in the design and construction of software that creates and uses this data. They need to be able to model data so that it can be easily understood and efficiently used by an individual scientific application; at the same time they need the data to be in a form such that it can be used, integrated, and shared by other scientific applications engaged in integration of distributed data, even though the data might be described and defined using various formats, notations, models and/or languages.
Despite these problems, computer science has provided little foundational support to developers of scientific software systems. Instead, scientific applications traditionally have overcome such problems by resorting to relatively low-level techniques, such as flat files, standard data interchange formats, IDL-mediated data exchange mechanisms or ad hoc wrapping of legacy data repositories. Such approaches have various shortcomings. For example, standardized interchange formats typically require explicit translations, which are often inefficient and prone to error. IDL-mediated mechanisms, such as CORBA and DCOM, impose foreign data models that represent a least common denominator type system [KRW97]. Furthermore, once a developer has committed to a particular IDL-mediated mechanism (e.g., CORBA, DCOM), changing to a different mechanism is extremely difficult and very costly. Another serious drawback of such approaches is that their use generally requires developers and users to be aware of the boundaries between various data repositories and the applications that need to access them. As a result, software based on these approaches that needs to access and manipulate scientific data is difficult to develop and maintain.
Our research is directed toward developing computer science foundations for transparent interoperation, exploring both theoretical and practical aspects of this problem domain. The primary objective of our work is to hide the boundaries or seams between heterogeneous data repositories, or between data repositories and the applications that need to access these repositories. Developing appropriate theoretical and practical foundations for transparent interoperability results in software and data that are easier to develop, reuse, share and maintain.
A companion paper [WK98] appearing in this workshop outlines some formal models of type compatibility, type safety, and name management. In this paper, we give an overview of the practical aspects of our work, specifically the development of a new, highly transparent approach to interoperation, called PolySPIN. A collection of prototype, automated tools supporting the use of the PolySPIN approach, as well as our experience with their application, is also described.
Based on our formal foundations, we have been designing, developing, and experimenting with tools that facilitate interoperability. The PolySPIN approach provides a transparent interoperability mechanism for programming languages. More specifically, it provides support for polylingual interoperability [KW96] where applications can access compatible types defined in distinct languages as if they were defined in the language of the application. The fact that the types are defined and implemented in a different programming language is hidden from the application. Related to this mechanism is PolySPINner, a collection of tools that automates PolySPIN and supports type-safe polylingual interoperability [BKW96,K96].
Although approaches such as standard file formats, relational databases, and IDLs (e.g., CORBA, OLE/DCOM) support certain aspects of polylingual interoperability, our approach offers several advantages over such mechanisms:
To help illustrate these advantages, we describe how the PolySPINner toolset can be used to create a Java genome sequence application that accesses and manipulates genome sequences defined using both the Java and C++ type systems. The toolset takes as input the class definitions (i.e., interface and implementation) for the C++ and Java types. In our scenario, these are the Java and C++ type definitions for genome sequences. The types may match exactly or, more likely, share an intersection of features that is relevant to the new Java genome application. In either case, PolySPINner first parses the type definitions for each language and then determines whether the types are compatible (where compatibility can be specified as discussed in [WK98]). If the types are deemed compatible, PolySPINner automatically re-engineers the implementations of all relevant operations according to the PolySPIN framework. Each re-engineered operation includes code that checks which language actually implements the object at hand and then invokes the corresponding operation implementation in that language.
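The effect of such re-engineering can be illustrated with a minimal Java sketch. It is not PolySPINner output: the names GenomeSequence, ImplLanguage, SimulatedCppSequences and GenomeApp are hypothetical, and the C++ side is simulated by an ordinary Java class so that the example runs on its own. A real bridge would cross the language boundary instead. The sketch only shows the shape of an operation that first checks the implementing language and then dispatches accordingly.

```java
import java.util.HashMap;
import java.util.Map;

// Application code: written purely against the Java type GenomeSequence.
final class GenomeApp {
    // Counts G and C bases without knowing which language implements the sequence.
    static int gcCount(GenomeSequence seq) {
        int count = 0;
        for (int i = 0; i < seq.length(); i++) {
            char base = seq.baseAt(i);
            if (base == 'G' || base == 'C') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        GenomeSequence javaBacked = GenomeSequence.inJava("GATTACA");
        long handle = SimulatedCppSequences.create("CCGGTA");
        GenomeSequence cppBacked = GenomeSequence.wrappingCpp(handle);
        System.out.println(gcCount(javaBacked)); // prints 2
        System.out.println(gcCount(cppBacked));  // prints 4
    }
}

// Hypothetical tag recording which language actually implements an object.
enum ImplLanguage { JAVA, CPP }

// The Java type after a hypothetical re-engineering step: every operation first
// checks the implementing language, then calls the matching implementation.
final class GenomeSequence {
    private final ImplLanguage impl;
    private final String bases;   // meaningful only when impl == JAVA
    private final long cppHandle; // meaningful only when impl == CPP

    private GenomeSequence(ImplLanguage impl, String bases, long cppHandle) {
        this.impl = impl;
        this.bases = bases;
        this.cppHandle = cppHandle;
    }

    static GenomeSequence inJava(String bases) {
        return new GenomeSequence(ImplLanguage.JAVA, bases, 0L);
    }

    static GenomeSequence wrappingCpp(long cppHandle) {
        return new GenomeSequence(ImplLanguage.CPP, null, cppHandle);
    }

    int length() {
        if (impl == ImplLanguage.CPP) {
            return SimulatedCppSequences.length(cppHandle); // foreign implementation
        }
        return bases.length(); // original Java implementation
    }

    char baseAt(int index) {
        if (impl == ImplLanguage.CPP) {
            return SimulatedCppSequences.baseAt(cppHandle, index);
        }
        return bases.charAt(index);
    }
}

// Stand-in for the C++ side (assumption): a real bridge would cross the language
// boundary, for example with native methods; a handle-keyed map simulates it here.
final class SimulatedCppSequences {
    private static final Map<Long, String> STORE = new HashMap<>();
    private static long nextHandle = 1;

    static long create(String bases) {
        long handle = nextHandle++;
        STORE.put(handle, bases);
        return handle;
    }

    static int length(long handle) {
        return STORE.get(handle).length();
    }

    static char baseAt(long handle, int index) {
        return STORE.get(handle).charAt(index);
    }
}
```

In this sketch the application method gcCount never mentions the implementing language; the check is confined to the operations of GenomeSequence, which is where the seam is hidden.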
At first blush, PolySPIN and the PolySPINner toolset appear to provide functionality similar to that of popular distributed object technologies such as CORBA and OLE/DCOM. The primary advantages of our approach over these contemporary approaches are enumerated above. Although the details are beyond the scope of this position paper, our approach also hides the underlying interoperability technology. This means that IDL-mediated mechanisms (such as CORBA and OLE/DCOM), as well as non-IDL mechanisms such as Remote Method Invocation and the Java Native Interface, can potentially be made transparent. Thus, a developer can change the underlying interoperability mechanism with minimal impact on existing applications.
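One way to picture a hidden interoperability technology is a small Java sketch in which the application depends only on an abstract mechanism. Everything here is an assumption made for illustration: the names InteropMechanism, DirectCallMechanism, MarshallingMechanism, SequenceHandle and MechanismDemo are hypothetical, and the two mechanisms merely simulate an in-process call and a marshalled call. Neither uses CORBA, OLE/DCOM, Remote Method Invocation or the Java Native Interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Application code: depends only on SequenceHandle, never on a concrete mechanism.
final class MechanismDemo {
    static int totalLength(SequenceHandle... sequences) {
        int total = 0;
        for (SequenceHandle s : sequences) {
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> foreignData = new HashMap<>();
        foreignData.put("seq1", "GATTACA");
        foreignData.put("seq2", "CCGGTA");
        InteropMechanism[] mechanisms = {
            new DirectCallMechanism(foreignData),
            new MarshallingMechanism(foreignData)
        };
        // Swapping the mechanism leaves totalLength and its result unchanged.
        for (InteropMechanism m : mechanisms) {
            int total = totalLength(new SequenceHandle("seq1", m), new SequenceHandle("seq2", m));
            System.out.println(m.name() + ": " + total);
        }
    }
}

// Hypothetical abstraction over whatever technology reaches the foreign data.
interface InteropMechanism {
    String name();
    String fetchBases(String sequenceId);
}

// Simulates a direct, in-process call to the foreign implementation.
final class DirectCallMechanism implements InteropMechanism {
    private final Map<String, String> foreignData;

    DirectCallMechanism(Map<String, String> foreignData) {
        this.foreignData = foreignData;
    }

    public String name() { return "direct"; }

    public String fetchBases(String sequenceId) {
        return foreignData.get(sequenceId);
    }
}

// Simulates a mechanism that marshals data to bytes and back again.
final class MarshallingMechanism implements InteropMechanism {
    private final Map<String, String> foreignData;

    MarshallingMechanism(Map<String, String> foreignData) {
        this.foreignData = foreignData;
    }

    public String name() { return "marshalling"; }

    public String fetchBases(String sequenceId) {
        byte[] wire = foreignData.get(sequenceId).getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.UTF_8);
    }
}

// The type the application sees; the mechanism is an injected detail.
final class SequenceHandle {
    private final String id;
    private final InteropMechanism mechanism;

    SequenceHandle(String id, InteropMechanism mechanism) {
        this.id = id;
        this.mechanism = mechanism;
    }

    int length() {
        return mechanism.fetchBases(id).length();
    }
}
```

Under these assumptions, changing the mechanism is a one-line change where the handles are constructed, and totalLength is untouched.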
In summary, our position is that interoperability should be transparent in distributed, complex software systems. Defining practical foundations for transparent interoperability (as opposed to using ad hoc and/or cumbersome approaches) permits developers of scientific applications to focus on the problem domain rather than on the underlying interoperability mechanism. We claim that this results in scientific applications and data that are easier and less costly to design, build, maintain and share.