``What is Special About Scientific Data?''
Prof. Peter Buneman
University of Pennsylvania
Philadelphia, PA 19119
E-mail: peter@cis.upenn.edu
URL: http://www.cis.upenn.edu/~peter
Scientific databases have a history as long as that of database technology itself. It is therefore surprising to find relatively little penetration of database technology into the field of scientific databases. Moreover, there appears to be very little technology that is common to scientific databases in general. Why is this?
To answer this question one has to look at how scientific data differs from the ``business'' or ``administrative'' data upon which database technology has had a considerable impact. Most scientific data sets follow a data/metadata paradigm. The data is typically an image, a time series, a nucleic acid sequence, etc. -- the result of some experiment or survey. The metadata typically consists of ``administrative'' data, such as where and when the image was recorded and perhaps some data on the experimental conditions. While this administrative data can usually be represented cleanly in a relational database, the data itself cannot. Traditional relational systems cannot deal with the storage requirements for these special data types, nor can they accommodate the associated operations in the database query language.
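The data/metadata split can be illustrated with a minimal sketch. The table name, column names, and sample values are illustrative assumptions and are not taken from any real scientific database. The metadata fits a relational table and is queried with ordinary SQL, whereas the sequence is stored as an opaque value and the domain operation on it (here a reverse complement) has to be carried out in application code, outside the query language.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiment ("
        " id INTEGER PRIMARY KEY,"
        " recorded_at TEXT,"   # metadata: when the data was recorded
        " location TEXT,"      # metadata: where the data was recorded
        " sequence BLOB)"      # the data itself, opaque to plain SQL
    )
    rows = [
        (1, "day-1", "Lab A", b"ACGTTGCA"),
        (2, "day-2", "Lab B", b"TTGACCGT"),
    ]
    conn.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?)", rows)
    return conn


def ids_recorded_at(conn, location):
    """Metadata query: expressed entirely in SQL."""
    cur = conn.execute(
        "SELECT id FROM experiment WHERE location = ? ORDER BY id",
        (location,),
    )
    return [row[0] for row in cur]


def reverse_complement(seq: bytes) -> bytes:
    """Domain operation on the data; plain SQL offers no equivalent."""
    return seq.translate(bytes.maketrans(b"ACGT", b"TGCA"))[::-1]


def ids_with_reverse_complement(conn, target: bytes):
    """Data query: every sequence is fetched and tested outside the database."""
    cur = conn.execute("SELECT id, sequence FROM experiment ORDER BY id")
    return [rid for rid, seq in cur if reverse_complement(seq) == target]
```

In this sketch the second query cannot use the database's indexing or optimization at all, which is one way of seeing why a simple indexing system may appear sufficient.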
The picture has changed somewhat with the development of object-oriented and object-relational systems, and there has been some application of this database technology to scientific data [3,4,5]. However, for systems that have a ten-year history and have effectively overcome most of the limitations of relational databases, the impact has not been great.
There are, I believe, other reasons why database technology has not penetrated scientific computing. If the metadata is relatively simple and there is no need for concurrency control (scientific data sets are typically archival), then the added functionality of a database management system, when compared with a simple indexing system, may not be worth the effort. But there are also two more fundamental reasons, which I see at least in molecular biology databases.
First, the metadata in these databases is more than ``administrative'' data. It contains annotations that individuals have added in the process of analyzing and correcting the data. These annotations may have a complex structure, and that structure may change. Even in an object-oriented database management system, that structure may be difficult to capture, and managing changes to it may be very difficult.
Second, which database management system is used, if any, may not be the problem. The problem is integration. A ``database of databases'' [6] lists some 400 databases of general interest to researchers, and this figure is growing by some 30 databases a year. Each database is heavily annotated or ``curated'' and typically deals with some aspect of research. One finds databases centered around diseases, organisms, or some genetic control mechanism. Because of the volatility of structure mentioned earlier, there is very little hope of producing an all-embracing database. But any given line of research will typically require access to several of these databases. Database technology is only now being developed to solve the integration problem.
At Penn we have been developing database integration technology that is robust with respect to structural changes and does not require the data to be in conventional databases. It can deal with the formats in which the data is currently held, typically some kind of structured text. It is based upon a new approach to database transformation that can handle the complex data types found in scientific databases [1], and we have had some success in applying this in bio-informatics [2]. The software is generic in the sense that, once we have built an interface for the formatting convention or database interface used by a source, no further work is needed for any other data source that conforms to that interface.
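The notion of genericity described here can be sketched as follows. This is not the Penn software; it is a minimal illustration under stated assumptions. The two input formats (a tagged text format whose records end with a `//` line, and a tab-separated format), the reader names, and the `READERS` registry are all hypothetical. Each reader is written once per formatting convention and maps its input to the same loosely structured record, a dictionary from field names to lists of values, so that repeated or missing fields need no schema change.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, List[str]]
Reader = Callable[[str], List[Record]]


def read_tagged_text(text: str) -> List[Record]:
    """Parse 'KEY value' lines; a line containing only '//' ends a record."""
    records: List[Record] = []
    current: Record = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        key, _, value = line.partition(" ")
        # Repeated keys accumulate, so annotations can grow freely.
        current.setdefault(key, []).append(value.strip())
    if current:
        records.append(current)
    return records


def read_tab_separated(text: str) -> List[Record]:
    """Parse a header row followed by tab-separated value rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("\t")
    return [
        {key: [value] for key, value in zip(header, ln.split("\t"))}
        for ln in lines[1:]
    ]


# One entry per formatting convention, not per data source (hypothetical names).
READERS: Dict[str, Reader] = {
    "tagged": read_tagged_text,
    "tsv": read_tab_separated,
}


def integrate(sources: Iterable[Tuple[str, str]]) -> List[Record]:
    """Combine (format, text) sources into one list of uniform records."""
    merged: List[Record] = []
    for fmt, text in sources:
        merged.extend(READERS[fmt](text))
    return merged
```

A new data source that uses an existing convention is handled by passing it to `integrate` with the matching format name; only a genuinely new convention requires a new reader.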
The issue of genericity brings me to the observation that each scientific discipline appears to be developing its own standards and its own formats. While it is true that the data (as described above) may require some special representation, there is no obvious need to develop special-purpose formats for metadata. There are several generic formats and data exchange protocols, such as ASN.1, XML, and CORBA, that can readily be adapted to handle data as well as metadata. Coming to some consensus on interchange protocols and formats will simplify integration problems and will, perhaps, enable more interdisciplinary scientific research.
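As a small illustration of using a generic format for metadata, the sketch below writes a metadata record to XML and reads it back with Python's standard `xml.etree.ElementTree` module. The element names and sample values are assumptions made for the example and do not follow any discipline's actual standard; the sketch also assumes that every key is a valid XML element name.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(meta: dict) -> str:
    """Serialize a flat metadata record; list values become repeated elements."""
    root = ET.Element("metadata")
    for key, value in meta.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, key).text = str(item)
    return ET.tostring(root, encoding="unicode")


def xml_to_metadata(xml_text: str) -> dict:
    """Parse the XML back into a mapping from element name to list of values."""
    root = ET.fromstring(xml_text)
    meta: dict = {}
    for child in root:
        meta.setdefault(child.tag, []).append(child.text or "")
    return meta
```

Because repeated elements are allowed, an extra annotation can be added to a record without changing the format, and any XML-aware tool can read the result without knowing the discipline that produced it.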