``Frameworks for Distributed Query and Search in Scientific Digital Libraries''
Dr. Jakka Sairamesh
IBM T. J. Watson Research
Hawthorne, NY, 10532
E-mail: jramesh@watson.ibm.com
URL: http://www.cs.columbia.edu/~jakka/home.html

Prof. Christos N. Nikolaou
University of Crete and ICS-FORTH
Leoforos Knossou, Heraklion, Crete 71409, Greece
E-mail: nikolau@ics.forth.gr
URL: http://www.ics.forth.gr/~nikolau/nikolau.html

We present an architecture for storing, managing, and presenting geographic and coastal information of various kinds to a variety of users. The primary goal is to provide transparent access to information stored in spatially distributed databases. We first present a scenario and then the architecture, which is based on OMG CORBA services and object frameworks for access control and repository services. This work grew out of the work described in [1,2].

Introduction

Suppose we have an extensive database of the physical, chemical, and biological properties of a coastal region. This database includes the bathymetry of the region and various properties of the water column, such as currents, wave spectra, wind spectra, salinity, temperature, and chemical and biological concentrations [1,2].

A typical query of the database might be phrased as follows: ``Find the region of 3D space, within the given coastal region and time interval, in which the concentration of a certain chemical or microorganism may exceed a certain value.'' Scientists may be interested in questions of this form in order to better understand the physical and chemical processes in a coastal region. Local civil authorities may be interested in issuing permits for fishing or in declaring certain coastal zones health risks, inappropriate for tourism, swimming, etc.

More generally, interrogation of the database can be thought of as a query of the type ``find a subset of a given set containing points with a specified property.'' From a set-theoretic point of view, this operation computes the intersection of two sets, each characterized by a logical property; both properties must hold simultaneously for every element of the result. This Boolean intersection is one of the most common database interrogations and may require time-consuming searches in very large databases. Here the sets involved are not amorphous clouds of discrete records but rather connected, smoothly shaped high-dimensional objects that represent certain multivariate functions.
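As a minimal sketch of this intersection query, assume the database is reduced to a small in-memory list of discrete samples, each carrying a position, a time, and a concentration. The names `Sample`, `in_region_and_interval`, `exceeds`, and `threshold_query`, the units, and the sample values are all hypothetical; a real store would hold continuous multivariate fields rather than a handful of points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    x: float
    y: float
    z: float     # depth (unit assumed: metres)
    t: float     # time (unit assumed: hours)
    conc: float  # concentration of the chemical or microorganism


def in_region_and_interval(s, box, interval):
    """First property: the sample lies in the 3D box and the time interval."""
    (x0, x1), (y0, y1), (z0, z1) = box
    t0, t1 = interval
    return (x0 <= s.x <= x1 and y0 <= s.y <= y1
            and z0 <= s.z <= z1 and t0 <= s.t <= t1)


def exceeds(s, threshold):
    """Second property: the concentration is above the threshold."""
    return s.conc > threshold


def threshold_query(samples, box, interval, threshold):
    """Intersect the set satisfying each property."""
    set_a = {s for s in samples if in_region_and_interval(s, box, interval)}
    set_b = {s for s in samples if exceeds(s, threshold)}
    return set_a & set_b


# Made-up data for illustration only.
SAMPLES = [
    Sample(1.0, 1.0, 5.0, 0.0, 0.2),   # in the box, below the threshold
    Sample(2.0, 1.5, 10.0, 1.0, 0.9),  # satisfies both properties
    Sample(9.0, 9.0, 5.0, 1.0, 1.5),   # outside the box
    Sample(2.5, 2.0, 8.0, 7.0, 1.2),   # outside the time interval
]
BOX = ((0.0, 5.0), (0.0, 5.0), (0.0, 20.0))
hits = threshold_query(SAMPLES, BOX, (0.0, 2.0), 0.5)
```

Building both sets and intersecting them mirrors the set-theoretic description; a practical system would more likely apply both predicates in a single pass or push them down to an index.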

Scientists who want to apply existing programs (algorithms or numerical techniques) to fresh data points collected by the sensors, while also consulting previous research papers and possibly annotations, could pose very complex queries. From the user's view, the information system must provide transparent, integrated access to the existing programs or numerical techniques, databases, and documents. This could mean searching for existing programs (which are indexed by keywords) and applying them to new data, which could be located elsewhere.
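The idea of finding a keyword-indexed program and applying it to new data can be sketched as follows. This is an illustration under stated assumptions, not the system's interface: `register`, `find_programs`, the program names, and the salinity readings are all hypothetical, and the "programs" are ordinary Python functions standing in for numerical codes.

```python
from statistics import mean

PROGRAMS = {}  # program name -> callable
INDEX = {}     # keyword -> set of program names


def register(name, keywords, func):
    """Add a program to the registry and index it under each keyword."""
    PROGRAMS[name] = func
    for kw in keywords:
        INDEX.setdefault(kw, set()).add(name)


def find_programs(*keywords):
    """Return the sorted names of programs indexed under all given keywords."""
    sets = [INDEX.get(kw, set()) for kw in keywords]
    return sorted(set.intersection(*sets)) if sets else []


register("mean_value", {"statistics", "average"}, mean)
register("peak_value", {"statistics", "maximum"}, max)

# Made-up fresh readings; in the scenario these could live at another site.
fresh_salinity = [38.1, 38.4, 39.0]
names = find_programs("statistics", "maximum")
result = PROGRAMS[names[0]](fresh_salinity)
```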

Architecture

The Web provides a simple way to represent and access documents, but in a large distributed system whose components must work together to answer the kinds of queries mentioned above, issues of naming, access control, metadata, and repositories naturally emerge. Though technologies for each of these exist independently, an integrated solution for indexing, querying, and presenting information objects, customized to the various users of the system, is still in its early stages [2].

Several efforts [1,2,3] are underway to address these issues for large spatially distributed database systems. With the rapidly emerging Java-enabled technologies, access to various legacy databases and legacy systems is becoming feasible via the Web. In addition, the Web is steadily embracing object-oriented frameworks such as OMG CORBA for better services and flexibility. Leveraging these new technologies, the basic components of our architecture are the following:

  1. Metadata and indexing/searching services for GIS information such as areas/regions of maps selected by the users [2,3]. There are a few standards for metadata definitions, such as the FGDC standard for digital geospatial data, developed by the US Federal Geographic Data Committee. The FGDC standard provides definitions for a few fields, along with their relations within a hierarchical structure.
  2. Naming and repository services (servers). We use naming services from OMG for locating and interacting with the information objects. The name services are essential for efficient retrieval of objects. We envision agents that help users search for objects by using the metadata, name, and index services. We plan to use OMG CORBA repository interface frameworks to access data objects located across the network.
  3. Interfaces for information sources such as underwater sensors, satellites (images) and aerial photography that provide data and image collection services. Also included are simulation models that populate the databases.
  4. QoS in search and retrieval: We address quality-of-service (QoS) issues for search and retrieval of information objects. Our goal is to provide mechanisms for efficient use of resources such as processing, I/O, and memory in the servers, and bandwidth and buffers in the networks. Novel performance-based architectures are being investigated for optimized retrieval and presentation in large-scale information systems.
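To make components 1 and 2 concrete, the following sketch binds a hierarchical metadata record under a compound name and resolves it again. It is an in-memory toy loosely modelled on the bind and resolve operations of a naming service; it is not the OMG interface, and the class `NamingContext`, the name path, and the metadata field names are assumptions chosen for illustration (they are not taken from the FGDC standard).

```python
class NamingContext:
    """Toy hierarchical name-to-object map (not the OMG naming API)."""

    def __init__(self):
        self._bindings = {}

    def bind(self, path, obj):
        """Bind obj under a path such as ['crete', 'heraklion', 'salinity']."""
        *parents, leaf = path
        ctx = self
        for part in parents:
            # Create intermediate contexts on demand.
            child = ctx._bindings.setdefault(part, NamingContext())
            if not isinstance(child, NamingContext):
                raise KeyError(f"{part!r} is bound to an object, not a context")
            ctx = child
        if leaf in ctx._bindings:
            raise KeyError(f"{leaf!r} is already bound")
        ctx._bindings[leaf] = obj

    def resolve(self, path):
        """Walk the path one component at a time and return the bound object."""
        node = self
        for part in path:
            if not isinstance(node, NamingContext):
                raise KeyError(f"cannot descend into {part!r}")
            node = node._bindings[part]
        return node


# Hypothetical hierarchical metadata record; field names are illustrative.
record = {
    "identification": {
        "title": "Salinity samples, Heraklion bay",
        "bounding_box": {"west": 25.0, "east": 25.3,
                         "south": 35.3, "north": 35.5},
    },
    "distribution": {"repository": "site-a"},
}

root = NamingContext()
root.bind(["crete", "heraklion", "salinity"], record)
found = root.resolve(["crete", "heraklion", "salinity"])
```

Here the resolved object is the metadata record itself; in a distributed setting the name would more plausibly resolve to a reference to a remote repository object, with the metadata held by a separate service.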

Description of Architecture

Clients submit simple or complex queries to the Digital Library system through a World Wide Web interface, using Java-enabled browsers. The system is accessible to scientists, engineers, local authorities, and system administrators, each with different access restrictions. User requests invoke various services, such as metadata, index, and search, to locate the objects that match the user query. Agents representing users will fork sub-agents that search in parallel across the various databases (legacy and new) and collect and present the resulting information objects to the user. We assume that documents are stored in Digital Library or document databases, and that metadata services are provided to access the documents.
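The fan-out of a user agent into parallel sub-agents might look roughly like the sketch below. It assumes each database can be modelled as a local list of records and uses a thread pool as a stand-in for remote invocations; `DATABASES`, `sub_agent`, `user_agent`, and the record contents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote databases: site name -> list of records.
DATABASES = {
    "image-db": [{"id": "img-7", "keywords": {"bathymetry", "crete"}}],
    "legacy-db": [
        {"id": "leg-1", "keywords": {"salinity", "crete"}},
        {"id": "leg-2", "keywords": {"currents"}},
    ],
    "sensor-db": [{"id": "sen-3", "keywords": {"salinity"}}],
}


def sub_agent(site, keyword):
    """Search one database and tag each hit with the site it came from."""
    return [dict(rec, site=site)
            for rec in DATABASES[site] if keyword in rec["keywords"]]


def user_agent(keyword):
    """Fork one sub-agent per database, run them in parallel, merge the hits."""
    sites = sorted(DATABASES)
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        # map() yields results in the order of `sites`, whichever finishes first.
        per_site = pool.map(lambda site: sub_agent(site, keyword), sites)
        return [hit for hits in per_site for hit in hits]


results = [(hit["site"], hit["id"]) for hit in user_agent("salinity")]
```

Threads suit this sketch because the real work would be waiting on remote servers; access restrictions per user class, which the architecture requires, are left out.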

From the scenario above, when a user selects a region of a map through the WWW browser, the coordinates of the region will be used to index the appropriate information about the region. This implies a metadata service that maps the regions of the map to the appropriate region information. Users can therefore zoom into a region of the map (or image) and query for various properties of the region or perform some operations online. The information about a region is likely to be dispersed across several database sites (for example, the detailed image of the region will be stored separately from the data objects). For this service, distributed search queries will be sent to the various databases to obtain the objects. Metadata services describing the GIS objects will be used to index and search for the appropriate GIS objects (multi-dimensional data and images).
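One way to picture the region-to-information mapping is a catalog of object footprints tested against the selected rectangle, with the matches grouped by the site that stores them. The sketch below assumes axis-aligned bounding boxes in degrees and a linear scan; `Box`, `CATALOG`, `plan_region_query`, and all coordinates and site names are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def intersects(a, b):
    """True if two axis-aligned boxes overlap or touch."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


# Hypothetical metadata entries: (object id, storing site, footprint).
CATALOG = [
    ("bathymetry-grid", "site-a", Box(24.9, 35.2, 25.4, 35.6)),
    ("satellite-image", "site-b", Box(25.1, 35.3, 25.2, 35.4)),
    ("salinity-series", "site-a", Box(26.0, 35.0, 26.5, 35.3)),
]


def plan_region_query(selection):
    """Group ids of objects overlapping the selection by the site holding them."""
    plan = {}
    for obj_id, site, footprint in CATALOG:
        if intersects(selection, footprint):
            plan.setdefault(site, []).append(obj_id)
    return plan


# The rectangle a user might select in the browser (made-up coordinates).
plan = plan_region_query(Box(25.0, 35.25, 25.3, 35.5))
```

Each key of `plan` corresponds to one distributed search query to be sent to that site. A linear scan is adequate for a sketch; a large catalog would call for a spatial index.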

Conclusions

In this paper, we outline the issues in providing a distributed architecture for accessing multimedia information from spatially separated scientific databases. We provide an OMG-based object framework and architecture through which scientists, engineers, and administrators can access the scientific databases.

  1. C. Houstis, C. Nikolaou, M. Marazakis, N. M. Patrikalakis, R. Pito, J. Sairamesh, and A. Thomasic, Design of a Data Repositories Collection and Data Visualization System for Coastal Zone Management of the Mediterranean Sea, ERCIM Technical Report TR97-0200, June 1997.
  2. The Alexandria Project: Towards a Distributed Digital Library with Comprehensive Services for Images and Spatially Referenced Information, University of California, Santa Barbara. http://alexandria.sdc.ucsb.edu/.
  3. An Electronic Environmental Library Project, University of California, Berkeley. http://elib.cs.berkeley.edu/.