``Searching Information from Multiple Sources''
Prof. Clement Yu
University of Illinois at Chicago
Department of Electrical Engineering and Computer Science
Chicago, IL 60607-7053
E-mail: yu@eecs.uic.edu
URL: http://www.eecs.uic.edu/eecspeople/yu.htm

Currently, an enormous amount of information is generated by numerous people. As a consequence, information is distributed and stored at various sites. The Internet is one such example. To retrieve useful information in response to a query, a naive method is to broadcast the query to all sites. The search engine at each site then processes the query and retrieves a set of documents. These documents are merged, and an appropriately ranked list of documents is presented to the user.

This method does not use system resources efficiently, because the query may be sent to many sites that do not contain useful information, which wastes communication resources. In addition, the search engines at those sites must process the query, and they may even return some useless documents. Finally, the transmission of useless documents and their subsequent merging waste further system resources.

To reduce the waste, the contents of each database are represented by a representative. When a user submits a query, it is compared to the representatives of the databases, and the system estimates the number of documents in each database that are most similar to the query. Based on these estimates, only those databases that contain a sufficiently large number of the most similar documents are searched. The research issues we have been studying consist of:

What information should be contained in a database representative?

How can an accurate estimate of the number of documents in a database that are most similar to a query be obtained?

How can such estimates be obtained efficiently?

What is the minimal information that should be kept in a database representative? What is the trade-off between the space required to store a database representative and the accuracy of the estimate described above?

If no standard can be agreed upon for each site to supply the information needed to construct the representatives, how should the system respond to user queries?

Many sites are autonomous and employ different similarity functions to retrieve documents. Which documents should a site retrieve so that the results are consistent with the ``global similarity function'' desired by the user?

While some sites may be willing to disclose their similarity functions to the ``global system'' in order to help the system to utilize system resources efficiently, other sites may be unwilling to collaborate. How can the global system ``discover'' such information without explicit help from these local systems?

The issues described above have been studied to some extent for text databases. Do the partial solutions for text databases apply to image databases as well? How about structured databases such as relational or object-oriented databases?
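
The database selection step described above (compare the query to each representative, estimate the number of similar documents, and search only the promising databases) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not the representative or estimator studied in this work: the hypothetical `Representative` class stores only the database size and per-term document frequencies, and `estimate_matches` treats query terms as statistically independent and counts documents expected to contain every query term.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List


@dataclass
class Representative:
    """Hypothetical summary of one database (assumed format, for illustration)."""
    name: str
    num_docs: int              # total number of documents in the database
    doc_freq: Dict[str, int]   # term -> number of documents containing it


def estimate_matches(rep: Representative, query: List[str]) -> float:
    """Estimate how many documents contain every query term.

    Assumes terms occur independently of one another, which is a
    simplification chosen for this sketch.
    """
    if rep.num_docs == 0 or not query:
        return 0.0
    # Probability that a random document contains each query term.
    probs = (rep.doc_freq.get(term, 0) / rep.num_docs for term in query)
    return rep.num_docs * prod(probs)


def select_databases(reps: List[Representative],
                     query: List[str],
                     min_matches: float) -> List[str]:
    """Return names of databases whose estimate reaches min_matches,
    ordered from the highest estimate to the lowest."""
    scored = [(estimate_matches(rep, query), rep.name) for rep in reps]
    scored.sort(reverse=True)
    return [name for estimate, name in scored if estimate >= min_matches]


# Example data (invented for illustration only).
reps = [
    Representative("A", 1000, {"image": 200, "retrieval": 100}),
    Representative("B", 500, {"image": 5}),
    Representative("C", 100, {"image": 50, "retrieval": 50}),
]
selected = select_databases(reps, ["image", "retrieval"], min_matches=10)
```

With this invented data and a threshold of 10, only databases C and A are selected, in that order, and the query is never sent to B. The sketch also makes the space and accuracy trade-off concrete: a representative holding only document frequencies is small, but the independence assumption can make its estimates inaccurate.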