Roy Williams

``Questions and Recommendations for Scientific Digital Archives''
Dr. Roy D. Williams
California Institute of Technology, 158-79
Pasadena, CA 91125
E-mail: roy@caltech.edu
URL: http://www.cacr.caltech.edu/~roy

Who are the Librarians?

There are many creators and users of a scientific data archive, but few librarians or administrators. So the question we are left with is: who are the librarians for this increasing number of ever more complex libraries? Who will ingest and catalogue new data; maintain the data and software; archive, compress and delete the old? Somebody should be analyzing and summarizing the data content, assuring provenance and attaching peer review, and encouraging interaction from registered users and project collaborators. Perhaps the most important function of the librarian is to answer questions and teach new users.

How Long Will the Archive Last?

Scientific data archives contain valuable information that may be useful for a very long time, for example climate and remote-sensing data, long-term observations of astronomical phenomena, or protein sequences from extinct species. On the other hand, it may be that the archive is only interesting until the knowledge has been thoroughly extracted, or it may be that the archive contains simulations that turn out to be flawed. Thus a primary question about such archives is how long the data is intended to be available, followed by the secondary questions of who will manage it during its lifetime, and how that is to be achieved.

Data is useless unless it is accessible, unless it is catalogued and retrievable, unless the software that reads the binary files is available and there is a machine that can run the reader programs. While we recognize the finite lifetime of hardware such as tapes and tape readers, we must also recognize that files written with specific software have a finite lifetime before they become incomprehensible binary streams. Simply copying the whole archive to newer, denser media may solve the first problem, but to solve the second problem the archive must be kept ``alive'' with upgrades and other intelligent maintenance.

A third limit on the lifetime of the data archive may be set by the lifetime of the collaboration that maintains it. At the end of the funding cycle that created the archive, it must be transformed in several ways if it is to survive. Unless those who created the data are ready to take up the rather different task of long-term maintenance, the archive may need to be taken over by a different group of people with different interests; indeed it may pass from Federal funding to a commercial product. The archive should be compressed and cleaned out before this transformation.

Text-based Interfaces

While the point-and-click interface is excellent for beginners, mature users prefer a text-based command stream. Such a stream provides a tangible record of how we got where we are in the library; it can be stored or mailed to colleagues; it can be edited, to change parameters and run again, to convert an interactive session to a batch job; the command stream can be merged with other command streams to make a more sophisticated result; a command stream can be used as a start-up script to personalize the library; a collection of documented scripts can be used as examples of how to use the library.

We should thus focus effort on the transition between beginner and mature user. The graphical interface should create text commands, which are displayed to the user before execution, so that the beginner can learn the text interface.
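The idea of a graphical interface that emits text commands can be sketched as follows. This is only an illustration in Python: the function `command_from_gui`, the `Session` class, and the `--name=value` command syntax are all hypothetical and are not part of any existing archive software.

```python
import shlex


def command_from_gui(action, **params):
    """Translate a GUI action and its parameters into one text command line.

    The "--name=value" syntax is an assumption made for this sketch.
    """
    parts = [action]
    for name in sorted(params):  # sorted so the same click always gives the same text
        parts.append(f"--{name}={shlex.quote(str(params[name]))}")
    return " ".join(parts)


class Session:
    """Keeps the command stream so it can be stored, mailed, edited or replayed."""

    def __init__(self):
        self.history = []

    def run(self, command, execute):
        print(f"> {command}")         # display the text command before execution
        self.history.append(command)  # tangible record of how we got here
        return execute(command)

    def script(self):
        """Return the session as a batch script, one command per line."""
        return "\n".join(self.history) + "\n"
```

A beginner who clicks through a form sees the equivalent command each time, and the text returned by `script()` can later be edited and resubmitted as a batch job.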

In a similar fashion, the library should produce results in a text stream in all but the most trivial cases. The stream would be a structured document containing information about how the results were achieved, with hyperlinks to the images and other large objects. Such an output would then be a self-contained document, not just an unlabeled graph or image. Because it is made from structured text, it can be searched, archived, summarized and cited.
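As a rough sketch of such a self-describing result, the Python fragment below writes a small structured document and reads its metadata back. XML is used here only as one possible structured format, and the element names (`result`, `metadata`, `creator`, `command`, `image`) are invented for the illustration rather than taken from any standard.

```python
import xml.etree.ElementTree as ET


def build_result(command, creator, image_url):
    """Wrap a result in a document that records how it was produced."""
    root = ET.Element("result")
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "creator").text = creator
    ET.SubElement(meta, "command").text = command
    # The large object stays outside the document; only a hyperlink is stored.
    ET.SubElement(root, "image", href=image_url)
    return ET.tostring(root, encoding="unicode")


def extract_metadata(xml_text):
    """Pull the metadata fields out again, e.g. for cataloguing or search."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root.find("metadata")}
```

Because the command that produced the result travels with it, a reader of the document can rerun or cite the exact query instead of guessing what an unlabeled image shows.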

XML as a Document Standard

Extensible Markup Language (XML), which has been developed in a largely virtual W3C project, is the new ``extremely simple'' dialect of SGML for use on the web. XML combines the robustness, richness and precision of SGML with the ease and ubiquity of HTML. Microsoft and other major vendors have already committed to XML in early releases of their products. Through style sheets, the structure of an XML document can be used for formatting, as with HTML, but the structure can also be used for other purposes, such as automatic metadata extraction for classification, cataloguing, discovery and archiving.

A less flexible choice for the documents produced by the archive is compliant HTML. Compliance means that certain syntax rules of HTML are followed: these include closing all markup tags correctly, for example closing the paragraph tag <p> with a </p>, and enclosing the body of the text in <body> ... </body> tags. More subtle rules should also be followed so that the HTML provides structure, not just formatting.
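The tag-closing rule can be checked mechanically. The sketch below uses the standard Python `html.parser` module; the class `TagBalanceChecker` and the function `unclosed_or_mismatched` are names made up for this example, and the check covers only tag balance, not the more subtle structural rules.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records tags that are not closed properly."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            self.errors.append(f"unexpected </{tag}>")
            return
        # Anything opened after the matching start tag was never closed.
        while self.stack[-1] != tag:
            self.errors.append(f"unclosed <{self.stack.pop()}>")
        self.stack.pop()


def unclosed_or_mismatched(html_text):
    """Return a list of tag-balance problems; an empty list means none were found."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    return checker.errors + [f"unclosed <{t}>" for t in checker.stack]
```

An archive could run a check of this kind on every document it generates before the document is stored.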

Authentication and Security

There is a need for integration and consensus on authentication and security for access to scientific digital archives. Many in the scientific communities are experts in secure access to Unix hosts through X-windows and text interfaces. In the future we expect to be able to use any thin client (Java-enabled browser) to securely access data and computing facilities. There should be (at least) access-control levels as follows:

Public access: anybody on the Internet can get data in the public area, and they can also run certain query engines that use limited resources. This area also contains the ``home-page'' for the project with the usual items.
Low-security access: this area can be accessed with a clear-text password, control by domain name, HTTP authentication, a password known to several people, or even ``security through obscurity''. This kind of security emphasizes ease of access for authenticated users, and is not intended to keep out a serious break-in attempt. Appropriate types of data in this category might be prototype archives with data that is not yet scientifically sound; or data where the principal investigator still has first-discovery rights.
High-security access: access to these data and computing resources should be available only to authorized users, to those with root permission on a server machine, or to those who can watch the keystrokes of an authorized user. The data may be valuable intellectual property, and access at this level allows copying and deletion of files and long runs on powerful computing facilities. Appropriate protocols include Secure Sockets Layer (SSL), Pretty Good Privacy (PGP), secure shell (ssh), and One-Time Passwords (OTP).
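The three levels listed above can be modelled as an ordered scale, with each operation requiring a minimum level. The following Python sketch is illustrative only: the operation names and the table `REQUIRED_LEVEL` are hypothetical, and a real archive would tie the levels to an actual authentication mechanism.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """The three levels from the list above, ordered from least to most trusted."""
    PUBLIC = 0
    LOW_SECURITY = 1
    HIGH_SECURITY = 2


# Hypothetical operations and the minimum level each one requires.
REQUIRED_LEVEL = {
    "read_public_data": AccessLevel.PUBLIC,
    "run_limited_query": AccessLevel.PUBLIC,
    "read_prototype_data": AccessLevel.LOW_SECURITY,
    "delete_files": AccessLevel.HIGH_SECURITY,
    "run_long_job": AccessLevel.HIGH_SECURITY,
}


def is_allowed(user_level, operation):
    """Allow an operation only if the user's level meets its requirement.

    Operations missing from the table default to the strictest level.
    """
    required = REQUIRED_LEVEL.get(operation, AccessLevel.HIGH_SECURITY)
    return user_level >= required
```

Defaulting unknown operations to the highest level is a deliberate choice in this sketch: a forgotten table entry then fails closed rather than open.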

Once a user is authenticated to one machine, we may wish to do distributed computing, so there should be a mechanism for passing authentication to other machines. One way to do this is to have trust between a group of machines, perhaps using ssh; another way would be to utilize a metacomputing framework such as Globus, which provides its own security. Once we can provide effective access control to one Globus server, it can do a secure handover of authentication to other Globus hosts. Just as Globus is intended for heterogeneous computing, the Storage Resource Broker provides authentication handover for heterogeneous data storage.

Exceptions and Diagnostics

One of the most difficult aspects of distributed systems in general is exception handling. For any distributed system, each component must have access to a ``hotline'' to the human user or log file, as well as diagnostics and error reporting at lower levels of urgency, which are not flushed as frequently. Only with high-quality diagnostics can we expect to find and remove not only bugs in the individual modules, but also the particularly difficult problems that depend on the distributed nature of the application.
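One way to picture the two channels is with the standard Python `logging` module: an unbuffered handler plays the role of the hotline, and a `MemoryHandler` holds lower-urgency diagnostics until its buffer fills or it is flushed. The function name `make_component_logger`, the message format, and the buffer size are assumptions of this sketch.

```python
import io
import logging
import logging.handlers


def make_component_logger(name, hotline_stream, diagnostics_stream, capacity=100):
    """Give one component an immediate hotline and a buffered diagnostics channel."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    fmt = logging.Formatter("%(name)s %(levelname)s %(message)s")

    # Hotline: errors and above are written as soon as they occur.
    hotline = logging.StreamHandler(hotline_stream)
    hotline.setLevel(logging.ERROR)
    hotline.setFormatter(fmt)

    # Diagnostics: every record is buffered and written out only when the
    # buffer holds `capacity` records, a CRITICAL record arrives, or
    # flush() is called explicitly.
    sink = logging.StreamHandler(diagnostics_stream)
    sink.setFormatter(fmt)
    buffered = logging.handlers.MemoryHandler(
        capacity, flushLevel=logging.CRITICAL, target=sink)

    logger.addHandler(hotline)
    logger.addHandler(buffered)
    return logger, buffered
```

In a distributed setting the two streams would be network connections or log files rather than local objects, but the split between an urgent path and a lazily flushed path is the same.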

Data Parentage, Peer-review, and Publisher Imprint

Information without provenance is not worth much. Information is valuable when we know who created it, who reduced it, who drew the conclusions, and the review process it has undergone. To make digital archives reach their full flower, there must be ways to attach these values in an obvious and unforgeable way, so that the signature, the imprint of the author or publisher stays with the information as it is reinterpreted in different ways. When the information is copied and abstracted, there should also be a mechanism to prevent illegal copying and reproduction of intellectual property while allowing easy access to the data for those who are authorized.
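A minimal sketch of attaching an author's imprint to data is given below. It uses a keyed hash (HMAC) from the Python standard library purely as a stand-in: a real archive would more likely use public-key signatures so that anyone can verify the imprint without holding the secret. The function names `stamp` and `verify` and the record layout are hypothetical.

```python
import hashlib
import hmac
import json


def stamp(data, author, key):
    """Build a provenance record binding the author's name to the data."""
    record = {"author": author, "sha256": hashlib.sha256(data).hexdigest()}
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    # Assumption: `key` is a secret held by the author or publisher.
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record


def verify(data, record, key):
    """Check that neither the data nor the provenance record has been altered."""
    claimed = {k: v for k, v in record.items() if k != "signature"}
    if claimed.get("sha256") != hashlib.sha256(data).hexdigest():
        return False  # the data no longer matches the record
    payload = json.dumps(claimed, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record.get("signature", ""))
```

The record is small enough to travel with every copy or abstract of the data, so a later reader can check both who stamped it and whether the content is unchanged. This sketch does not address the separate problem of preventing unauthorized copying.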