``Data Publishing: An Alternative to Data Warehousing''
Dr. Kenneth W. Church
AT&T Laboratories
Florham Park, NJ 07932
E-mail: kwc@research.att.com
Data publishing is like data warehousing, but with an emphasis on distribution. We don't want ``Roach Motels,'' where data can check in but can't check out. What is important is not the size of the inventory, but how much is shipped. The more people who look at the data, the more likely it is that someone will figure out how to put it together in just the right way. Unfortunately, as a practical matter, computer centers are so expensive that only big-ticket applications can justify the cost. Others cannot afford the price of admission. So much more could be accomplished if only we could find ways to make it cheap and convenient for lots of people to look at lots of data.
Data Warehousing                | Data Publishing
--------------------------------+------------------
Centralization                  | Distribution
Large Investment                | Affordability
 -- Big Databases and Indexes   |  -- CD-ROM/Web
 -- Support (computer center)   |  -- Shrink-wrapped
 -- Big Iron                    |  -- No Iron
 -- Provision for peak load     |  -- ``No load''
 -- On-line                     |  -- Off-line
High payoff applications        | Mass Market
Datamining need not be expensive. Large datasets are not all that large. Customer datasets are particularly limited. AT&T has a relatively large customer base, approximately 100 million residential customers. If we kept a kilobyte of data on each customer to store name, address, preferences, etc., the customer dataset would be only 100 gigabytes. The hardware to support basic datamining operations on a dataset of this size costs about $15,000 today ($100/gigabyte for the disk space + $5,000 for the CPU). In a few years, I would hope that anyone could process a mere 100 gigabytes with whatever word processor is already on their desk.
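The customer-dataset estimate above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the variable names and the helper `dataset_gigabytes` are illustrative, and it assumes decimal units (1 kilobyte = 1,000 bytes, 1 gigabyte = 10^9 bytes), which the text does not specify.

```python
# Back-of-envelope sizing for the customer dataset, using the figures in the text.
# Assumption: decimal units (1 GB = 10**9 bytes); the text does not say which.
GIGABYTE = 10**9

customers = 100_000_000        # approximately 100 million residential customers
bytes_per_customer = 1_000     # "a kilobyte" per customer
disk_cost_per_gb = 100         # dollars per gigabyte of disk space
cpu_cost = 5_000               # dollars for the CPU


def dataset_gigabytes(records, bytes_per_record):
    """Return the size in gigabytes of `records` fixed-size records."""
    return records * bytes_per_record / GIGABYTE


customer_gb = dataset_gigabytes(customers, bytes_per_customer)
hardware_cost = customer_gb * disk_cost_per_gb + cpu_cost

print(f"customer dataset: {customer_gb:.0f} GB")
print(f"hardware cost:    ${hardware_cost:,.0f}")
```

Under these assumptions the dataset comes to 100 gigabytes and the hardware to $15,000, of which $10,000 is disk.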
A more interesting challenge is to mine purchasing records, but even these datasets are not all that large. The 100 million customers purchase about 1 billion calls per month. This kind of data feed is usually developed for billing purposes, but many companies are discovering a broad range of other applications such as fraud detection, inventory control, provisioning scarce resources, forecasting demand, etc. If we were to store 100 bytes for each call, a generous estimate, then we would need 100 gigabytes of disk space per month (= $10,000/month at $100/gigabyte). Thus, a month of purchasing records is about the same size as the customer dataset. A year of purchasing records is about a terabyte ($100,000 of disk space).
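The purchasing-record figures follow from the same kind of arithmetic. This sketch is illustrative only: the variable names are not from the original, and it again assumes decimal gigabytes and a flat $100/gigabyte.

```python
# Storage estimate for call (purchasing) records, using the figures in the text.
# Assumption: decimal units (1 GB = 10**9 bytes) and a flat disk price.
GIGABYTE = 10**9

calls_per_month = 1_000_000_000   # about 1 billion calls per month
bytes_per_call = 100              # "a generous estimate" per call
disk_cost_per_gb = 100            # dollars per gigabyte of disk space

monthly_gb = calls_per_month * bytes_per_call / GIGABYTE
monthly_cost = monthly_gb * disk_cost_per_gb

yearly_gb = 12 * monthly_gb
yearly_cost = 12 * monthly_cost

print(f"per month: {monthly_gb:.0f} GB, ${monthly_cost:,.0f}")
print(f"per year:  {yearly_gb / 1000:.1f} TB, ${yearly_cost:,.0f}")
```

Twelve months at these rates comes to 1.2 terabytes and $120,000, which the text rounds to about a terabyte and $100,000.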
A hundred gigabytes or a terabyte might seem like a lot, but it really is not. The cost to replicate a dataset of this size is very reasonable, approximately $10,000 to $100,000, or about what we used to spend on a workstation in 1980. Assuming that computer prices continue to fall, datamining stations will soon become about as commonplace as workstations are today.
Of course, there are many other challenges that need to be overcome before datamining stations become a commodity: