Articles

homeoff.gif (237 bytes)  servoff.gif (356 bytes)  knowon.gif (598 bytes)  compoff.gif (406 bytes)  contactoff.gif (405 bytes)

arton.gif (326 bytes)

presoff.gif (457 bytes)

outoff.gif (411 bytes)

Modeling a Data Warehouse

By Neil Raden

Organizations have an insatiable appetite for data, but frequently lack the essential enzymes to digest it. It sometimes seems like the right brain is preoccupied with capturing and storing information in greater volume and in more minute detail every day, while the left brain is at a loss to make use of it. The situation reminds me of the story in which H.L. Mencken is asked by a young reporter how much he would pay for 10,000 words on the San Francisco Earthquake. Mencken's reply is now a classic: "Which words? In what order?" The point is that the volume of data is irrelevant if it isn't organized in a manner to add value. And value to an organization means turning data into actionable information. Multidimensional modeling (MDM) is one approach to making sense of this mountain of data.

What Is Multidimensional Modeling?

Sometimes introducing a new topic is easier when describing what it is not. Do you remember being told that "A car is not a toy," or "Life isn't easy?" The implication was clear, and a lot of murky description was avoided. Let's start in the same vein with Multidimensional modeling (MDM). First of all, MDM is not tied to a physical representation of data. It is not inseparable from multidimensional databases (like Essbase, Express, Gentium). It does not involve squeezing information into cubes. It isn't terribly difficult to understand or to practice, and it certainly isn't new or the latest fad. Despite the inordinate amount of attention MDM is getting these days, the concept has been around as long as end-user computing, starting in the 1960's with products like Express and APL. And even the lowly spreadsheet is a multidimensional model, albeit a two-dimensional one.

iw0196_a.gif (3842 bytes)

What MDM is a technique for conceptualizing business models as a set of measures described by ordinary facets of business. It is particularly useful for sifting, summarizing and arranging data to facilitate analysis. In contrast to the techniques for designing On-Line Transaction Systems (OLTP), which rely on entities, relationships, functional decomposition and state transition analysis, MDM uses facts, dimensions, hierarchies and sparsity. Multidimensional models are about numeric data, such as values, counts, weights and occurrences. Though a typical OLTP problem statement might be, "Model the order fulfillment process," a MDM problem statement is, "What is my profitability by customer over time, by organization?"

Part of the allure of MDM's is their simplicity. A fully normalized OLTP design for an order entry system may have dozens or even hundreds of tables, and making sense of the design for analysis is daunting. Even early Woody Allen movies aren't as funny as watching a well-intentioned data modeler trying to do a "walk-through" of the model with the (euphemistically named) "business people." MDM's, on the other hand, are quite simple, even elegant in comparison. They not only are understood by the "business people," they are expressed in a way that is quite natural to them. What's more, a good conceptual MDM can be implemented in a relational database, a multi-dimensional database or even an object-oriented database.

The Multidimensional Model Compared to the Relational Model

The characteristics of the relational model are well understood; its ability to support operational processes is its raison d'être. In contrast, the multidimensional model is designed to support the reporting and analytical needs of knowledge workers, and can best be described by contrasting it with the relational model in four fundamental ways:

  • The MDM views information from the perspective of a "slice of time" instead of atomic transactions
  • The MDM is globally consistent for the enterprise; the relational model is internally consistent
  • OLTP systems provide for detailed audit trails, MDMs are better for the big picture
  • Relationships are modeled explicitly in the relational model, implicitly in the multidimensional model

                           iw0196_b.gif (4453 bytes)

Transaction View versus Slice of Time

OLTP systems record events, or transactions. Examples are journal entries, purchase orders, billing items, hamburgers dropped on the floor or all those collect calls on Fathers' Day. The multidimensional data model is not concerned with actual events, only the quantitative result of them at some interval in time, such as days, weeks or months.

Local Consistency versus Global Consistency

A properly designed OLTP system is consistent within its own scope. For example, a General Ledger is locally consistent if it properly records all of the relevant transactions, performs allocations and maintains its account master. But if the data warehouse combines information from several General Ledger systems, each with its own chart of accounts, global consistency is not automatic. The multidimensional model starts from a globally consistent view of the enterprise, meaning a single chart of accounts, in this case.

Audit Trail versus Big Picture

When a customer has a question about their credit card bill, they want to see every transaction. When your overnight package is lost, you want to know who was the last person to see it intact. The systems that manage this information ensure the highest level of confidence and security. Is it any wonder that they have shortcomings when dealing with analytical questions? The multidimensional model is designed to answer questions like, "Will I make money on this deal or not?" or "Who are my best customers and why?" or "What opportunities are we missing?"

Explicit Relationships versus Implied Relationships

Entity-Relationship modeling is the heart of the relational model. The explicit relationships between customers and sales orders, or between hamburgers and buns are burned into the design of the database. In multidimensional modeling, these relationships are implied by the existence of "facts" at the cross-section of dimensions. For example, if there are sales dollars to Customer 101, of Product 2345, then the relationship between customer and product is implied.

Option 1: The Classic Star Schema

iw0196~1.gif (4538 bytes)

In the star schema for database design, each dimension is described by its own table and the facts are arranged in a single large table, indexed by a multi-part key that comprises the individual keys of each dimension.

DATA: INFORMATIONWEEK

HOW-TO: SIX BASIC QUESTIONS IN MULTIDIMENSIONAL MODELING

MDM isn't concerned with the movement of information and events (in general), but rather is concerned with a series of "slices in time." The first step in building a model is to pick a business subject area (weekly sales reporting, monthly financial statements, insurance claims costs) and model it by asking six fundamental questions.

  1. What business process is being modeled?
  2. What are the measures (or "facts")?
  3. At what level of detail (granularity) is "active" analysis conducted?
  4. What do the measures have in common (the dimensions)?
  5. What are the attributes of the dimensions?
  6. Are the attributes stable or variable over time, and is their "cardinality" bounded or unbounded?

First, a few definitions. "Active" analysis means the ability to manipulate information. This can refer to mechanical manipulation, like pivoting, drilling or graphing; agent-based manipulation, like alert reporting, exception reporting, or other agent-based activity; or workflow manipulation, like publishing, distributing or notification. Certain products allow a "reach-through" or "drill-through" process, where analysis can occur at a summary level, but detail at lower levels is only available in listing format; it is not actionable. That is why you must define the boundary for active analysis. It defines the limits of your manipulative analytical reach.

"Cardinality" is just a lofty way of saying "how many." Since the frustrated math teachers who invented relational theory based their model on mathematical set theory, the appellation stuck. Boundedness is another obtuse mathematical term, which simply refers to whether or not there is an upper limit to the cardinality. If you build your multidimensional model in a relational database, unbounded cardinality is expected. The opposite is true in the multidimensional databases, where changes in cardinality often require complete reorganization of the database, an extremely time-consuming process. In any case, this is why cardinality affects the design of your model.

Moving through the questions: Trying to do everything at once should be avoided. Instead, concentrate on a bounded area like customer profitability or sales reporting. One exercise we use in our practice is to have clients formulate a simple statement describing the intent of the model. For example, "Net sales, in dollars and units, of every product, at every store, by week, by company for the last three years, compared to shipments and budgeted sales." This simple statement gives us the business process (retail sell-through data), the facts (Sales $, Sales units, return units , shipment units, price), the dimensions (customer, product, time and view) and the granularity or lowest level of detail (sales, shipments and returns of individual products, aggregated by week and counter). The attributes are worked out by evaluating the rich features of the dimensions.

Following the six questions will help you refine your thinking and lead to a solution that is manageable. Before you start, though, you will need a lot more information about facts, dimensions, attributes and, a major problem in MDM, handling sparsity.

Facts, Dimensions, Attributes and Sparsity

The basic components of the multidimensional model are facts, dimensions, attributes and sparsity.

Physical Representation in Relational Databases

                      iw0196_c.gif (4545 bytes)

The first two components, facts and dimensions, are represented physically in a relational database as tables. In the simplest MDM, the basic star schema, they are the only tables. In the star schema, each dimension is described by its own table, and the facts are arranged in a single large table, indexed by a multi-part key that is made up of the individual keys of each dimension.

There are many variations on the simple star schema, but they all share the basic concept of fact tables and dimension tables. In many cases, not all facts share the same dimensionality, and multiple fact tables may be used. An example is selling price, which may not vary between markets or customers, or the Federal Income Tax rate.

When dimensions are large (high cardinality), it often makes sense to split the dimension tables at the level of the attributes, also known as the "snowflake schema." Storing aggregations and derivations yields even more exotic combinations, sometimes referred to as "decomposed stars" or "constellation schema." The point to keep in mind is that dimensional modeling for relational databases is designed to create fact tables, usually long and thin, and relatively small, short and wide, dimension tables. While the fact tables contain the actual numeric information, all of the interesting information is in the dimension tables. Queries are designed to exploit this by using the dimension tables for counts, control breaks, aggregation paths and searching for properties of the elements. Many queries can be resolved without even touching the fact table. When "facts" are needed, the method is to gather the key values from the dimension tables and then pull the matching records from the fact table, avoiding costly and time consuming table scans and complex joins.

<Picture>The second two components, attributes and sparsity are not "entities" or represented as tables. Attributes are the extended descriptions and hierarchies of the dimensions, such as brand, color and size in the Product dimension (see diagram at left of an extensive product dimension). Sparsity is handled implicitly. In the simple example shown, it is likely that facts (Dollars, for example) will exist for only a small fraction of the Cartesian product of Product, Market and Period. For example, assuming that the model has a total of 120 markets, 20,000 products and 1,098 days (or 2,625,200,000 possible combinations), there may in fact be actual sales for only 1% or even 0.1% of the possible combinations, sometimes described as 99% or 99.9% sparse, respectively. The star schema model handles sparsity by simply not recording records where those combinations are invalid. In a simple star, that translates to 26,252,000 rows at 99% sparsity or roughly 1-2GB for the model shown.

Facts

Since dimensional modeling always involves hierarchies in dimensions, aggregation of information is a key element in the usefulness of the model. Since aggregation is an additive process, it is best if facts are limited to additive, numeric values. It is possible to deal with non-additive facts (text, for example) but unless the text is unique for every record in the fact table, it belongs in the dimension tables. Some facts are only semi-additive, like certain counts. Some facts are numeric and still non-additive, such as prices or rates. Special care is needed in handling these facts, since they can add the potential of incorrect answers to queries. Some front-ends add functionality to deal with this, but the handling of non-additive facts is an advanced topic in data warehouse design.

Selecting the facts for your model is relatively simple: once a business subject is selected, the list of facts is the answer to the question, "What are we measuring?"

Dimensions

Dimensions are the classes of descriptors of the facts. If the fact is sales, the dimensions might be time, geography, customer and product. We say that dimensions are "classes" because they can be broken into "attributes." The attributes of a product dimension, for example, may include a hierarchy of item, brand, class, division. Or the attributes may be descriptive, such as "large metro", "medium size," "down market" or "yes."

The concept of dimension is fundamental to the MDM. Most modelers and vendors by now use the phrase "common business dimensions" as if it were obvious just what they are. In fact, the most-often used dimensions are surprisingly common across applications and industries, but it's in the differences that the real challenge lies. For example, it is rare that a MDM does not include time as a fundamental dimension. It is so common, in fact, that most tools either have or will shortly implement time intelligence as a feature, which means that calendar conversions and day->week->month->quarter->year roll-ups are automatic and even some special cases like time ranges (promotion periods, seasons, etc.) and fiscal periods are handled without much effort.

Other so-called common business dimensions are Geography, Company (meaning the hierarchical breakdowns of your organization), Customer and View/Scenario (such as budget /actual/forecast). In fact, the three most common MDM's are financial reporting, retail sales reporting and consumer products sales reporting. Here is how the "typical" dimensions line up:

Time Geo. Org. Cust. View Retail yes yes yes yes ? Consumer yes yes yes yes ? Financial yes no yes ? yes

If you look carefully at these dimension, you may wonder if this is really the proper way to divide things. After all, geography and customer may be closely related. The key point to remember is this: if you separate attributes into dimensions, then you must have data at that detailed level. The opposite is also true: if you collapse two or more dimensions into one, you typically lose detail. For example, if my customer is Macy's, and I have a geography dimension, then I must capture information at the various geographical locations, but not all of the geographical locations, just the valid ones - remember what we said about sparsity. The alternative is to combine dimensions, with dimensional members like "Macy's East," "Macy's West" and "Bullocks." In the latter case, we've chosen to not capture geographical detail in all instances. The drawback is that intermediate aggregations based on geography are not feasible in this scheme, except by hard-coding (East = Macy's East, West = Macy's West + Bullocks). But if there isn't a clean division at the granularity level, it won't work; if "Bullocks" in the example above included locations in the West and South, regional aggregations are out.

An even greater drawback is that if the hierarchy of an attribute (like geography) is not expressed explicitly in the design, drill-down is impossible. That is, unless the navigation tool can understand the relationship, it can't operate on the information. In the combined customer/geography example above, this problem is solved if the multidimensional database can understand multiple hierarchies in the same dimension; in the Relational/OLAP products, a relationship table depicting the multiple hierarchies is needed in metadata. It is mentioned here to help you understand how the tools interpret the relationships.

Attributes

Attributes provide the depth of the dimensions beyond identifying codes. Through the attributes, we can find the hierarchies of the dimensions, detailed descriptions of each dimensional element and extended properties. It is common for dimension tables to have over 100 columns of attributes; properly designed queries or front-end query generators that understand MDM exploit these highly denormalized structures to return complex queries quickly. Let's look at how this works.

Assume that our model has a product dimension with an attribute for "type" with "low-fat," "ethnic," "premium," "plain-label" and "special diet" as possible values. This is important, and is a key differentiater between relational and multidimensional implementations of the MDM. Even though there is no defined hierarchy for these attributes, it is possible to generate queries very quickly by using the power of the star schema. Try the following example: How well are "low-fat" products selling this quarter versus the same quarter last year?

In SQL, this can be done using the correlated subquery as follows:

select product, dollars as sales_95,
select dollars from market cm, product cp, period ct, sales cs
where ct.month = pt.month
and ct.year = pt.year-1
and cp-product = pp.product
and cm.city = pm.city
as sales_94
from market pm, product pp, period pt, sales ps
where year = 1995
and quarter = 4
and product in (select product from product where type = "low-fat")

Notice the last select statement: this is the key to star schema - collect the keys from the dimension table by looking for constraints there, not the fact table. This is how table scans are avoided. Unfortunately the query optimizers of many relational databases are not clever enough to understand this yet, but they are beginning to catch on. Though it is more clumsy, the same result could be achieved without the correlated subquery by using multiple SELECT statements and TEMP tables, with a final join. Unfortunately, this is a weakness in SQL, which simply can't do an outer join and stitch the rows side-by-side in a sort-merge. The correlated subquery is a little exotic, but your front-end should mask this complexity from users. For those relational databases that are thinking clearly about multidimensional modeling, particularly Red Brick, the response time for these queries is exceptional.

Summary

We covered the basics in this article, but there is much more to MDM than basics: advanced topics in schema design for relational databases, techniques for designing multidimensional databases (which we did not cover), the complexities of multiple hierarchies, cross-dimensional relationships, partial dimensionality and handling of partially-additive and non-additive facts. The dimensional models for some processes are far more complex than the examples presented here, especially those that involve complex products, like financial instruments or insurance; any use of "householding" such as credit and or direct marketing; subscription schema involving prepaid packages of services; and itinerary-based models. If your first foray into multi-dimensional models involves any of these aspects, it is best that you enlist the help of skilled practitioners who can assist you.

Multidimensional modeling is more difficult to describe than to learn. All that is really needed is an understanding of the business process being modeled, the discipline to keep the model from expanding unnecessarily and a few basics. A healthy by-product of every modeling effort is a deeper and clearer understanding of the elements of your business. Very few participants in a multidimensional modeling effort come away without having learned something valuable about the drivers and relationships of the process under consideration.

--------------------------------------

Neil Raden is the president and founder of Hired Brains, Inc., an international data warehousing consultancy with offices in California, Chicago, Toronto, and New York.  He has authored dozens of magazine articles, books, and white papers on information technology implementation, decision support, data warehousing, and OLAP, and is a frequent speaker on these topics. 

Copyright (c) 1995 by Neil Raden. All Rights Reserved No portion of this document may be reproduced without the written consent of the author. Your comments are welcome via E-mail at nraden@hiredbrains.com .


homeoff.gif (237 bytes)  servoff.gif (356 bytes)  knowon.gif (598 bytes)  compoff.gif (406 bytes)  contactoff.gif (405 bytes)

Copyright © Hired Brains Incorporated, 2002. All rights reserved.
Webmaster@hiredbrains.com