Articles

homeoff.gif (237 bytes)  servoff.gif (356 bytes)  knowon.gif (598 bytes)   compoff.gif (406 bytes)  contactoff.gif (405 bytes)

arton.gif (326 bytes)

presoff.gif (457 bytes)

outoff.gif (411 bytes)

Maximizing Your Warehouse

Technology Tutorial, Part 1

By Neil Raden

Ever since it organizations began taking relational databases seriously, the entire industry has fixated on one benchmark: transaction throughput. All the hardware and software, the methodologies and tools, all the available bandwidth-human and machine-have been concentrated on moving huge volumes of data into computers as quickly, securely, and inexpensively as possible. u But the recent sea changes in computing-online analytical processing (OLAP) and data warehousing-caught many players by surprise. Suddenly, the quality and performance of loading, indexing, and query processing is an issue.

Data warehousing isn't a trickle anymore, it's a deluge. The Meta Group reports that 95% of companies surveyed either have or will begin building a data warehouse in 1996. This attention on getting the data out of the databases only illuminates the inadequacies in the current products. Like shining a light on the ugly troll under the bridge, flaws in the design methodologies, query optimizers, index structures, join algorithms, and even the core database engine are out in the open.

What To Look For In A Data Warehouse Database

Auto-aggression at load time, aggregation scheme advisors, or aggregate navigatoes to direct the query optimizer to use the highest level of aggregation

Bitmapped indexes for variables where the range of unique values in small, such as gender, color, and size

Join indexes for queries that span multiple large tables

A query optimizer that understands the star schema, especially real-world applications, not simplifications

A query optimizer that can handle complex queries by switching the plan in mid-query based on interim results

DATA:INFORMATIONWEEK

With the renewed emphasis on access to data, database vendors are scrambling to address the special needs of data warehousing. But while you wait for the technology to change, there are steps you can take now to improve the performance of your own data warehouse.

In this first of two articles, we will examine improvements in query optimizers and indexing, and describe how some vendors are addressing the special needs of data warehousing. In Part 2 next week, we will discuss parallelism and core technology improvements, including algorithm development, aggregation support, and the inclusion of new OLAP technologies.

The Problem

What is so special about data warehousing? In almost every way, data warehousing stretches the capabilities of database management systems. In the loading of a data warehouse, data extracted from operational systems often is added in a batch of millions of records at a time. Systems designed for transaction processing are overwhelmed by the volume of data. A data warehouse also needs special tools that eliminate the extraneous overhead of transaction logging, rollback/ commit, and incremental referential integrity checking. In addition, a common approach in data warehousing is to store aggregated information in the database, avoiding slow and costly "group by" operations by individual queries. Some critical components to look for in a DBMS are auto-aggregation at load time, aggregation scheme advisers, and aggregate navigators to direct the query optimizer to always use the highest level of aggregation available. These facilities are partially implemented in some specialized databases, but absent from all the merchant databases, such as those from Oracle, Sybase, Informix, and Ingres.

To facilitate the types of queries common in decision support, the indexes in data warehousing are larger and more complex. Rebuilding these indexes can be time-consuming, resulting in update cycles that are too long. To get the best performance from a data warehouse, look for DBMSs that have new index types (bitmaps, join indexes) and index building processes, including incremental indexing-using the existing index rather than the base tables as a starting point-and parallelizing the index processing.

A Simple Star Schema

iw_0396a.gif (3900 bytes)

DATA: INFORMATIONWEEK

OLTP Query Requirements

Online transaction processing systems are implemented in relational database systems with a complicated database schema, based on normalization rules that are the gospel of database designers. These designs simplify the job of updating databases by teasing out redundancy from the data.

In simple terms, normalizing data speeds transactions because the new or updated values don't need to be written into many different places.

Normalization and traditional B-tree indexes work well under the type of simple queries used in a transaction processing system. The B-tree index is designed for adding keys to existing indexes without having to sort or reindex the entire table.

But when an analytical query is made against this normalized design, the number of tables and alternatives in join paths can quickly overwhelm the query optimizer, which may join tables mechanically or force full-table scans, or both. The result is skyrocketing response time and query cost. With this in mind, IS shops shy away from this type of query and attempt to control access to the database through canned screens or forms.

To understand the implications of query processing on performance, consider an Accounts Payable system. A typical query might be to look up an overdue invoice or to generate a list of paid invoices from a particular vendor. The launching point for a query is generally a form, where the requester has the ability to specify the parameters of the query (vendor ID and dates). The developer tunes each form for the best response time by ensuring that the query makes adequate use of the index and that table joins are kept to a minimum. If the query requires multiple passes, the developer controls the order of the SQL SELECT statements and the order in which the tables are queried.

The drawback to this approach is the lack of a truly ad hoc query environment. In fact, the single most compelling motivation for building a data warehouse is to separate the ad hoc queries from the production system, which bogs down under the demands of free-form queries. One truism we have learned in our practice is that the most complex, demanding queries are generated by those with the least understanding of the underlying systems. This is a rather sad state of affairs, given the great (but unkept) promise of database systems to answer any question.

The biggest drawback, though, is that the IS group has never gotten free of the requirement to develop custom programs for each request. Since the highly formalized OLTP queries have placed little stress on the query optimizers, development of these tools by the database vendors has lagged behind more pressing OLTP issues, such as reliability, availability, and transaction speed.

The domain of data warehousing is analysis. Contrast an operational query (looking up a particular invoice) with a typical decision-support query that might ask the Accounts Payable system: "Show me the average turnaround time for payment of an invoice this year versus last year to this date, broken out by five invoice size categories ($0 to $1,000; $1,001 to $5,000; $5,001 to $10,000; $10,001 to $100,000; more than $100,000)." These types of queries can wreak havoc with a normalized database. To the extent that these queries can be controlled from managed query environments, similar in concept to OLTP forms, a certain amount of performance can be wrung out of the existing technologies and approaches.

But there are drawbacks. First, the IS organization is still in the business of responding to specific report requests-a slow and expensive process. Second, the usual approach is to build more indexes to insure that each query can be resolved without scanning the entire table. The downside is that while an index generally speeds queries, processing indexes slows down updating.

The Solution

Some clear-thinking people in prehistory (early 1980s) were alarmed at this paradox. The better the designs became-meaning the more closely they adhered to the normalization creed-the more query response times degraded. Determined to make relational databases live up to their promise of easy access to information, they conceived an alternative design methodology. Their approach was simple: Since the query optimizers get confused at more than a handful of tables, the answer is to build a database with only a handful of tables. Since this technique runs at right angles to transaction processing design principles, the experts recommended separate databases, or what are now called data warehouses. The design technique is known as dimensional modeling or the star schema.

Since understanding the basics of the star schema is fundamental to the rest of the discussion, I have presented a simple star schema (see chart, p. 43). Though this model is useful for discussion, keep in mind that real applications are more complex than this, often involving a dozen or more dimensions and multiple fact tables.

In fact, one of the hurdles that the database vendors need to overcome is a lack of appreciation for the variation in star schema.Models such as the one on p. 43 are useful only for describing the basic concept.

 

A Query Using Star Joins

SELECT customer_id, sum(dollars)                                                                     FROM customer table a, facts b, products c, time d,                                     WHERE income_category = HIGH AND                                                                       education_level = MASTERS AND                                                                              zip_code = 95032 AND season =                                                                              "Spring"                                                                       AND product = "Cruise Package" AND                                                                              season = "Spring"                                                                       AND a.id = b.id AND b.prod AND d.id                                                                               =   b.id

This query, which joins four tables, can confuse the query optimizer because the large tables with potentially millions of records appeat too early in the join process. A star-savvy query optimizer should know to select and join the dimension tables before joining with the fact table.

DATA: INFORMATIONWEEK                                  

Basically, a star schema is characterized by two types of tables: fact tables, which are rows of indexed facts (numeric values), with a primary key composed of foreign keys from dimension tables (highly denormalized descriptors of the facts). In the example shown, the dimensions are Store, Time, and Product, which form slices in which we can examine facts-sales in dollars and units and price. The power of the star schema is to answer questions such as, "What are the 10 fastest-growing products this year versus last year, and what are their contributions to total sales?"

The problem is that the star schema creates challenges for relational databases. First, the query optimizers are built to support transaction processing and need specific improvements to support the star schema, which I discuss below. Second, since a star schema, by design, requires multiple joins of very large tables, it does not handle queries that search out a needle in a haystack-that is, a single record.

Query Optimization

A particularly fertile area for improved query performance is the query optimizer, whose job is to interpret the question that is posed to the databases and to find the best plan for resolving it. Query optimization is one of those gee-whiz aspects of current technology that go largely unnoticed because they operate so transparently.

On the down side, query optimizers are not all they could be, and they are often ineffective. This is one area where the mainstream databases are particularly vulnerable to specialized databases (e.g., Red Brick Warehouse from Red Brick Systems in Los Gatos, Calif., and Sybase IQ), since the latter are completely focused on the "getting the data out" problem, not on transaction speed.

But there is no fundamental reason why a query optimizer could not be designed to provide the best plan for both small transactions and huge, multitable decision-support joins. That, in fact, is exactly what the mainstream database vendors are scrambling to add to their products.

How does an optimizer work? Although the internals of every optimizer are trade secrets, what is secret is usually the implementation of well-known principles. For example, a cost-based optimizer might try to minimize the physical reads of the database. To do this, it consults statistics in the system catalog on the cardinality (number of records) in each table, the ranges of values in an index, and perhaps even histograms-meaning the distribution of values in a particular column. Applying its proprietary algorithms, the query optimizer selects the best option or path.

Query optimizers built to support OLTP need specific improvements to support the star schema. The wrinkle is the number of options to consider. Certain rules, or heuristics, are applied to reducing the search space, or narrowing the number of access paths available to the optimizer to satisfy a query. The problem is these heuristics are useful only if they incorporate knowledge about the type

of questions being asked. But the heuristics employed by the mainstream optimizers are pretty weak for decision-support queries, and often lead to the selection of a poor path. Another problem is that the size of data warehouses makes it prohibitively expensive to recompile statistics, which further weakens the effectiveness of the cost-based optimizers.

Star Joins

Data warehousing pioneers such as Red Brick Systems rely less on statistics and more on a burned-in understanding of the star schema and its variations. This is particularly noticeable in the performance of so-called star joins (see chart, p.44)

This query requires the joining of four tables, at least two of which can potentially have millions of records. One sure sign that the query optimizer is confused is the appearance of the large tables too early in the join process. A star-savvy query optimizer should know that selection and joining of the dimension tables should precede the join with the fact table.

Unfortunately, this is a simplification. Early attempts by mainstream database vendors to support star joins were limited in the number and cardinality of dimensions because they applied Cartesian (cross-product) processing on pairs of tables at a time, which could quickly exhaust system resources. The problem is some star schema designs have 25 to 30 dimensions. With this volume, cross-products simply can't work. The solution is a different approach that employs a join index, which I will describe more fully in the section on indexing.

More Difficulties

Another drawback with cross-product processing of the dimensions is that many star-schema designs involve multiple fact tables. Consider, for example, a design that includes point-of-sale transactions, orders, and shipments in three fact tables, all sharing the same product dimension.

A query to illustrate inventory movements over time would need to intersect the product code key on each of the fact tables-an impossibility with cross-product processing because each fact table can have more than a billion rows.

A useful and promising technique is to improve the query optimizer by allowing it to adjust its approach in mid-query. For example, star joins and star schema are only one type of design for analysis, which may be called an outside-in query (given these constraints, do some operation on the records that satisfy them).

When tables reach tens or hundreds of millions of rows, B-trees and hash indexes become unwieldy. But some queries-for example, database marketing, as shown in the chart below-use a fact as a screening condition, and return attributes (name, address, and zip code) as a result of the query, rather than a tabulation of values. These queries, which may be called inside out, present a different problem.

The query shown is similar to the star query on p. 44, but the distinguishing factor is that at least one of the constraints is on a fact, not a dimension attribute. This query not only requires different indexing, described below; it also combines elements of the outside-in query. What is required is a notion of dynamic optimization at run-time, such as converting from a star join in the middle of executing the query to a parallel relation scan. Only specialized database vendors have tackled this problem to-date, but some of the relational database vendors are examining it.

SQL Rewriting

Another exotic technique is the ability of the query optimizer to rewrite the structured query language presented to it into a more efficient form. Even simple requests in SQL can be constructed in several ways, and performance shouldn't depend on anyone's locution. Unfortunately, this is generally not the case, as different phrasing can cause the query optimizer to behave in odd ways. A query optimizer that has to respond to an infinite variety of phrasings will be slower and more limited in finding the optimum path. One approach to resolving this dilemma is to recast the SQL in phrasing that the optimizer is comfortable with. The hope is that the complexities in dynamically rewriting SQL at run-time will be more than offset by a simplified query optimizer, since it will only have to digest sanitized queries. There are no production releases of this technology, but some vendors are working with the concepts in their labs.

Indexing

Traditional relational database indexing is fairly uniform across products and generally comes in two flavors: B-trees and hashes, each of which has its strengths and weaknesses. For the most part, neither one is satisfactory for decision-support environments.

First, these indexes essentially map column values of rows in a single table and provide no information about the intersection of these values across tables. Second, the content of the index is either the attribute value; a unique, internal identifier of each row, called a row ID or RID; or the unique key of the row that contains it. When tables reach tens or hundreds of millions of rows, these indexes become unwieldy, consuming huge amounts of memory and/or causing swapping and thrashing in memory. Third, the index structures are organized for transaction databases and do not take advantage of the stability of the data in a data warehouse between bulk update periods.

An Inside-Out Query

SELECT customer_name, address, zip_code                                                          FROM customer table a, facts b, products c, time d,                                       WHERE income_category = HIGH AND                                                                       education_level = MASTERS AND                                                                            dollars > 10000                                                                       AND product = "Cruise Package" AND                                                                            season = "Spring"                                                                       AND a.id = b.id AND b.prod = c.prod                                                                            AND d.id = b.id

This query is similar to the star-joins query, but at least one of the constraints is on a fact (dollars > 10000), not a dimension attribute. The query optimizer needs to be able to dynamically switch the plan in mid-query based on interim results.            

DATA: INFORMATIONWEEK                                                               

Two techniques for decision-support databases are the bitmapped or bit-vector index and the join index. Red Brick has incorporated its own version of a multitable join index, the StarIndex, since its initial release, and was the first of the major players in the data warehousing to release a production version of a bitmapped index. Other vendors have commercial versions of bitmapped indexes, including Mercantile Software Systems in Piscataway, N.J. This year will see more entries in this arena, including Oracle and Sybase IQ. Informix will support join indexing in version 8 of its database. A version of join index processing has been a part of Ingres for years.

Join Indexes

The typical index in a relational database is a B-tree that maps the values in the indexed column to a RID of a single table. Join indexes, whether or not implemented as B-trees, map the column values to the RIDs of more than one table. For example, in the table on p. 48, the RIDs of the three dimension tables are mapped to the fact table RIDs and combined in a single index.

Please note there is variation in the way these indexes can be sorted, but the effect is the same: to locate the rows of one or more tables by arbitrary column values in another table. The example shown is a simplified version of a star index, but one could just as well create separate join indexes for any column in a dimension table and its matching RIDs in the fact table. Another option is to create these compound star indexes for multiple fact tables, and "join" the fact tables by joining the indexes.

Join Index For A Star Schema

iw_0396c.gif (6993 bytes)

DATA: INFORMATIONWEEK

Multitable join indexes haven't been exploited in relational databases because they perform best when the database is read-only; online updates with these pre-join schemes are conceptually difficult and probably sluggish. But the importance of data warehousing creates a market for these advanced indexing techniques, and they are being developed by virtually every vendor.

Bitmap Indexes

One issue that needs to be resolved, however, is a method for keeping the index as small as possible, and bitmapped indexes offer one appealing approach.

The specialized databases enjoy an advantage, but everyone is innovating. Suppose a dimension table for customers has 2 million records, and gender is a column, and further assume that the possible values for gender are M or F (seriously, in some databases, there can be more values, representing "unknown", "not given", or even NULL). These "low-cardinality" variables are perfect for bitmap indexing.

Instead of storing two lists, one with all of the RIDs where gender = M and the other where gender = F, consider two strings of bits 2 million bits long-one for M and one for F. Each bit position is set to either 1 or 0 to represent the gender for that row. I can now have the entire 2-million-record table indexed for the gender column with just 2 million bits, or approximately 250 Kbytes-a very small index by today's standards. Comparisons and join operations can be reduced to bit arithmetic, yielding dramatic improvements in query speed, often in the range of one or two orders of magnitude.

The drawback is that as the cardinality of the values in the attribute increases, so does the number of 2 million-bit strings in the bitmap. Therefore, if the attribute can assume any of, say, 30 values, the size of the index will grow to almost 7 million bytes. One compensation is that as the cardinality increases, each bit vector becomes sparse, meaning it is filled mostly with zeros. Compression techniques, like run-length encoding, can squeeze much of the space out, but most implementations of bitmap indexing work best with low cardinalities.

Help Is On The Way

Improvements in query optimizers and indexing techniques offer excellent opportunities for improved performance of databases for decision support. The specialized databases continue to enjoy an advantage over the merchant databases, but everyone is innovating. In the next 12 months, every vendor will introduce product enhancements in this area, raising the bar a few notches.

It is increasingly important to evaluate these technologies thoroughly, making sure that the new features are integrated with every other feature. For example, high-speed loading processes need to be tightly integrated with storage management, referential integrity, and indexing processes. In most cases, they aren't there yet.

Next week, I will examine two promising areas: parallelism and core technology. In the former area, the specialized database vendors may be at a disadvantage.

------------------------------------------

Neil Raden is president and principal consultant of Hired Brains Inc., an international data warehousing consulting firm with offices in California, Chicago, Toronto, and New York.   He has authored dozens of magazine articles, books, and white papers on information technology implementation, decision support, data warehousing, and OLAP, and is a frequent speaker on these topics. He can be reached at nraden@hiredbrains.com .


homeoff.gif (237 bytes)  servoff.gif (356 bytes)  knowon.gif (598 bytes)   compoff.gif (406 bytes)  contactoff.gif (405 bytes)

Copyright © Hired Brains Incorporated, 2002. All rights reserved.
Webmaster@hiredbrains.com