Articles

homeoff.gif (237 bytes)  servoff.gif (356 bytes)  knowon.gif (598 bytes)  compoff.gif (406 bytes)  contactoff.gif (405 bytes)

arton.gif (326 bytes)

presoff.gif (457 bytes)

outoff.gif (411 bytes)

Boost Warehouse Performance

Technology Tutorial Part II

By Neil Raden

In last week's article, I covered two key technologies to look for in order to maximize the performance of your data warehouse: query optimization and advanced indexing. This week, I examine two more performance-enhancing technologies: core engine optimizations and parallelism. u Several areas of a database engine can be optimized for decision-support performance, including support for the multidimensional data model, load processing, and calculation and query processing. Clearly, database vendors must provide functionality that is specific to decision support or online analytical processing (OLAP)

But what is not clear is how much functionality is needed. After all, the relational/OLAP tools, which augment relational databases by providing OLAP capabilities, client interfaces, and metadata tools, perform very well for data warehouses. The multidimensional databases also provide OLAP functionality and duplicate many of the data management functions of relational databases.

Organizations that use relational databases to house their critical operational systems, assuming that it's more cost-effective to support fewer technologies, have a built-in bias to standardize on one product for both their operational database and data warehouse. Though this thinking is easy to refute, its intuitive appeal often suppresses any argument. The net effect is that the major relational database vendors are being asked to add technologies that specifically support data warehousing, and the vendors now feel a compelling reason to instill this functionality in their products. They're doing just that, albeit slowly. Core database engine optimizations and parallelism are two areas where they need to do the most work.

Query optimizers that work with the star schema (a multidimensional data model expressed in relational tables) and indexing that exploits that model are good first steps, but there is much more that needs support. Multidimensional models by nature imply certain kinds of operations, such as aggregations, calculations, and time intelligence, that are not supported by the relational database engines.

Aggregation

Since almost every query asks for some summarized data, aggregation, which is pre-summarized data, is the single most important technique for improving performance in data warehouses.

The success of your aggregation strategy depends on selecting the right combinations of aggregates, total database size, and relative query cost (see chart, p. 80). Note that the first few aggregates increase the size of the database dramatically, but substantially improve query response time. But additional aggregations increase the database size without delivering better query performance. With experimentation, you'll find the right number of aggregates to get the best bang for your buck.

pktmp000.gif (11931 bytes)

You need to experiment with the number of aggregations of data.  Note that the first few aggregates increase the size of the database dramatically but substantially reduce query response time.  Additional aggregations increase database size without delivering better query performance.

DATA: ARCHER DECISION SCIENCES

What the graph shows is that only a small subset of the possible aggregations in a model will provide performance gain in queries, and identifying that subset is difficult since it can change.

When the composition of the aggregate tables changes, users need to rewrite queries to take advantage of the new aggregations. What all the databases lack, unfortunately, is automatic aggregate optimization. Currently, aggregate optimization is fairly basic and handled almost exclusively by nonrelational products, such as DSS Agent from MicroStrategy Inc. in Vienna, Va., and Intelligent Warehouse from Hewlett-Packard. Red Brick Warehouse from Red Brick Systems in Los Gatos, Calif., supports automatic aggregate updates in its load process. Informix will incorporate the functionality of its newly acquired MetaCube in a future release of the Informix database. This will provide an aggregate advisor that suggests creating aggregate tables based on the schema design and usage, and rewrites the SQL at run-time to take advantage of the aggregates. Expect other vendors to follow.

Calculations

A common operation used in business analysis is the moving average, which finds trends over time. The calculation rules for a moving average are different from a simple average because the moving average requires that the data be in order-something that can't be done directly in SQL. Other needed functions not readily supported in SQL are rankings, n-tile orderings (top 25%, bottom third, etc.), and common statistics, such as variance and standard deviation. Clever workarounds may suffice to simplify the architecture, but without native support for the calculations, performance suffers.

Once again, the databases designed for data warehousing address most of these issues through extensions to SQL. But most organizations prefer a general, portable solution that uses a common set of extensions. With the increased focus on analytical processing, expect to see some progress in this area in the next few years.

Almost every data warehouse that uses dimensions has time as one of those dimensions. In fact, time is one of the distinguishing characteristics of a dimensional model. Since the functions relating to time are so pervasive in decision-support applications and since standard SQL is limited to date arithmetic and date comparison, the database used for a data warehouse needs to be able to perform more complicated time calculations, including time aggregation, calendar conversions, time transformations, and automatic conversion in calculations with data having different periodicities (for example, comparing daily sales to the monthly budget)

Time aggregation is usually simple. Since days always roll up into weeks and months, months into quarters, and quarters into years, these aggregations should be automatic. Real-life time aggregations are more complicated. For example, which weeks roll up to which quarters? Is the first quarter the calendar months of January, February, and March or the first 13 weeks of the year that begin on Monday? Weeks don't cleanly roll up to years since a calendar year is one day longer than 52 weeks (two days longer in leap years). Calendar conversions and time transformations can solve this problem by creating core functionality to deal with these problems.

Another clumsy area in time calculation is the difference between calendar and fiscal periods. This can be exacerbated when the warehouse includes data from multiple organizations, each of which have slightly different calendars.

Different time periods for the data elements in a model introduce yet another complexity. Using relational databases and SQL to perform calculations with different values adds complexity to the design and the development cycle and degrades performance. Many time-series products, such as Lightship Server from Pilot Software Inc. in Cambridge, Mass., deal with time-related complexities.

Red Brick Warehouse and Sybase IQ are examples of relational databases that provide specialized solutions for decision-support applications. Red Brick makes no apologies for its lack of conventional transaction processing features; it concentrates all its energies on data warehousing. Sybase chose a completely different tool, Sybase IQ, and integrated it into the Sybase catalog. Informix, Oracle, and IBM chose to add data warehousing capabilities to their database engines. This creates some difficulties in integrating those features into a database designed for transaction processing.

For example, in transaction processing, the data being updated is locked to provide a consistent view. In data warehousing, locking is not necessary because the database is typically read-only.

Also in transaction processing, to ensure accuracy in the database, updates are applied to a temporary file; only when the transaction is complete are they applied to the operational database and the lock released. This process creates a great deal of overhead, including transaction logs, rollback logs, and more-none of which is necessary in a data warehouse.

As a first step, the databases need to selectively relax some of these constraints to provide better performance. The problem: these features are so pervasive in the database engine that disengaging some of them may require redesigning the whole engine.

If you extrapolate this factor, it reveals an even larger stumbling block, which is at the heart of the merchant database vendors' strategies. Relational database schema designed for transaction processing can never perform as well as specialized schema, like the star, for decision support.

To transparently provide transaction processing and decision support in a single view, database vendors are developing replication schemes to transpose online transactions into decision-support schema in real-time, hopefully using their existing re- plication strategies and tools.

The multidimensional database vendors are betting this problem won't be solved soon, and I agree. The entire fabric of transaction logging, locking, and replication in its present form requires too many computing resources to be cost-effective for a data warehouse, especially with the relational/OLAP and multidimensional databases adding capabilities and raising the bar.

Load Processing

Everyone is sensitive to performance issues, especially people who pay for these products. That's why tool vendors turn their attention first to those processes the user sees. Multidimensional databases are often sold on the merits of their query processing.

This is a dilemma, because query processing may get all the attention to the detriment of load processing. The multidimensional databases are vulnerable in this area, too. They may provide extremely good response time for interactive queries, but load times may exceed the available window of time. That's why specialized load processing for the large scale of a data warehouse is such a critical feature.

Here is a list of functions that the database should support to provide load processing that works for data warehousing:

  • Reorganize the data from external flat files to match the relational schema of the data warehouse;
  • Perform data type conversions, such as EBCDIC-to-ASCII;
  • Check incoming data for referential integrity;
  • Update indexes;
  • Calculate and update aggregates. This often implies an UPDATE, not INSERT. Note that many bulk loaders do not support UPDATE, and none of the merchant databases supports aggregation;
  • Write data to physical storage based on the predesigned segmentation, device placement, balancing, striping, etc.

Creating an aggregate table may require a GROUP BY operation on millions of rows in a single query. Clearly, a database engine optimized for transaction processing will most likely not have algorithms that can scale adequately. Performing 2,000 GROUP BY operations a second, each involving a handful of records, does not imply that the engine can do a handful of GROUP BY operations per second, each involving 2,000 records.

All the basic algorithms in a relational database are subject to this scalability problem, including GROUP BY, ORDER BY, DISTINCT, IN/NOT IN, and UNION. For database vendors to successfully provide an engine that can handle real-world data warehousing, they will need to apply a great deal of effort to improving these core algorithms.

Parallelization

Parallel computing, even in the narrow focus of data warehousing, is such a broad area that I can cover only some of the basics here. Essentially, parallel computing improves performance by breaking up the task and working on the pieces simultaneously. Performance is improved either through speeding the process (doing the same amount of work faster) or scaling up (doing more work in the same amount of time). In most cases, a data warehouse needs both types of performance gains over its life cycle.

There are various flavors of parallelism, and the field is rife with colorful descriptions, such as symmetric multiprocessing; massively parallel processing; shared-memory, shared-disk, and shared-nothing; and cache-coherency. Each implies different problems and approaches, but one thing is certain: As the technology matures, the boundaries of these different approaches will begin to blur. It is reasonably certain that the specific parallelism used will soon become transparent to the purchaser.

Though all the major database vendors support parallelism to a certain extent, their implementations are imperfect and evolving; the technology is not mature. Look closely at a vendor's implementation; the operative phrase is 'caveat emptor.'

For vendors to say they support parallel queries is easy, but what exactly do they mean by that? Do they support parallel SELECT, INSERT, DELETE, UPDATE, and SORT? Under what circumstances? How exactly are queries parallelized?

One approach is to spread executing queries across available processors, each query allocated to a single processor. This technique, known as interquery parallelism, is effective for transaction processing.

But decision-support queries are generally larger and more complex and can benefit from being broken down into subtasks which are executed in parallel. This is called intraquery parallelism. Not every database supports intraquery parallelism, especially in shared-memory models. First, it is far more difficult to manage, and second, the implications for the query optimizer are vast, since intelligence about parallelism must be built-in to an already-complex process.

In another compromise, Oracle will not parallelize a query unless there is at least one table scan.

This is an odd situation, especially in a complex query in which using an index can find the data more effectively than doing a table scan, and the query is still large enough to benefit from parallel execution.

Another problem occurs when the query optimizer spreads a massive query across all processors but makes no provision for the other queries either executing or waiting in the queue.

This is a common problem in parallel systems, which I refer to as the latency-versus-throughput conundrum. The solution is better resource monitoring and management tools, as well as better query optimizers.

How are queries actually parallelized? Let's make some simple assumptions to illustrate how intractable these problems can be.

Assume that a query requires some table scans, the joining of tables, and a sort operation on the result. In a uniprocessor system, the query optimizer would decide on the join type and order and select the appropriate index for non-scanned tables. The engine then would scan the tables, perform the joins, and sort the result, one step at a time. In the diagram at left, this serialization of the three steps yields the longest elapsed time.

One technique for more efficient parallelization of the query is to start the join step before the scan is complete and to start the sort step before the join is finished. This technique is known as pipelining.

How Parellel Queries Affect Performance

pipeline.gif (9626 bytes)

A query's table scan, join process, and sort are normally executed serially.  But a database that supports parallelism can overlap or pipeline those operations to speed up the query.  Partitioning the query into smaller subtasks and pipelining those subtasks produces the best performance.

DATA: ARCHER DECISION SCIENCES

 

On its own, pipelining provides no real intraquery parallelization and assumes that the output of one process, such as the scan, is immediately available as input to another.

This is not a given in most systems. Still, the more the database engine can overlap the three processes, the faster the query will run.

A third approach for parallelizing queries is to partition the query and allocate separate processes (which can be operating-system processes, threads, or database threads) to physical partitions of the tables.

There are variations in this approach, too-especially in the way the partitions are devised. Two methods are broadcasting and symmetric partitioning. In the first, one of the tables in the join is divided in an arbitrary way, and all partitions are applied to all of the rows of the second. In the latter case, the first table in the join is partitioned based on the join key-(all January data, then February, etc.)-hence the name symmetric.

Effect Of Skew On Performance

pipelin2.gif (9945 bytes)

Almost all the benefits of partitioning are lost when one partitioned process takes longer to execute that the others.

DATA: ARCHER DECISION SCIENCES

 

Neither approach offers better performance in all cases, but each carries implications on the types of joins that can be accomplished, such as outer joins. The diagram on p. 87 shows that theoretically, a pipelined and partitioned query could deliver the fastest query-response time. But also notice that the horizontal axis is labeled "Complexity." Flawlessly and efficiently executing these queries every time, especially with multiple queries processing simultaneously is, sadly, beyond the capabilities of all the released products.

One particularly difficult problem for the RDBMS to sort out is load balancing. What if one partitioned process has to do more work than the others? Since this is a common problem, it already has a name: skew.

In the diagram above, it takes only one process with more work to do than the others to eliminate almost all the benefit from the partitioning scheme.

Data warehouse load processes are good candidates for parallelism, and most databases have made excellent progress in parallelizing the loading of data and creation of indexes. Referential integrity checking, automatic maintenance of aggregations, and even the submission of the load job itself are areas that still need work.

Closing The Gap

To be successful in a field mastered by multidimensional databases and relational/OLAP servers, the relational database vendors need to perform some fast gap analysis and beef up their products accordingly. Unfortunately, they seem to have a better handle on the implications of this business opportunity, rather than on the complexity of the task.

------------------------------------------

Neil Raden is the president and founder of Hired Brains, Inc., and international data warehousing consultancy with offices in California, Chicago, Toronto, and New York.  He has authored dozens of magazine articles, books, and white papers on information technology implementation, decision support, data warehousing, and OLAP, and is a frequent speaker on these topics.  You can reach him at nraden@hiredbrains.com .

The author, Neil Raden, pictured with some formerly famous celebrities.


homeoff.gif (237 bytes)  servoff.gif (356 bytes)  knowon.gif (598 bytes)  compoff.gif (406 bytes)  contactoff.gif (405 bytes)

Copyright © Hired Brains Incorporated, 2002. All rights reserved.
Webmaster@hiredbrains.com