![]() |
Articles |
|||||||||||
By Neil Raden Teradata's unique approach to data warehousing calls for front-end tools to be evaluated in a different way. Teradata is different. Unlike "typical" data warehouses--especially those constructed from off-the-shelf tools not specifically designed for that purpose--Teradata warehouses are more often than not tactical point solutions. They incorporate the abilities to store and query vast volumes of data, update the data warehouse in real time, and connect with operational systems. This difference is reflected in the reporting and query tool industry: Although a great deal of market attention focuses on low-end tools for data marts, the most dramatic returns on investment usually occur at the top of the food chain, a realm dominated by Teradata. These closed-loop decision-support applications include the well-known giant-retailer continuous replenishment systems, airline yield-management systems, and many variations of database marketing management systems. What differentiates Teradata from other RDBMSs is its unmatched ability to scale across the entire range of decision-support system (DSS) applications--from simple fixed reporting (though "simple" may actually imply access to large data volumes by thousands of users) to OLAP to data mining to hybrid or closed-loop decision support. There is, however, a downside here: Most of the information about choosing front-end tools for data warehouses is at least slightly "off" when Teradata is concerned. Certain approaches popular with the "merchant," or general-purpose RDBMSs, fail to take advantage of Teradata's unique set of capabilities. Rather than survey the entire range of tools available--a vast category that includes practically every product that can communicate with an RDBMS--I'll focus on matching Teradata's strengths with certain classes of tools in a few specific types of applications. First, I'll describe those elements of Teradata that make it unique from, or at least significantly better than, its competitors. Next, I'll divide the decision-support space into a few categories and describe some typical applications. Then I'll match certain selected tools with these applications and explain their strengths and weaknesses in a Teradata data warehouse. Finally, I'll describe how Teradata is changing to adapt to the market now that DSSs are so popular. Like any other tool, Teradata isn't always applied to those tasks for which it is best suited. In conversations with vendors, clients, and analysts, I found many cases where Teradata was either being throttled by inappropriate tool selection or, conversely, where the application didn't require the power of Teradata. In the latter case, a perfectly appropriate front-end tool for solving the business problem was underperforming due to a poor fit with Teradata. Naturally, you should avoid both situations. What's the Difference? In this article, I'll assume you're familiar with the unique aspects of Teradata's architecture; I'll provide only a brief over-view here. Teradata has spawned a lot of imi-tators but has maintained its advantage with an almost single-minded focus on multiprocessor scalability. The largest data warehouses, in terms of both data volumes and numbers of users, are built in Teradata for a reason: In addition to pure scalability, many Teradata customers cite its reach, or ability to span a broader range of applications than just query support by incorporating analytical and operational features. Teradata achieves scalability primarily through two factors: parallelization of processing and design fit. Unlike other RDBMSs that support every platform--from massively parallel processors (MPPs) to symmetrical multiprocessors (SMPs) down to single processor servers and workstations (some vendors are even planning to support palmtops, TV set-tops, and appliances)--Teradata is designed solely for the multiprocessor (SMP and MPP) environment. Every feature in the system is designed to exploit this ar-chitecture. Other RDBMSs have parallel capabilities, but only Teradata includes the complete set of parallel table scan, index scan, indexed read, sort, join, row redistribution, parse step, pipelining, catalog read, index creation, insert, delete, load, and backup-and-recovery features. This complete integration of parallel processing capabilities, however, ensures that there are no "small" Teradata applications. Without a fair number of system resources to leverage, including processors, memory, and storage channels and devices, not much gain is possible over more conventional approaches. Design fit is achieved only by spending R&D dollars on functions and processes needed to support large-scale data warehouses and DSSs. Decision support puts demands on a database in four areas in which OLTP-tuned databases perform poorly: joining tables, especially large ones; aggregating data; sorting data; and scanning large volumes of data. Consequently, the data warehousing discipline has adopted a set of workarounds to solve these problems (see Table 1). These techniques are now so common that it is easy to lose sight of the fact that they are, in fact, workarounds. As a result, the very nature of data warehousing has been redefined to fit only those capabilities that these approaches can address. That's good news for the merchant databases and the tools that love them, but it can create a gap in understanding about design fit. For example, the recently respectable star schema is now applied in most decision-support applications, but it is incapable of solving certain problems, such as market-basket analysis, economically. Creating aggregated data, keeping tables in sorted order, and maintaining indexes all impede the ability to update the data warehouse online, which is essential for certain types of applications.
Teradata takes different approaches to these challenges. Perhaps the most important point here is that Teradata's query optimizer is "content with a view." In other words, a normalized schema with a light index can exist at the physical level, providing a great deal of flexibility--especially for updating--but the interaction with the access tool can be controlled with an unmaterialized view. This approach provides the best of both worlds (OLTP and decision support) and lets you construct views with the access tool in mind. A major drawback to many data warehouse access tools is that they impose their particular requirements on the physical de-sign of the data warehouse, limiting the number of tools that you can apply. Because Tera-data's optimizer "thinks" through the view and builds the plan based on the physical schema, it provides two distinct advantages. First, there are often many ways to navigate the database in third normal form (3NF) designs--that is, many query paths--and it is possible for a well-formed query to return an apparently reasonable yet completely inappropriate (and wrong) answer. Being able to frame the question through a dimensional model (the unmaterialized view) preserves the advantages of the 3NF design without the accompanying semantic dissonance. Second, it lets the tool write SQL in the tool's dialect, then translates it to match the underlying schema. This capability preserves performance at the database level without sacrificing depth of analysis at the tool level. By concentrating on data warehousing and decision support, Teradata supports the left side of Table 1 directly, without work-arounds, through the application of parallelism, intelligent query optimizers, and core algorithms (see Table 2). Consequently, a certain question arises here: How do front-end tools--most of which are designed to work with Teradata's less powerful RDBMS cousins--take advantage of these unique Teradata features? Let's find out. RDBMS challenge: Joining (large) tables Teradata solution Integration of parallel features allows for fast joining of large tables as well as queries with many tables in a join; query optimizer cognizant of DSS query across range of physical schema Implications No need for a star schema at a physical level (though unmaterialized views improve usability); more "normal" schema designs supported RDBMS challenge: aggregating data Teradata solution Fast group-by processing through partitioning and DSS-scale algorithms Implications Ability to perform without aggregate data, allowing for application flexibility (such as online update, fast dimensional change) RDBMS challenge: sorting data Teradata solution Parallel sort built in Implications No need to pre-sort, therefore faster updates RDBMS challenge: scanning large volumes of data Teradata solution Parallel scanning Implications Minimizes need for indexes, enabling more OLTP-type updating SPECTRUM OF DECISION SUPPORT The range of applications supported by a data warehouse is broad, and subdividing it into categories is always a risky business. The edges of the categories tend to blur, and no single approach is a satisfactory model for evaluation in all cases. One useful approach, however, is to consider how the stakeholders of the data warehouse interact with the information. For example, you can evaluate whether they explicitly request, or "pull," information or whether it is distributed without an explicit request, or "pushed." The actual content may vary by time, but the logical content may remain invariant, such as with a year-to-date sales report by district. In some cases, people may actively explore the data, navigating analyses that are refined with each iteration; in other cases, people step aside and let the machine do the exploring--looking for patterns, aberrations, or clumping that may not be intuitively obvious. The concentric semicircles in Figure 1 depict this range of interactions with a data warehouse. The outermost circle is a designation of either "No Context" or "Context," an important distinction in the serial ergonomics of front-end tools. A static display of information, whether tabular or graphical, gives no clue about the steps that preceded it unless it incorporates this sense of context. With many tools, you can represent iterative passes through the data as tiled displays, in indented outlines, or with any number of creative visual clues that convey the steps taken. In the online interactive model, each step or question is related to the ones that precede it; without a "trace" capability, the thread that holds the interaction can be lost. For this reason, it is useful to understand how a tool will be applied. The actual data reported in a complex OLAP tool may be identical to that presented in a static report; it's the process that matters.
REPORTING AND ANALYSIS Two common uses for data warehouses are the creation and distribution of reports on a production basis and the interactive creation of reports. The difference between "fixed" reporting and "active" reporting rests solely in the way the report request is generated. Fixed reports are coded in advance and run at scheduled times in certain sequences or job streams. You can code them entirely by hand in SQL (thankfully, this approach is rapidly losing favor--a blessing for stakeholders as well as coders), or, more commonly, generate it with graphical tools and store it in the same manner as you would a hand-coded report. Active reports require a live connection to the data warehouse, typically through either a client/server architecture or, increasingly, through a Web browser. The requester can interact with the report at least three ways:
Because many business problems can now be at least partially addressed with these tools, hand-coded, optimized SQL is becoming a scarce commodity. However, most of these tools lack SQL generating en-gines with enough sophistication to address the more difficult questions. For that reason, many of them extract data from the database (often in frighteningly large quantities) and assemble it on nonrelational servers or workstations, even on the client itself. Almost every tool in this category operates in this manner. Clearly, this method is a waste of re-sources and a horribly inefficient way to solve a problem. Pablo from Hummingbird Communications (formerly from Andyne) and PowerPlay from Cognos operate in this manner, creating "cubes" (multidimensional data structures) on the client workstation or on a middle-tier server. Each of these vendors also offers a query tool--GQL and Im-promptu, respectively--that dispatches SQL to the server and formats the final result set, storing no data locally. Hummingbird GQL consistently shows up in Teradata implementations. Although the reasons are partially historical, (GQL was among the first point-and-click SQL generators to support asynchronous query, which I'll soon describe), GQL is able to generate reasonably complicated analyses, and Teradata doesn't seem to care whether the query is optimized or not--the query optimizer rewrites it anyway. Although these managed query tools are nicely suited to Teradata SQL rewriting capabilities, they lack most of the dimensional intelligence of the OLAP tools (more on this later). Response time is another interesting issue that always rears its head. Typically, response time is a measure of the elapsed time between asking a question and getting a response. The problem is that the "right" level of performance is mostly subjective, hard to anticipate, and harder still to explain. What is the value of a system that returns an answer in four hours? How do you compare a long wait for a question such as "How do I get rid of this excess inventory by the end of the week in a way that maximizes profit?" to an instantaneous answer to "How many chainsaws did we sell in Texas last year?" Enter the asynchronous query feature: the ability to launch a query interactively and immediately have control of the workstation to launch another, getting results as they finish. The "right" response time is usually a matter of perception (except in certain applications, such as air traffic control or pacemaker circuits), and asynchronous query support can extend the usefulness of a tool. This ability is particularly important with Teradata--which is at its best when answering difficult and interesting questions by mobilizing detailed data that other databases can only store. OLAP The difference between active report tools and simple OLAP tools is fairly obvious. OLAP implies a certain kind of interactive navigation through data, drilling up, down, over, under, and through. The "path" of this navigation goes through levels of hierarchies or aggregations of data. The context of interaction, however, is OLAP's distinguishing feature. Each result in an OLAP tool is a launching pad for more questions. No matter how "active" an active report is, it's still just a report--if you want to ask another question, you need to start over. Such is not the case with OLAP. Simple OLAP tools are becoming quite ubiquitous; the aforementioned PowerPlay and Pablo are two examples. What separates a simple OLAP tool from its high-end cousin, the complex OLAP tool, is that it lacks a host of features for solving more difficult problems, such as:
Tools in the complex OLAP category include the relational OLAP (ROLAP) tools from MicroStrategy, Information Advantage, and Platinum and the so-called hybrid OLAP (HOLAP) tools from Oracle (Express), Seagate Software (Holos), and Speedware (Media/MR). Not included in this category, however, are the tools that use the data warehouse as a storage medium only, such as the multidimensional databases from Ar-bor Software (Essbase, although it is moving rapidly into the HOLAP space), Pilot, and others; specialized applications; and statistical packages. One important aspect to consider when evaluating these tools is just how the OLAP engine goes about solving problems. Does it support outer joins, SQL-92, or the database's proprietary extensions to SQL? Or, in contrast, does it require you to develop SQL code by hand and embed it in the tool, either in metadata as text or directly in coded programs, scripts, and macros? Does it solve problems in steps, or all at once? There are two basic models for addressing these questions. MicroStrategy relies on the database exclusively for answering questions by generating SQL dynamically for every question. In a way, this approach is the most appropriate use of the power of Tera-data for many OLAP queries. Other ROLAP tools, such as Information Advantage and Platinum's InfoBeacon, process some data in the database and some in their own engine, and each allows for a certain amount of tuning of application partitioning; some HOLAP tools, such as Media/MR, use a similar model,. If you take the position that an RDBMS--even if it's Teradata--is not designed for every OLAP question, you would find this model a more appealing prospect. But in my mind, the jury is still out. The HOLAP tools all have their own scripting languages, or 4GLs, so regardless of how vigorously the vendors argue otherwise, they still require a fair amount of SQL handwork to achieve reasonable performance and functionality. But again, things are a little topsy-turvy in the Teradata world; what is usually considered a detriment in standard tools may be an advantage. Given Teradata's power (and its new extensions to SQL for decision support), a little hand-coding may go a long way. One last issue to consider is the use of temporary tables. Although MicroStrategy pioneered the use of multiple SELECTs and temporary tables for solving OLAP problems, many other vendors are quietly starting to copy this technique, even though they continue to harangue about it. Teradata is improving its support for temp tables in two ways in V2R3. Global temporary tables are predefined in a table template in the schema so that a session can use a private instance without having to generate data dictionary language (DDL). Volatile temporary tables still require a DDL but carry none of the overhead of permanent tables. (Red Brick and IBM have both added vastly improved temp table support in their current releases as well, a clear indication that the market has accepted the approach as a valid one.) Teradata is also extending SQL with operators such as ranking, top(n), bottom(n), running sums, cumulative totals, moving average, and linear regression. This fact also argues in favor of SQL-generating tools that can quickly adapt to SQL dialect enhancements. A tool that forces you to move data to its own server, even for this type of complex processing, may be a poor choice in a Tera-data environment. Closed-Loop Decision Support Technically, closed-loop decision support is an effect, not a product grouping, that de-scribes the interoperability of analytic pro-cesses with operational ones--for example, the ability to explore a customer database to uncover desirable (or undesirable) attributes and to execute an action, such as the creation of a mailing list, immediately. In our practice, we consider all decision-support systems incomplete unless they can "close the loop." Analysis of sales results is only useful to a point; the engine of value creation in DSS is linking the analysis to discussion, consensus, action, and results in a systematic way. This is a nice thought, but does it work in the real world? The concept of a data warehouse as a read-only repository of historical data is antithetical with the closed-loop concept. Perhaps closed loop stretches the definition of data warehousing, but it doesn't change the fact that it is a fertile area for leveraging the data warehouse investment. Unless the data warehouse application can access and update realtime data, there is still a gap in the loop. One approach is to access the data warehouse from an application but use another database for capturing the realtime actions associated with it, updating the data warehouse in short in-tervals (the so-called operational data store, or ODS). While this technique may have a certain architectural appeal, especially to those wedded to the read-only metaphor, it is clearly antiquated. The read-only approach inherits the shortcomings of merchant databases, which are unable to provide scaling and performance for DSS as well as they do for OLTP. With Teradata's ability to provide both DSS and OLTP features, closed-loop can be implemented from a single platform. Unfor-tunately, few packaged applications are available for this goal, so meeting it still requires "rolling your own." The bottom line is to avoid packaged applications that are not designed for Teradata and separate the DSS work from the operational work. This approach is particularly warranted for database marketing applications that separate the counting, profiling, and scoring functions (DSS) from the mailing list and campaign management functions (OLTP). Similar situations exist in supply chain management, logistics, replenishment, order management, and a host of integrated verticals. The Face of Teradata All the tools that take advantage of Teradata's unique capabilities share a few qualities. Here are some rules of thumb: First, favor tools that "celebrate the complexity" of decision support instead of trying to reduce it. As a result, you should select tools that do not transport data from the Teradata environment. You should also select tools that generate SQL from a business view, not a database view, and incorporate the ability to construct complex, multistep questions. Second, be sure that the tools don't compromise Teradata performance certain approach. And third, there are no canned or packaged products available for closed-loop DSSs, so if you choose to build an application with components, ensure those components make sense with the Tera-data architecture. -------------------------------- Neil Raden is president and founder of Hired Brains Inc., an international consulting company with offices in California, Chicago, Toronto, and New York that specializes in data warehousing and business intelligence. You can reach Neil at nraden@hiredbrains.com or (805) 886-8892. ------------------------------------------------------------------------ Teradata and Tensor This is Microsoft's year to enter the data warehousing market, and it looks like the next shoe will drop later in 1998 with the release of SQL Server 7.0, the Plato OLAP server, and a version 2 release of OLE DB, including OLE DB for OLAP (code-named Tensor). What exactly is Tensor, and what is its impact on tools for Teradata? First a little background. OLE DB is a set of COM objects designed to standardize access to all kinds of data. For example, the content of a PowerPoint presentation is rather different from the content of an SAP table, but at a certain level, accessing both has certain things in common. OLE DB allows a reader of data--a "consumer" in API parlance--to use these interfaces to access data; OLE DB providers expose their data by implementing these interfaces. As rumor has it, NCR will soon be offering Tensor-compliant access to Teradata data through a multidimensional database (MDDB) server called Teracube. This puts NCR on a collision course with the MDDB vendors--including Microsoft, with which it has just inked a technology partnership agreement. However, Teracube will initially be targeted at the high end, especially the parallel processing world, where Plato won't play for some time. (Could it provide the physical storage and access for a "virtual" Arbor Essbase database? Sure it could.) The existence of Teracube raises an interesting issue. Creating a Tensor-compliant "consumer" requires a brand-new language called MDX, which is just as unlikely to be embraced by the nontechnical masses as its relational cousin, SQL. It also poses real problems for vendors, who will have to create MDX-generators tuned for databases. But here's the dilemma: The combined license revenues of the two leading MDDB vendors, Arbor and Oracle, constitute perhaps one-tenth of the market for RDBMSs in the decision-support space. Where would you rather spend your R&D dollars: on perfecting a SQL generator, or on perfecting an MDX generator? Stay tuned. ------------------------------------------------------------------------ This article appeared in the Summer, 1998 issue of Teradata Review and reprinted here with permission. Copyright © 1998 Miller Freeman Inc. All Rights Reserved |
||||||||||||
Copyright © Hired Brains Incorporated,
2002. All rights reserved. |