![]() |
Articles |
|||||||||||||||||||||||||
The right storage selection is critical for high performance of your data warehouse By Neil Raden, Archer Decision Sciences and Michael Peterson, Strategic Research Corp. The most rapidly-improving and most often-overlooked area for performance improvement in data warehousing is storage. No longer just disk and tape drives, storage arrays have evolved into complete, intelligent I/O channels and subsystems, every bit as sophisticated, complex and expensive as the other components in the data warehouse. Ironically, while data warehousing routinely sets and breaks the high-water mark for large databases, discussions about the relative strengths and weaknesses of storage options is strangely absent in the industry. Neither in the literature, nor from the mouths of the "gurus" at the data warehouse conferences, references to storage remain something of a mystery. From a practitioners perspective, though, it is a real enigma. We see storage as key to achieving performance. Do you? And do you understand the issues in selecting the right architecture for data warehousing? Do "write-back caches" or read-ahead caches improve or degrade data warehouse performance. Are higher capacity drives a benefit, or is the decrease in the number of separate devices a drawback? Is RAID necessary? If so, at what levels? Can a data warehouse be backed-up? Is the movement to centralized, multihost, shared storage, led by EMC Corp., allowing a single array to service mainframes, minis and servers, a boon or a bust for warehousing? And when it comes to cost, should capacity be added, well in advance of the need or just-in-time? If there were only a limited number of options, or if the technology were relatively static, or if the economics of storage were stable and predictable, it would be understandable why so little ink is devoted to the topic. However, none of those conditions apply. Storage options can span a price range of an order of magnitude or more. The prices of storage are both dropping and climbing, simultaneously, at an alarming rate. Current prices today range from $2.00 to $.20/MB of storage, and will probably drop 50% in the next year or two, especially at the low end. At the high end, specialized disk array storage systems, complex multi-processor computers with associated specialized hardware and software, are climbing as features are piled-on. In many cases, storage is the single largest line item in the data warehousing budget. Most importantly of all, the wrong choice can damage or even doom the entire project! In many cases, storage is the single largest line item in the data warehousing budget. Most importantly of all, the wrong choice can damage or even doom the entire project! Despite all of the alluring, potential benefits of high-end disk arrays, they are very costly compared to purchasing single disk drives and an enclosure. Yet, there is much more than just out-of-pocket cash involved. In information economics, you dont pay for the level of functionality over time, but rather the rate of change of the functionality.. With a high-end disk array, you are paying for the ability to grow and scale the solution without rearchitecting it every time you need to add incremental performance or capacity. Long-term success in the data warehouse arena is associated with high performance, scaleable, intelligent, and modular storage solutions, not "bit-buckets". This article discusses the implications of the "enterprise storage" market for data warehousing specifically. Though certain terms are introduced within the context of a particular requirement, this is not a technical review of all aspects of storage per se. For a good review of the details, including mirroring, duplexing, RAID, striping, latency, MTBF, failover, transfer rates, read-ahead cache, write-back cache and all of the other aspects of the physical media, see Paul Massiglias article1. Is Storage Important? Storage really is important and has many benefits to the business of operating a data warehouse. Storage not only holds the data, it is a key component of the architecture, enabling enhanced performance, uptime, and access to information.
Database performance, especially in data warehousing, is poorly understood. Even the basic principles have yet to be rigorously proven. The only working benchmark is the TpcD test, and there is plenty of controversy about whether it is at all representative of most types of data warehouse applications. Furthermore, few vendors offer storage solutions optimized for data warehousing. With these few exceptions, the market is full of general purpose disk arrays. The tuning characteristics of warehouses, taken together with the connected components of the data warehouse architecture (servers, databases, networks, clients, middle tiers), present a complex problem without broadly known rules-of-thumb. Where do we start? The first step is to examine the peculiar requirements of data warehousing and to develop a of list critical considerations. This provides at least a rudimentary locus of evaluation of different approaches. Next, the context of performance must be framed. Storage performance in data warehousing is not achieved in a vacuum. The issues of availability, data protection, and cost run in parallel to performance. The next article in this series will deal specifically with data protection strategies for large data warehouses. Data Warehouse Requirements Data warehouses can exhibit performance and usage characteristics across a wide spectrum. Some are read-only, and serve only as a staging area for further distributing data to other applications, like data marts; others are read-only, but are designed for fast, on-line query processing, often very complex in nature. And, of course, there are the "write-only" data warehouses, behemoths that serve no real purpose except for bragging rights about having the biggest database on earth, but we arent concerned with those! Other data warehouses contain elements of OLTP systems, though most purists deny this is the case (such as planning and forecasting, where "scenarios" are created interactively and saved, or catalog retailing systems that profile customers as they call in, then capture the interaction in real-time). Whether you call them data warehouses or not, they exhibit different performance characteristics and demand designs optimized to their purpose. In some cases, all of these elements are present. The most common situation is a single database that is designed to support "canned" queries, which allows for a certain amount of tuning, but is combined with open, ad hoc access by a smaller number of "data surfers." What we can generalize about data warehouses is that they are large, they typically deal with data in very large chunks, usage patterns are only partially predictable and everyone involved with them is emotional about performance. Beyond these ambiguous characteristics, two features of data warehousing define operational performance: query performance and load performance. Two other elements that influence the purchase decision are cost and system cache. These are discussed to complete the picture. Query Performance Data warehouse query activity follows certain patterns. Obviously, activity is dependent on the concentration of users within time zones: if there is a large concentration in a few time zones, there will be peaks at certain times of the day. Also, usage is greatest immediately following an update. Beyond that, certain time cyclicities are evident depending on other factors: financial databases have their heaviest use either just before or just after the close cycle, or both. Promotional planning is an annual process, and usage rates are highest in the planning stages. New product launches can ratchet usage up a few notches, too. Data warehouse queries are complex. Instead of finding a needle-in-a-haystack, these queries can range over the entire database, often in multiple passes, assembling temporary sets that are joined or merged for comparison. Derived values are calculated for thousands or even millions of records per query. The demands placed on the database server stress the CPU, the memory (and the bus) and the I/O system. If one piece is overworked, the rest perform sub-optimally. The I/O system is comprised of many elements; including the bus, memory channel, drivers, adapters, interfaces, controllers, and disk subsystems. When talking about storage, it is incorrect to assume that the disk subsystem is all-inclusive. High performance disk array architectures include all these components outboard of the server bus. Assuming there is adequate CPU and memory compute power available to process queries, the disk array plus the I/O channel bandwidth hold the keys to improved performance. Databases make extensive use of temporary space to perform sorts and aggregations during parallel queries, particularly in cases where multiple queries are executing concurrently. The good news is that the disk space allocated for this does not need to be saved, therefore it can be dropped from the database and recreated in case of device failure. It is also evenly split between read and write activity, and performance in this area is extremely critical. The use of RAID 1 and 5 devices for this space is a poor choice because of cost and performance considerations. Sort and aggregation performance is a critical performance issue and a good fit for adaptive RAID and volume management techniques. The idea is to automatically allocate temporary space to a fast-write configuration such as RAID 1 (no redundancy is required because this is temporary data), leaving the rest of the data protected by redundancy (RAID 1 or 5 type configurations). Load Performance Data warehouses run all of the time. Even if the people who use them are asleep, their queries are as busy as the cobblers elves. This isnt the way it was supposed to be. Data warehouses were supposed to be "quiet" during the day (meaning, no updating going on), and refreshed at night, when they were closed for business. In fact, many warehouses were (and still are) designed to be refreshed only weekly, monthly or, in some cases, quarterly or annually. That is no longer the general case. Decision support systems based on data warehouses are dynamic tools whose volume sizes have increased dramatically. Two years ago, a "large" data warehouse housed about 10-20GB of raw data. Today, large projects involve a few hundred gigabytes (the multi-terabyte data warehouses are far less common than most people expect). The amount of raw data is only the beginning. Expect the total size of the database to grow four to eight times the size of the raw data. Add to that working space for complex queries, which is substantial, development and test environments, staging areas for data-in-transit, and much more pre-processing than there was even a year ago to accommodate the large load volumes, and it becomes obvious that even 100 GB of data can require ten times that much storage (without "redundancy"). In many cases, updates can involve tens or even hundreds of gigabytes in a single load. Massive replacement of database records is usually an inefficient process, so partial or even full reloads are often performed. These amounts of data moving from one machine to another across a network can soak up all of the available bandwidth, causing other, concurrent processes to slow down. Load performance is crucial in these large environments. The new enterprise storage systems with their dedicated storage area networks offer some relief to this problem, as the following copy from one vendor demonstrates: "EMC DataReach is first-of-its-kind software that makes information stored in mainframe-resident applications easily accessible to open systems, around which many data warehouses are built. Using EMC DataReach software, system administrators are able to quickly extract, transfer, and load MVS/ESA DB/2® files into Oracle® databases -- without any impact to mainframe application performance or network resources. EMC DataReach is a key component for refreshing data warehouses as rapidly and frequently as business requires." Cost Considerations Costs per unit of storage are at an all time low, and market forces are driving the prices down at an increasing rate. It is common to find prices in the $.25/Megabyte range for JBOD disks in simple non-fault tolerant enclosures to >$1.00/Megabyte for high-end storage subsystems provided by vendors such as Clariion, MAXSTRAT, and EMC Corp.. Two factors are at work forcing the price of storage down even further. First, continuing advances in storage technology increase the density and speed and devices without increases in cost. More importantly, however, is the effect of economies of scale: the demand for more and more increasingly large databases is driving the demand for drives up, leading to an overall increase in manufacturing capacity and lowered unit costs. Based on these factors, many industry experts are predicting that costs will drop to $.05/megabyte for JBOD disk subsystems by 2000. That represents a decrease of roughly 50% every year. Regardless of the cause of declining disk prices, the maturing data warehouse in a organization exhibits rapidly increasing volumes of both data and concurrent users. Following this line of reasoning, it is obvious that the incremental disk space required in two years, is much greater than the total amount needed right now. For that reason, coupled with the dropping prices of devices, it is clear that having excess disk capacity is a very expensive resource, especially when viewed over time, not at current prices. This means that a storage system that allowed you to add storage in increments only when needed could save you over $50K per year assuming you were able to defer the purchase of 50GB. A fully mirrored database that grows from 40GB of raw data to 200GB in three years, sustaining the storage capacity and throughput requirements, will require an I/O subsystem with roughly 700GB of space. This occurs because the combination of data space, doubled for mirroring, sort space, extract space, temporary space, etc. creates demand for over 3 times the actual data size. With disk prices dropping at a rate of 50% per year, staging the increments over the whole period can save well over half of the cost. That is the simple view, forgetting associated network and management costs. The actual cost savings are greater. Remember that you dont just buy disk drives. The proper purchase is to originally buy a scaleable, modular system that can grow in small increments with your requirements, including provision for increasing bandwidth as well as capacity. Caching Many of the advanced systems employ a series of caches to reduce the number of slow mechanical "seeks" of the disk drive heads, instead reading directly from volatile memory at bus speeds. The most common of these caches is the "read-ahead" cache, where the storage system gathers contiguous data when reading from the drives. While this is fine for serial unloads, the random nature of most data warehouse queries is poorly served by this approach (and it is of no benefit for serial loads). The troubling aspect of read-ahead disk caching is that a fully-architected data warehouse environment has caches in many places. Many of which are redundant and working against one-another.. Figure 1
In this figure, a typical application is depicted, consisting of a client workstation connection to a middle application server, which draws data from a database server, which is connected to an I/O device that physically houses the data. Here are some possible scenarios, describing how cache works:
The alarming part of this whole process is that:
Recommendations for achieving Performance in Data Warehouses A principal we began with is that there are three legs to this stool2: performance, availability, and data protection. All three are required to achieve a highly accessible, productive application. Take the case of a 200 GB raw data system, with a 700 GB storage requirement. It uses on the order of 175 - 4GB discrete disk drives. In a system with such a high component count, device failures occur biweekly, on average. A fault tolerant3 disk array capable of hot swapping all components is required as well as a strong service relationship with your supplier. This relationship is critical and a very important consideration in the purchase decision. Fault tolerant arrays have another benefit. It is particularly unfortunate that many device failures occur in the "infant mortality" period, which usually happens on system rollout, causing schedule slippage, customer frustration, and a poor initial impression of the system. An advantage of purchasing a fault tolerant system with a good service arrangement is that you can ignore the infant mortality problem. Because failures are detected and automatically swapped out to hot spares, you do not have to burn in the system, allowing you to get into production faster. Time is critical at this stage of application development and any time saved is valuable. Figure 2 summarizes the key features to look for in selecting a storage system to support a data warehouse. Buy a fault tolerant-class system based on a modular architecture that scales I/O bandwidth as well as disk capacity in small increments. Purchase only as much disk as you need since pricing is falling so fast. Select fast (Ultra SCSI, FC-AL, or SSA) disks for maximum I/O performance. Invest in RAID for availability and data protection as well as read performance. Use a volume manager or adaptive RAID technology to load balance different database activities. Make sure there are multiple disk controllers as well as room for many network connections using high speed interconnects such as Ultra SCSI, Fibre Channel, SSA, or HIPPI. If you need 1 GB/s transfer rate and are using Ultra SCSI drives, you will also need many Ultra-SCSI host interfaces operating in parallel. The way this works is this. First, assume a 9.1 GB disk has an average transfer rate of 8.5 MB/sec. Also, assume you are using VERY efficient command tagged queuing on the SCSI bus AND you are multiple requesting with asynchronous queued I/Os outstanding AND that they are evenly balanced over all the data disks perfectly, AND you daisy chained no deeper than 4, you might sustain 30-32 MB/SEC over an Ultra-SCSI bus. In real life no way! Transfers at that speed typically require multiple hardware striped raid controllers. So, the solution is multiple disk controllers for performance, availability, and to allow the write cache to be mirrored for protection. Other things to consider are to not mix database applications on a single multi-host storage server. Keep them on separate storage servers if possible because of the differences in requirements of write-intensive operations versus read-intensive. You should also invest in management tools. Fault tolerant, enterprise class disk array systems should come with centrally administered device management tools that allow remote monitoring and administration of performance, availability, utilization, and errors. Figure 2
Which RAID levels should you use? Table 1 discusses the comparison of how RAID levels and other array features apply to a traditional OLTP application versus a data warehouse. RAID is appropriate for a data warehouse for availability, data protection, and performance reasons. A combination of strategies can insulate the system from individual device failure. The read-mostly nature of data warehousing is compatible with all levels of RAID and the write performance overhead can be mitigated by sufficient write cache. In the case of disk mirroring, the only real penalty is cost, and the benefit is reduced access time with improved availability. Table 1
Summary Data warehouses no longer exist in isolation, or at the end of a unidirectional flow. But, just as the tuning characteristics of the relational databases and the parallel servers were originally developed around OLTP applications, most enterprise storage array systems have the same roots. Do not assume that gains achieved with OLTP applications can be duplicated with data warehouses. Also, be especially careful about consolidating data storage into a single "hotel", which can have a beneficial effect for the entire organization, at the expense of a single application, particularly the data warehouse. Storage performance is a network problem, not just a disk problem We suggest investing in high-end fault tolerant storage systems, not dirt-cheap bit- buckets because they can be optimized for data warehousing applications. Database administrators are now learning what mainframe and network administrators learned years ago when they solved network performance problems. Its all about the network. If you want storage performance, build sufficient I/O bandwidth to get the data to the compute resources and use enough disk drives and controllers in parallel to supply the data. 1. Massiglia, Paul. "RAID Revolution: The Dawn of Enterprise Storage." Database Programming and Design, June, 1997: 42-49 2. Ref: 'RDBMS Data Protection Practices", Strategic Research Corp. 6/97at http://www.sresearch.com/105461.htm 3. For definitions of classes of arrays, see the RAID Advisory Board, RAB, at http://www.raid-advisory.com/
------------------------------------------------------------------------ Neil Raden is the President and founder of Hired Brains, Inc., a consulting company with offices in California, Chicago, Toronto, and New York, that specializes in implementing Data Warehouses and Business Intelligence systems. He is a practitioner, a well-known authority in decision support and information delivery, widely published and a frequent speaker on these topics. He can be reached at nraden@hiredbrains.com or 805-886-8892. Michael Peterson is President of Strategic Research Corp., a Santa Barbara CA. based market research and consulting firm specializing in network storage and storage management. Michael is widely recognized as the leading industry analyst in these markets. Keep up with his publications and activities at http://www.sresearch.com. Mr. Peterson can be reached at 805-569-5610. |
||||||||||||||||||||||||||
Copyright © Hired Brains Incorporated,
2002. All rights reserved. |