The Database Column

Distribuir contenido
A multi-author blog on database technology and innovation.
Actualizado: hace 4 horas 52 mins

Field Fodder -- Compression in Real World Datasets

Mar, 23/09/2008 - 15:39

With database volumes growing exponentially (see this previous post) and CPUs far out performing disks (see this previous post), compression has become a hot topic among database management solutions. Just don't believe everything you hear about compression. Product marketing is ripe with claims of anywhere from 2:1 to 30:1 compression (Note: A ratio of 2:1 is equivalent to 50% compression while 30:1 is equivalent to 96.6% compression). While these ratios may be true for data cooked up in the lab, real world compression rates will vary dramatically depending on the data in your warehouse and how you load and query it.

Compression is very data dependent. It is expected that "your mileage may vary," but with compression the differences may vary by orders of magnitude. Analyzing applications across industries we have found databases are compressible to different extents.


Compression in the Real World

In financial services, stock market trade data for all US exchanges includes 250 days of data and 10,000 instruments each year. At 100 million trades per day (a conservative number), a dataset that records the date and time, instrument, price, and volume will use between 3GB and 10GB (uncompressed) per day depending on the representation (e.g., binary vs. ASCII). By one year, this raw data will be anywhere from 1-3TB. Using an off the shelf Lempel-Ziv (LZ) algorithm may compress it by 2:1.

In the telecommunications space, call detail records include information describing the call path, times, features, and switches. An average record may be 550 bytes long, and a regional telco may record 500 million per day resulting in uncompressed source data of 275GB per day (or 100TB a year). This data, also using off the shelf LZ, compresses by 5:1. These compression factors relative to raw data can vary dramatically depending on whether the data is preprocessed (e.g., encoding long names into ids, changing timestamps with time-zones into GMT, or replacing empty strings with nulls).  

In contrast, column stores often achieve up to three times more compression, reaching factors of 20:1 versus raw data by isolating variability in the data. In a column store, each block contains data of the same attribute and type, and sorted columns guarantee homogeneity even for trickle loads and high cardinality data. This also allows column stores to use more effective algorithms than vanilla LZ. 

Vertica's customers span many industries and applications, so we have been able to assess compression with a large variety of real world datasets. For example, looking at the compressibility of different datasets that customers load into Vertica, we see the following as typical compression ratios, relative to raw ASCII delimited data (e.g., comma separated values):

  • CDR - 8:1 (87%)
  • Consumer Data - 30:1 (96%)
  • Marketing Analytics - 20:1 (95%)
  • Network logging - 60:1 (98%)
  • Switch Level SNMP - 20:1 (95%)
  • Trade and Quote Exchange - 5:1 (80%)
  • Trade Execution Auditing Trails - 10:1 (90%)
  • Weblog and Click-stream - 10:1 (90%)


Achieving Better Compression

In order to achieve compression rates higher than LZ, Vertica implements a variety of homegrown encoding and compression algorithms specifically designed for a column store database. In addition to optimized block layouts and well-known algorithms, such as run length encoding and delta value encoding, Vertica uses optimized integer and floating point compression algorithms and compound encoding that combines algorithms to increase compression and overall system performance.

Note that Vertica not only reads compressed data off of disk but also processes queries on the compressed data, saving memory and CPU bandwidth (see this previous post). This cannot be done when compressing with LZ. 

Applying LZ to any database, a server with enough CPU and a threaded storage subsystem will reduce I/O at the expense of CPU. The result is performance improvements as high as 30% and space savings up to 7:1 compared to raw data. These numbers approximate the best you can get out of a traditional database, storing records as variable sized rows with headers and applying per-block compression. These numbers also assume that data in any given block is homogenous (e.g., same date, instrument, and customer), which may be the case for bulk loaded data but is not necessarily the case when trickle loading. In a trickle load stream that adds only thousands of records per second, most databases add records into any block with free space, resulting in mixed data values and reducing overall compression. In addition, auxiliary structures such as indexes and materialized views add bloat that is more difficult to compress when these structures need to be updated as records are added.

In a recent head-to-head comparison with a competitor, a prospective customer was able to only test 300GB of raw data on a popular row store because its server was limited to 1TB of disk and the additional structures required to achieve adequate performance consumed three times the raw data size. Loading into Vertica the calculated capacity was close to 4TB of raw data on the same 1TB of disk.

You can use the compression ratio in the table to determine what your storage requirements would be using Vertica. With other vendors, you mileage will most certainly vary, but don't forget to factor in indexes, materialized views, and other auxiliary structures for row stores (such as per tuple headers), which are likely to require 2-5x these sizes.


Debunking Another Myth: Column-Stores vs. Vertical Partitioning

Vie, 01/08/2008 - 18:11
Editors note: This post was co-authored by Daniel Abadi and Samuel Madden

Last week we discussed the myth that a heavily indexed row-store can provide column-store-like performance. In particular, there is a common misconception that a column-store is basically like having a row-store with an index on every column. In fact, an indexed column in a row-store and a regular column in a column-store are very different data structures: an index maps column values to tuple IDs while a column maps tuple IDs to column values. The different data structures are useful in different situations.

This week, we debunk another commonly proposed approach for making a row-store perform like a column-store: vertically partitioning a row-store. Vertical partitioning is a technique that can be used to enhance performance on read-mostly data warehouse workloads. The idea is to store an n-column table in n new tables. Each of these new tables contains two columns - a tuple ID column and data value column from the original table. For example, the three column-table:

 

could be transformed into the following three tables:
 
Each table is clustered by Tuple ID. Queries over the original table are then rewritten into queries over the new tables. For example, the simple query:

SELECT name, city
FROM customers

Would be rewritten into:

SELECT t1.value, t2.value
FROM Name AS t1,
             City AS t2
WHERE t1.tuple_id = t2.tuple_id

Note that a vertically partitioned store functions like a column-store. If only 2 out of the 3 attributes of a table are requested by a query (as in the example above), then only 2 out of the 3 partitions need to be accessed. Of course, a join must be performed to merge these multiple attributes together, but since each vertical partition is clustered by the join key (Tuple ID), the join is not expensive. Furthermore, due to the clustering on Tuple ID, a vertically partitioned data structure is like a column-store in that each partition is a Tuple_ID-to-value mapping (which, as we showed in last week's post, is the opposite of the value-to-Tuple_ID mapping used in indexes).

Hence, a column store and a vertically partitioned row-store are very similar. They have the same set of trade offs relative to normal row-stores (better for read queries, slower for non-batch write queries). Thus, it is only natural to expect the two approaches to perform similarly well on read-mostly data warehouse workloads.

Surprisingly, in practice, this is not the case.

In the same SIGMOD 2008 paper that we discussed in last week's blog posting, "Column-Stores vs. Row-Stores: How Different Are They Really?," we implemented vertical partitioning inside of a commercial row-store on a read-only benchmark. The benchmark we used was the Star Schema Benchmark, a recently proposed benchmark designed to be more "typical" of data warehousing data and queries than TPC-H. We compared the performance of the vertically partitioned commercial row-store with the same row-store under a more normal configuration (optimized by a professional DBA using best practices) and with a column-store. The results are shown in the figure below:

 

On this read-only data warehouse benchmark, one expects the column-store to outperform the row-store. We found this to indeed be the case. However, the surprising result is that instead of the vertically partitioned row-store performing similarly to the column-store as the discussion above suggests it should, it performs an order of magnitude worse than the column-store and a factor of 3 worse than the original row-store!

In the paper, we explore the reasons behind this surprising result, looking in detail at performance on different queries in the benchmark. The following are a list of high level reasons we came across for the poor performance of vertical partitioning on the row-store.

1) Data sizes. Table scans are often the best choice (relative to index scans) for data access in data warehouse workloads. The reason is that, in general, a large amount of data is needed by the query operators (e.g. to perform a summarization or an aggregation). A good rule of thumb is that unless less than 0.1-1% of the tuples are accessed by a query, an index scan will be slower than (or no faster than) a simple sequential scan. However, the time to perform a table scan is directly proportional to the size of the table.

Unfortunately, the Tuple ID column significantly increases the total storage required for the vertical partitioning approach. Since the columns that are accessed from the fact table in data warehouse queries are generally foreign keys (for joins with a dimension table) or are otherwise a numeric type (such as 'price' or 'revenue') that will be input to an aggregate operator, the size of the Tuple ID is generally on the order of the same size as the actual column value data. Thus, the existence of the Tuple ID column basically doubles the size of the table.

Furthermore, each tuple in a table contains a tuple header. Usually the size of a tuple header is insignificant, since the size of the actual data stored in a tuple is generally much larger. However, in the vertical partitioning case, where each table is really narrow (each containing two narrow columns), the size of the tuple header relative to the size of the rest of the tuple is no longer insignificant.  For example, in the open source PostgreSQL database, which is reputed to have a small header relative to most commercial products, the header size is 24 bytes. Compare this to the 8 bytes of actual data per tuple for a two-column vertical partition containing two integer columns. Column-stores avoid the tuple header overhead problem by storing tuple headers just once in their own separate column, and using concurrency control schemes that do not require checking the tuple header on every tuple access.

The bottom line is that the full 17 column fact table in our benchmark took up approximately 4 GB of space after compression. Meanwhile, each vertical partition took up between 0.7 and 1.1 GB of space after compression. In the star schema benchmark, the average query accesses 4-5 vertical partitions. This results in approximately the same amount of data being scanned in the vertical partitioning case as in the normal row-store case (even though only 25% of the vertical partitions need to be accessed per query). Thus, these space overheads completely eliminate the 'efficient data reading' advantage of vertical partitioning!

(As a side note, the row-store product that we used would not allow the use of 'virtual tuple ids' if we wanted to use tuple id as a join key. If the product allowed this, some of the space overhead could have been avoided).

2)  Horizontal partitioning. Under the normal configuration, the row-store product was able to take advantage of horizontal partitioning to improve performance. In particular, many of the queries in the benchmark contain a predicate on the 'year' attribute of the date dimension table. The row-store was able to use a 'partition-by-reference' feature to partition the fact table by the year dimension attribute to yield a factor of two performance improvement on average. Once we vertically partitioned the data, we could no longer use the 'partition-by-reference' feature to horizontally partition the vertical partitions, since only one of the vertical partitions contained the appropriate reference to the date dimension table. If the product allowed for an extra level of indirection in 'partition-by-reference' (using the tuple id) we could have partitioned the vertical partitions by year and seen the factor of two performance improvement; but as far as we know, no row-store product allows for 'partition-by-reference-by-reference'.

3)  Partition joins. Although joins between vertical partitions can generally be done using high performance techniques due to the clustering by tuple id; the sheer number of joins required by the vertical partitioning approach occasionally overwhelmed the query optimizer, resulting in suboptimal plans. Column-stores generally avoid this issue by considering joins of columns within the same table separately from joins across tables, and since joins within a table are so common, they typically heuristically push high-performance column merge-joins to the bottoms of query plans.

We found a variety of other reasons why vertical partitioning does not in practice yield column-store performance, including a lack of column-oriented compression and a lack of optimizations for fixed-width attributes; but the above three reasons were responsible for the bulk of the observed performance difference.

In conclusion, our published experiments show that vertically partitioning a row-store can yield very different performance relative to using a column-store. The observed poor performance of the vertically partitioned row-store is likely the reason why most DBAs do not employ complete vertical partitioning to improve row-store performance; rather, they tend to use hybrid methods such as multi-column vertical partitions or column-subset materialized views (nonetheless, our paper contains some experiments that show that the column-store still outperforms the row-store, even when the row-store contains the optimal set of materialized views).

These results present a snapshot of the state of affairs as they exist today. However, as column-stores continue to gain market share in the data warehousing market, we expect some of the commercial row-store products to fix some of the issues mentioned above. The data size issue seems to be particularly low hanging fruit--it is easy to envision a row-store that uses virtual tuple IDs as join keys and that avoids the duplication of tuple headers across vertical partitions. So while vertical partitioning performance can (and most likely will) improve, how close it can get to column-store performance remains an open question.

Intuitively, one expects a DBMS designed specifically for vertical partitioning (i.e., a column-store) will have a fundamental advantage, since column-stores can make column-specific optimizations in all parts of the DBMS code (the query executer, the query optimizer, the storage layer, etc.) while a row-store will have to occasionally make trade offs to perform well for both row-oriented and column-oriented processing. In our next blog posting, we will explore this topic in more detail, showing how column-oriented optimizations made to DBMS components outside of the traditional (storage layer) realm associated with column-stores can yield order of magnitude performance improvements.  Hence, it will require wide spread and extensive code modifications to for row-stores to even approach column-store performance.

Debunking Another Myth: Column-Stores vs. Vertical Partitioning

Jue, 31/07/2008 - 21:46
We debunk another commonly proposed approach for making a row-store perform like a column-store: vertically partitioning a row-store. Vertical partitioning is a performance enhancing trick that some DBAs perform to enhance performance on read-mostly data warehouse workloads. The idea is to store an n-column table in n new tables. Each of these new tables contains two columns - a tuple ID column and data value column from the original table. Daniel Abadi http://www.databasecolumn.com

Debunking a Myth: Column-Stores vs. Indexes

Vie, 18/07/2008 - 20:50
Consider a traditional, row-oriented database.  Indexes are known to improve query performance. They can greatly reduce I/O costs by avoiding the need to perform table scans since they directly contain the data you need to answer a query or contain pointers to such data. If you have a query that accesses only two out of thirty columns from a large table, and you have an index on these two columns, then you can use the indexes to avoid scanning all of the data in the table.

A challenge when using a traditional database is deciding what indexes to create on your tables.  One either pays a DBA to carefully choose the right set of indexes to optimize a target workload, or you buy a database with an auto-tuning feature to create this set of indexes automatically (which might not be as good as a human DBA).

Ideally, it would be possible to have an index on every column. Unfortunately, every index you create results in the materialization of another copy of the column data (in addition to having other space overheads for pointers and other parts of the index data structure). Thus, the size of your database would be enormous if you had an index on every column. Even if you had infinite storage space (so that this explosion in data storage was not an issue), index maintenance is very expensive. Updates and inserts need to be reflected in the raw data and all of the indexes. Hence, there is a fundamental trade off - indexes improve query performance, but cost you in storage and maintenance. This is why you need an expert to choose the right set of indexes.

Now consider a column-store. By storing each column separately, the benefit appears similar to having an index on every column in a table in a row-store. If you have a query that accesses only two out of thirty columns from a large table, the column-store only reads  those two columns and can avoid the enormous table scan (just like having an index). However, since it is the raw copy of the data that is stored in columns, no additional copies of the data need to be created, so the storage and update overheads associated with indexes is avoided.

Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).

Though this argument sounds reasonable, it is completely incorrect.  It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store.

Assume the following situation:
a) You already have a license for a commercial row-store
b) You have tons of extra storage space
c) You have a read-only workload (so index maintenance is not an issue)

Using the above reasoning, in this situation you would not need to go out and buy a column-store. You would just create an index on every column on your row-store.

In our SIGMOD 2008 paper, "Column-Stores vs. Row-Stores: How Different Are They Really?" which we presented last month in Vancouver, we explored this situation, running a commercial row-store (with no storage restrictions) on a read-only benchmark. The benchmark we used was the Star Schema Benchmark, a recently proposed benchmark designed to be more "typical" of data warehousing data and queries than TPC-H. We compared the performance of the commercial row-store (where we created an index on every column and forced the database to always use these indexes to access data instead of using a full table scan to access data) with the same row-store under a more normal configuration (optimized by a professional DBA) and with a column-store. The results are shown in the figure below:

The fact that the column-store was almost a factor of six faster than the row-store was not surprising. After all, column-stores are supposed to outperform row-stores for data warehousing workloads. But if one views a column-store as similar to a row-store with an index on every column, one would have expected the row-store (all-indexes) approach to perform about as fast as the column-store. Instead, it performed over a factor of 50 slower, and almost an order of magnitude slower than the same commercial row-store that used full table scans to access data instead of index accesses!

So what's going on here? It turns out that a column in a column-store is very different from an index. A column in a column-store stores attribute data in the same order that it appeared in the original table (or from a sorted projection of that table). You can think of this as mapping tuple ID to column value. For example, as shown in Figures 2 - 4, if you want the value for the "customer city" attribute for the 6th tuple in a table (or projection), you can find this value by jumping to the 6th value in the "customer city" column. On the other hand, an index contains the exact opposite mapping. It maps a column value to tuple ID. If you want to find the tuple ID for all tuples whose "customer city" is "Denver", an index is great. But what if you want to find the "customer city" of the 6th tuple? You would have to scan the whole index, looking for tuple ID 6.

 

 



 
 
So indexes are often useful in first part of query execution where predicate evaluation occurs (dealing with the "WHERE" part of a SQL statement), where you are looking for tuples with specific values (it turns out that even then, indexes are only useful for very selective predicates). But for the later part of the query plan, where the database is extracting values for attributes for specific tuple IDs (the "SELECT" and "GROUP BY" part of a SQL statement), you want a tuple ID to value mapping, and a column is better than an index. The reason why the "row-store all-indexes" approach went so slow in our experiments is that for each tuple ID produced by evaluating the predicates in the SQL "WHERE" clause, the database would have to search the index (using the wrong mapping) for each attribute that appeared in the "GROUP BY" and "SELECT" clauses. This can be thought of as adding one additional join to the query for each attribute that appears in the "GROUP BY" and "SELECT" clauses.

Hence, an index and a column are quite different data structures. Of course, there are some situations where what you really want is an index and not a column. For example, if you had a query workload with a lot of "needle-in-the-haystack" queries (queries with very selective predicates), you need to use a lot of indexes. If you have the incorrect perception that a column-store is pretty much the same as a row-store with an index on every column, you might be tempted to use a column-store. In fact, what you really want is a heavily indexed database (either a row-store, an indexed column-store, or a column-store with multiple redundant sort orders).

An astute reader might ask the question: what if the row-store was able to have indexes that mapped tuple-ID to value instead of the other way around? We studied that idea too, and although this significantly improves the performance of the all-index approach, it still does not approach the performance of the column-store. We will explain why this is the case in a future blog post.

(Ed.  This article was co-authored by Sam Madden)


Debunking a Myth: Column-Stores vs. Indexes

Vie, 18/07/2008 - 19:53
Consider a traditional, row-oriented database. Indexes are known to improve performance in database systems. They can greatly reduce I/O costs by avoiding the need to perform table scans since they directly contain the data you need to answer a query or contain pointers to such data. If you have a query that accesses only two out of thirty columns from a large table, and you have an index on these two columns, then you can use the indexes to avoid scanning all of the data in a table. Daniel Abadi http://www.databasecolumn.com

Database Column contributor: Daniel Abadi

Vie, 18/07/2008 - 03:52
Daniel's research interests are in database system architecture and implementation, cloud computing, and the Semantic Web. He currently serves on the Yale computer science faculty as an Assistant Professor. At Yale he
teaches both undergraduate and graduate level classes on database systems, and directs DR@Y, the database research group at Yale. Before joining Yale, he spent four years at the Massachusetts Institute of Technology
where he published numerous papers on column-store databases, lead the C-Store development effort, and wrote his Ph.D. dissertation on "Query Execution in Column-Oriented Database Systems". Daniel has been a recipient of a Churchill Scholarship, an NSF Graduate Research Fellowship, and a VLDB best paper award.

For more, Daniel's Website can be found at: http://cs-www.cs.yale.edu/homes/dna/

Understanding the Difference Between Column-Stores and OLAP Data Cubes

Mié, 09/07/2008 - 15:29
Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP.)  These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions.  For example, in a sales database, one might wish to find the sales of technology products by month and store--the SQL query to do this would look like:
SELECT month, store, COUNT(*)
FROM sales, products
WHERE productType = 'technology'
AND products.id = sales.productID
GROUP BY month, store

In this post, we study how column-stores and data cubes would evaluate this query on a sample database:

Column Store Analysis
In column-stores, this query would be answered by scanning the productType column of the products table to find the ids that have type technology.  These ids would then be used to filter the productID column of the sales table to find positions of records with the appropriate product type.  Finally, these positions would be used to select data from the months and stores columns for input into the GROUP BY operator.  Unlike in a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology.

Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and the ability to directly offset to ranges of satisfying tuples.  For example, notice that the sales table is sorted on productID, then month, then storeID.   Here, all of the records for a given productID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinct productID.  Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space (see my previous post on compression in column-stores).  Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID,month) pair, storeID is sorted.  For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months and stores, then the productID column can be stored in 1,000 records (one entry per product), the month column can be stored in 1,000 x 12 = 12,000 records, and the storeID column can be stored in and 1,000 x 12 x 10 = 120,000 records.  This compression means that less the amount of data read from disk is less than 5% of its uncompressed size.

Data Cube Analysis


Data cube-based solutions (sometimes referred to as MOLAP systems for "multidimensional online analytical processing"), are represented by commercial products such as EssBase.  They  store data in array-like structures, where the dimensions of the array represent columns of the underlying tables, and the values of the cells represent pre-computed aggregates over the data.  A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above.  Here, the cube includes "roll-up" cells that summarize the values of the cells in the same row, column, or "stack" (x,y position.) If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like:

 

Here, each cell contains the count of the number of records with a given (productID,month,storeID) value.  For example, there is one record with storeID=1, productID=2, and month=April.  The "sum" fields indicate the values of the COUNT "rolled up" on specific dimensions; for example, looking at the lower left hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months.  Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products--this is indicated by their dark shading in the above figure.)  Then, the system reads the pre-aggregated values from sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in Feburary and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.

The advantages of a data cube should be clear--it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries.  It can be used to efficiently compute a hierarchy of aggregates--for example, the sum columns in the above cube make it is very fast to compute the number of sales in a given month across all stores, or the number of sales or a particular product across the entire year in a given store.  Because the data is stored in an array-structure, and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:

  • Sparsity:  Looking at the above cube, most of the cells are empty.  This is not simply an artifact our sample data set being small--the number of cells in a cube is the product of the cardinalities of the dimensions in the cube.  Our 3D cube with 10 stores and 1,000 products would have 120,000 cells, and adding a fourth dimension, such as customerID (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion!  Such high dimensionality cubes cannot be stored without compression.  Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible. For example, a common technique is to store them as a table with the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database!
  • Inflexible, Limited ad-hoc query support:  Data cubes work great when a cube aggregated on the dimensions of interest and using the desired aggregation functions is available.  Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or if the user wants to include aggregates on customerID in addition to the other attributes.  If no cube is available, the user has no choice but to fall back to queries on an underlying relational system.  Furthermore, if the user wants to drill down into the underlying data--asking, for example "who was the customer who bought a technology product at store 2 in February?"--the cube cannot be used (one could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions.)  To deal with these limitations, some cube systems support what is called "HOLAP" or "hybrid online analytical processing", where they will automatically redirect queries that cannot be answered with cubes to a relational system, but such queries run as fast as whatever relational system executes them.
  • Long load times:  Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database.)  Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries.
Summary and Discussion

Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed.  They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed.  

In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!) However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column store.

Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system.  That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world.  Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.



Understanding the Difference Between Column-Stores and OLAP Data Cubes

Mar, 08/07/2008 - 03:47
Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP.) These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions. In this post, we study how column-stores and data cubes would evaluate a query on a sample database. Sam Madden

Designing Systems for the Grid: The Problem with "Retrofitting," Part 1

Sáb, 21/06/2008 - 02:09
One of the key features of new data warehouse databases such as Vertica is their ground-up support for distributed, shared-nothing grid computing architectures. Because of scalability and low costs, such architectures are becoming the norm in large enterprises, and because of their scalability requirements, data warehouses are a natural fit in this world.

Some database vendors have attempted to "retrofit" their centralized DBMS designs to work in a distributed world. The basic idea of retrofitting is to reuse as much of the centralized code base as possible in producing a distributed design. The motivation for retrofitting is clear: In building a database system, it is often easier to reuse existing components and adapt them for new uses than it is to build a new system from scratch. This reduces the time-to-market of a product that can claim grid support. But the performance improvements of these retrofitted systems often do not live up to the raw increases in horsepower that grid architectures provide.  There are two reasons for this:

  1. Code that gets reused in a retrofitted system is often brittle as a result of many years of patchwork. To change this code introduces potential instabilities, and thus, there is a natural desire to instead treat such code as "black boxes."

  2. Black box code was often designed with assumptions of a centralized architecture, and these assumptions may constrain performance when executed over a distributed system over which the assumptions do not hold.

We will illustrate this point using the database query optimizer as an illustrative example of how retrofitting strategies can fail. This argument will require some understanding of how centralized query optimizers work.  Therefore, we will divide this discussion into two parts. The first installment will provide a background on centralized query optimization; the second installment will show why retrofitting a centralized query optimizer to work on the grid can lead to poor performance when evaluating queries.


Part 1:  A Primer on Centralized Query Optimization

In this installment, we present a primer on centralized query optimization.  

The purpose of a query optimizer is to produce a cost-effective evaluation plan (or just plan) for any query submitted to the database. The basic strategy used to come up with this plan is largely the same for every query optimizer:

a) It first formulates a set of candidate plans that could be used to evaluate the query

b) It then applies a cost model to predict the execution time (cost) of each of the candidate plans. A cost model consists of a set of formulas that specify the sizes of intermediate query results, and the cost (e.g., time) required to produce them. For example, a simplistic cost model measures cost as the number of disk reads required to evaluate the query, with the idea that queries that perform the fewest disk reads will execute in the least time.

c)  Upon evaluating the cost of each candidate plan, the query optimizer then selects the plan with least cost.

One of the most crucial design decisions that affects the effectiveness of a query optimizer lies in how it limits the size of the space of candidate plans (the search space) that it must consider. Specifically, a query that includes multiple tables in its FROM clause can be evaluated using any of a number of plans that differ only by the order in which these tables are joined. Consider, for example, the SQL query fragment below:

       SELECT     *
       FROM      T1, T2, T3, T4
       WHERE     ...

This query must join 4 tables: T1, T2, T3, and T4. The order in which pairs of tables are joined can vary. For example, the binary tree below (hereafter referred to as a join plan) shows one possible join ordering.

The join plan above specifies that the first join to be performed is of T1 and T2 (the "bowtie" icon specifies a join), followed by the result being joined with T3, and the result of this in turn being joined with T4. This join plan has a structure which is left-deep (or equivalently, right-shallow) because the right branch of every join in the plan is a base table.  

For the 4-table query above, there are 4! = 24 different left-deep join plans that differ according to which tables correspond to which leaves of the tree (4! = the number of sequences of a set of 4 items). Aside from the left-deep plans, there are 4 other join plan structures that are possible for the query above as illustrated below.  Note that each of these join plan structures also has 24 variations given a query with 4 tables (by permuting the tables at the leaves), so in all, there are 5 * 24 = 120 join plans for an optimizer to consider for this query.


In general, given n tables to be joined, there are:  

(2(n-1))!
---------------
n! * (n-1)!

possible join plan structures(1), and for each join plan structure, n! possible join plans, giving a total of:

(2(n-1))!
---------------
(n-1)!
 
join plans that an optimizer could consider. As the number of tables in a query grows, the number of join plans to consider quickly becomes infeasible. For example, whereas a query with 4 tables requires consideration of 120 join plans, a query with 5 tables requires consideration of 151,200 join plans, a query with 6 tables requires consideration of 3,991,680 join plans, and so on.

To cope with this enormous search space, all query optimizers must somehow limit the set of join plans considered. IBM's System R (from the late 1970s) first introduced the idea of limiting the search space to the set of join plans with a left-deep join plan structure because left-deep plans ensure that every binary join is performed with at least one participant table on disk, thereby ensuring that a join operator can produce output incrementally as its input data arrives (pipelining). The left-deep restriction reduces the number of join plans to consider for a query of n tables to n!, and dynamic programming techniques can be used to find the "best" query plan in this space in exponential time. In practice, this is a reasonable amount of time to process queries consisting of roughly 30 tables or fewer (YMMV), and thus, this heuristic is still used to narrow the search space of most commercial DBMS. 

Of course, there are many other challenges in designing an effective query optimizer aside from managing the search space, including proper choices of access methods and indexes, query unnesting, etc. But in the next installment of this blog, I will show how the typical retrofitted query optimizer determines its search space for plans that apply to the grid and how the resulting optimizer can fail to produce appropriate plans.
 (1) This is the known as the nth Catalan number, which specifies (among other things) the number of binary tree "shapes" consisting of n leaves.

Part 2 will be available next week . . .

Designing Systems for the Grid: The Problem with "Retrofitting," Part 1

Jue, 19/06/2008 - 20:26
In a two-part post, we will illustrate how -- using a database query optimizer as an example -- the strategy of retrofitting databases in a distributed, shared-nothing grid computing architecture can fail. This argument will require some understanding of how centralized query optimizers work. Therefore, we will divide this discussion into two parts. The first installment will provide a background on centralized query optimization; the second installment will show why retrofitting a centralized query optimizer to work on the grid can lead to poor performance when evaluating queries. Mitch Cherniack

DBMS innovations that will make analytics in the cloud a reality

Mié, 04/06/2008 - 00:30
There will soon be a myriad of announcements of DBMS offerings in the cloud. Many of these will NOT be marriages made in heaven. However, the most innovative new DBMS software combined with new cloud computing services are here today and truly take advantage of the cloud architecture in order to change the economics and the responsiveness of business analytics.

In my last article, I described how I think cloud computing will change the economics of business intelligence (BI) and enable a variety of new analytic data management projects and business possibilities. It does so by making the hardware, networking, security, and software needed to create data marts and data warehouses available on demand with a pay-as-you-go approach to usage and licensing.

A computing cloud, such as the Amazon Elastic Compute Cloud, is composed of thousands of commodity servers running multiple virtual machine instances (VMs) of the applications hosted in the cloud. As customer demand for those applications changes, new servers are added to the cloud or idled and new VMs are instantiated or terminated.

Cloud computing infrastructure differs dramatically from the infrastructure underlying most in-house data warehouses and data marts. There are no high-end servers with dozens of CPU cores, SANs, replicated systems, or proprietary data warehousing appliances available in the cloud. Therefore, a new DBMS software architecture is required to enable large volumes of data to be analyzed quickly and reliably on the cloud's commodity hardware. Recent DBMS innovations make this a reality today, and the best cloud DBMS architectures will include:

  1. Shared-nothing, massively parallel processing (MPP) architecture. In order to drive down the cost of creating a utility computing environment, the best cloud service providers use huge grids of identical (or similar) computing elements. Each node in the grid is typically a compute engine with its own attached storage. For a cloud database to successfully "scale out" in such an environment, it is essential that the database have a shared-nothing architecture utilizing the resources (CPU, memory, and disk) found in server nodes added to the cluster. Most databases popularly used in BI today have shared-everything or shared-storage architectures, which will limit their ability to scale in the cloud.

  2. Automatic high availability. Within a cloud-based analytic database cluster, node failures, node changes, and connection disruptions can occur. Given the vast number of processing elements within a cloud, these failures can be made transparent to the end user if the database has the proper built-in failover capabilities. The best cloud databases will replicate data automatically across the nodes in the cloud cluster, be able to continue running in the event of 1 or more node failures ("k-safety"), and be capable of restoring data on recovered nodes automatically -- without DBA assistance. Ideally, the replicated data will be made "active" in different sort orders for querying to increase performance.

  3. Ultra-high performance. One of the game-changing advantages of the cloud is the ability to get an analytic application up quickly (without waiting for hardware procurement). However, there can be some performance penalty due to Internet connectivity speeds and the virtualized cloud environment. If the analytic performance is disappointing, the advantage is lost. Fortunately, the latest shared-nothing columnar databases are designed specifically for analytic workloads, and they have demonstrated dramatic performance improvements over traditional, row-oriented databases (as verified by industry experts, such as Gartner and Forrester, and by customer benchmarks). This software performance improvement, coupled with the hardware economies of scale provided by the cloud environment, results in a new economic model and competitive advantage for cloud analytics.

  4. Aggressive compression. Since cloud costs are typically driven by charges for processor and disk storage utilization, aggressive data compression will result in very large cost savings. Row-oriented databases can achieve compression factors of about 30% to 50%; however, the addition of necessary indexes and materialized views often swells databases to 2 to 5 times the size of the source data. But since the data in a column tends to be more similar and repetitive than attributes within rows, column databases often achieve much higher levels of compression. They also don't require indexes. The result is normally a 4x to 20x reduction in the amount of storage needed by columnar databases and a commensurate reduction in storage costs.

  5. Standards-based connectivity. While there are a number of special-purpose file systems that have been developed for the cloud environment that can provide high performance, they lack the standard connectivity needed to support general-purpose business analytics. The broad base of analytic users will use existing commercial ETL and reporting software that depend on SQL, JDBC, ODBC, and other DBMS connectivity standards to load and query cloud databases. Therefore, it's imperative for cloud databases to support these connection standards to enable widespread use of analytic applications.
In summary, cloud databases with the architectural characteristics described above will be able to not just run in the cloud, but thrive there by:

  • "Scaling out," as the cloud itself does

  • Running fast without high-end or custom hardware

  • Providing high availability in a fluid computing environment

  • Minimizing data storage, transfer, and CPU utilization (to keep cloud computing fees low)


DBMS innovations that will make analytics in the cloud a reality

Mié, 04/06/2008 - 00:08
There will soon be a myriad of announcements of DBMS offerings in the cloud. Many of these will NOT be marriages made in heaven. However, the most innovative new DBMS software married to new cloud computing services are here today and truly take advantage of the cloud architecture in order to change the economics and the responsiveness of business analytics. Jerry Held