External news

State Of Lightning Talks 2008 "Final" Program

Data Management Blogs - 15 hours 59 mins ago

For three years now, OSCON has provided a 90-minute session for open source projects to update OSCON attendees on "what's going on this year." If you're new to open source, it's a great way to get an idea of the scope of open source software out there. If you're an OSS geek, it's a chance to see what's new with your favorite projects.

Most importantly, with each presentation strictly

New Application Management Pack for Siebel Customer Self Study Training Available

Data Management Blogs - Fri, 18/07/2008 - 22:25
Application Management Pack for Siebel’s eStudy is now available. The self-paced online course is a tutorial for deploying, configuring and using our Siebel pack. The course assumes familiarity with the base EM capabilities taught in the five-day instructor-led Enterprise Manager Grid Control training course, and complements other training that covers EM features such as Service Level Management and Configuration

A letter to you, the customer, about the Information On Demand (IOD) 2008 Conference

Data Management Blogs - Fri, 18/07/2008 - 21:10
I am pleased to announce the IBM Information On Demand (IOD) 2008 global conference, October 26-31, 2008 at the majestic Mandalay Bay Hotel and Convention Center in Las Vegas, Nevada. The global conference continues to bring our information management community fresh, innovative ideas to grow, to optimize your business and to win! Last year our customers and business partners networked,

Debunking a Myth: Column-Stores vs. Indexes

The Database Column - Fri, 18/07/2008 - 20:50
Consider a traditional, row-oriented database.  Indexes are known to improve query performance. They can greatly reduce I/O costs by avoiding the need to perform table scans since they directly contain the data you need to answer a query or contain pointers to such data. If you have a query that accesses only two out of thirty columns from a large table, and you have an index on these two columns, then you can use the indexes to avoid scanning all of the data in the table.

A challenge when using a traditional database is deciding what indexes to create on your tables. You either pay a DBA to carefully choose the right set of indexes to optimize a target workload, or you buy a database with an auto-tuning feature that creates this set of indexes automatically (which might not be as good as what a human DBA would choose).

Ideally, it would be possible to have an index on every column. Unfortunately, every index you create results in the materialization of another copy of the column data (in addition to other space overheads for pointers and other parts of the index data structure). Thus, the size of your database would be enormous if you had an index on every column. Even if you had infinite storage space (so that this explosion in data storage was not an issue), index maintenance is very expensive. Updates and inserts need to be reflected in the raw data and in all of the indexes. Hence, there is a fundamental trade-off: indexes improve query performance, but cost you in storage and maintenance. This is why you need an expert to choose the right set of indexes.

Now consider a column-store. By storing each column separately, the benefit appears similar to having an index on every column in a table in a row-store. If you have a query that accesses only two out of thirty columns from a large table, the column-store only reads those two columns and can avoid the enormous table scan (just like having an index). However, since it is the raw copy of the data that is stored in columns, no additional copies of the data need to be created, so the storage and update overheads associated with indexes are avoided.
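To make the two-out-of-thirty-columns case concrete, here is a minimal Python sketch. It is only an illustration: the table size, the column names (`c0` to `c29`) and the "values read" counter are assumptions made for the example, not anything measured on a real system.

```python
# Hypothetical table: 1,000 rows and 30 columns named c0..c29 (illustrative sizes).
NUM_ROWS, NUM_COLS = 1000, 30
col_names = [f"c{i}" for i in range(NUM_COLS)]

# Row-oriented layout: each row keeps all 30 attribute values together.
row_store = [tuple(r * NUM_COLS + c for c in range(NUM_COLS)) for r in range(NUM_ROWS)]

# Column-oriented layout: one list per attribute, values kept in the same row order.
column_store = {name: [row[i] for row in row_store] for i, name in enumerate(col_names)}


def scan_row_store(wanted):
    """Full table scan: every value of every row is read to project the wanted columns."""
    positions = [col_names.index(name) for name in wanted]
    values_read = 0
    result = []
    for row in row_store:
        values_read += len(row)  # the whole row is touched
        result.append(tuple(row[p] for p in positions))
    return result, values_read


def scan_column_store(wanted):
    """Only the wanted columns are read; the other 28 are never touched."""
    cols = [column_store[name] for name in wanted]
    values_read = sum(len(col) for col in cols)
    return list(zip(*cols)), values_read


rows_result, rows_read = scan_row_store(["c3", "c17"])
cols_result, cols_read = scan_column_store(["c3", "c17"])
```

Both scans return the same answer, but the row-oriented scan touches every value in the table while the column-oriented scan touches only the two requested columns.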

Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).

Though this argument sounds reasonable, it is completely incorrect.  It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store.

Assume the following situation:
a) You already have a license for a commercial row-store
b) You have tons of extra storage space
c) You have a read-only workload (so index maintenance is not an issue)

Using the above reasoning, in this situation you would not need to go out and buy a column-store. You would just create an index on every column on your row-store.

In our SIGMOD 2008 paper, "Column-Stores vs. Row-Stores: How Different Are They Really?" which we presented last month in Vancouver, we explored this situation, running a commercial row-store (with no storage restrictions) on a read-only benchmark. The benchmark we used was the Star Schema Benchmark, a recently proposed benchmark designed to be more "typical" of data warehousing data and queries than TPC-H. We compared the performance of the commercial row-store (where we created an index on every column and forced the database to always use these indexes to access data instead of using a full table scan to access data) with the same row-store under a more normal configuration (optimized by a professional DBA) and with a column-store. The results are shown in the figure below:

The fact that the column-store was almost a factor of six faster than the row-store was not surprising. After all, column-stores are supposed to outperform row-stores for data warehousing workloads. But if one views a column-store as similar to a row-store with an index on every column, one would have expected the row-store (all-indexes) approach to perform about as fast as the column-store. Instead, it performed over a factor of 50 slower, and almost an order of magnitude slower than the same commercial row-store that used full table scans to access data instead of index accesses!

So what's going on here? It turns out that a column in a column-store is very different from an index. A column in a column-store stores attribute data in the same order that it appeared in the original table (or from a sorted projection of that table). You can think of this as mapping tuple ID to column value. For example, as shown in Figures 2 - 4, if you want the value for the "customer city" attribute for the 6th tuple in a table (or projection), you can find this value by jumping to the 6th value in the "customer city" column. On the other hand, an index contains the exact opposite mapping. It maps a column value to tuple ID. If you want to find the tuple ID for all tuples whose "customer city" is "Denver", an index is great. But what if you want to find the "customer city" of the 6th tuple? You would have to scan the whole index, looking for tuple ID 6.
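The two opposite mappings can be sketched in a few lines of Python. This is a toy model under stated assumptions: the city values are made up, tuple IDs are taken to be 1-based positions, and a plain dictionary stands in for whatever index structure a real database would use.

```python
from collections import defaultdict

# Hypothetical "customer city" column; the position in the list is the tuple ID (1-based).
customer_city = ["Boston", "Denver", "Austin", "Denver", "Boston", "Seattle", "Denver"]


def build_index(column):
    """Build the inverse mapping: column value -> list of tuple IDs."""
    index = defaultdict(list)
    for tuple_id, value in enumerate(column, start=1):
        index[value].append(tuple_id)
    return dict(index)


city_index = build_index(customer_city)


def value_from_column(column, tuple_id):
    """Column: tuple ID -> value is a direct positional jump."""
    return column[tuple_id - 1]


def value_from_index(index, tuple_id):
    """Index: answering tuple ID -> value means scanning entries until the ID turns up."""
    for value, tuple_ids in index.items():
        if tuple_id in tuple_ids:
            return value
    raise KeyError(tuple_id)


denver_ids = city_index["Denver"]                 # what the index is good at
sixth_city = value_from_column(customer_city, 6)  # what the column is good at
sixth_city_slow = value_from_index(city_index, 6) # same answer, found by scanning
```

Both lookups for the 6th tuple return the same city, but the column finds it with one positional access while the index has to be searched entry by entry.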

So indexes are often useful in the first part of query execution, where predicate evaluation occurs (dealing with the "WHERE" part of a SQL statement) and you are looking for tuples with specific values (it turns out that even then, indexes are only useful for very selective predicates). But for the later part of the query plan, where the database is extracting values for attributes for specific tuple IDs (the "SELECT" and "GROUP BY" part of a SQL statement), you want a tuple ID to value mapping, and a column is better than an index. The reason the "row-store all-indexes" approach was so slow in our experiments is that for each tuple ID produced by evaluating the predicates in the SQL "WHERE" clause, the database would have to search the index (using the wrong mapping) for each attribute that appeared in the "GROUP BY" and "SELECT" clauses. This can be thought of as adding one additional join to the query for each attribute that appears in the "GROUP BY" and "SELECT" clauses.
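The following Python sketch walks through those two phases on a toy table. It is an illustration under assumptions, not the setup from the experiments: the table contents and function names are hypothetical, tuple IDs are 0-based positions, and dictionaries stand in for real index structures.

```python
# Hypothetical three-attribute table, stored column-wise; tuple IDs are 0-based positions.
columns = {
    "city":    ["Denver", "Boston", "Denver", "Austin", "Denver"],
    "nation":  ["US", "US", "US", "US", "US"],
    "revenue": [10, 20, 30, 40, 50],
}


def build_index(column):
    """Value -> list of tuple IDs (a dict standing in for a real index)."""
    index = {}
    for tuple_id, value in enumerate(column):
        index.setdefault(value, []).append(tuple_id)
    return index


indexes = {name: build_index(col) for name, col in columns.items()}


def where_city(city):
    """WHERE phase: the value -> tuple ID mapping is exactly what is needed."""
    return indexes["city"].get(city, [])


def fetch_via_columns(tuple_ids, attrs):
    """SELECT phase with columns: one positional access per (tuple, attribute)."""
    return [tuple(columns[a][tid] for a in attrs) for tid in tuple_ids]


def fetch_via_indexes(tuple_ids, attrs):
    """SELECT phase with only indexes: search each attribute's index for every tuple ID."""
    keys_examined = 0
    result = []
    for tid in tuple_ids:
        row = []
        for a in attrs:
            for value, posting in indexes[a].items():
                keys_examined += 1
                if tid in posting:
                    row.append(value)
                    break
        result.append(tuple(row))
    return result, keys_examined


matching = where_city("Denver")
via_columns = fetch_via_columns(matching, ["nation", "revenue"])
via_indexes, keys_examined = fetch_via_indexes(matching, ["nation", "revenue"])
```

Both strategies produce the same rows. The index-only path, however, performs a search per tuple per attribute, which is the join-like extra work described above, while the column path does a single positional access each time.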

Hence, an index and a column are quite different data structures. Of course, there are some situations where what you really want is an index and not a column. For example, if you had a query workload with a lot of "needle-in-the-haystack" queries (queries with very selective predicates), you need to use a lot of indexes. If you have the incorrect perception that a column-store is pretty much the same as a row-store with an index on every column, you might be tempted to use a column-store. In fact, what you really want is a heavily indexed database (either a row-store, an indexed column-store, or a column-store with multiple redundant sort orders).

An astute reader might ask the question: what if the row-store was able to have indexes that mapped tuple-ID to value instead of the other way around? We studied that idea too, and although this significantly improves the performance of the all-index approach, it still does not approach the performance of the column-store. We will explain why this is the case in a future blog post.

(Ed.  This article was co-authored by Sam Madden)


Giving up on Work e-mail - Status Report on Week 22 (Start Controlling Your e-mail Addiction)

Data Management Blogs - Fri, 18/07/2008 - 15:55

As I am starting the process to wrap up another very interesting week at work with plenty of things happening to get things going

Oracle Control File - more than just to multiplex

Data Management Blogs - Fri, 18/07/2008 - 14:17
Don't take the information stored in an Oracle control file for granted; make use of it.

Oracle denial of service requests and the Oracle Library Cache

Data Management Blogs - Fri, 18/07/2008 - 13:31
Oracle's library cache is one such internal structure: once you understand it, it can help eliminate some very nasty denial of service requests originating from application users.

Bryce's Pet Peeve - Polls - 7/21/2008

Data Management Blogs - Fri, 18/07/2008 - 09:14
My Pet Peeve of the Week is

MGMT VISIONS - Recognizing the Peter Principle - July 21, 2008

Data Management Blogs - Fri, 18/07/2008 - 08:52
This week my essay is entitled,

Database Column contributor: Daniel Abadi

The Database Column - Fri, 18/07/2008 - 03:52
Daniel's research interests are in database system architecture and implementation, cloud computing, and the Semantic Web. He currently serves on the Yale computer science faculty as an Assistant Professor. At Yale he teaches both undergraduate and graduate level classes on database systems, and directs DR@Y, the database research group at Yale. Before joining Yale, he spent four years at the Massachusetts Institute of Technology, where he published numerous papers on column-store databases, led the C-Store development effort, and wrote his Ph.D. dissertation on "Query Execution in Column-Oriented Database Systems". Daniel has been a recipient of a Churchill Scholarship, an NSF Graduate Research Fellowship, and a VLDB best paper award.

For more, Daniel's Website can be found at: http://cs-www.cs.yale.edu/homes/dna/

I Have a Spreadsheet. Why Do I Need Customer Support Software?

Data Management Blogs - Thu, 17/07/2008 - 22:46
No more lost information, missed SLAs, or incidents falling through the cracks. No more calling multiple numbers to find your customer. No more attempting to update a gargantuan spreadsheet without errors or promising yourself you will design the perfect database for all this information when you have the time.

PostgreSQL at OSCON

Data Management Blogs - Thu, 17/07/2008 - 22:06
As always, we're going to have a lot of PostgreSQL activity around OSCON. Here's my list of things to see and do ...

Is Email In Danger? - Depends on Its Ability to Evolve to Meet Your Needs

Data Management Blogs - Thu, 17/07/2008 - 21:49

Now that this blog has turned itself into A KM Blog Thinking Outside the Inbox, I thought I would keep things going further by commenting on interesting links I have bumped into over the last few days and which basically touch on the subject of how e-mail

Learning About Process

Data Management Blogs - Thu, 17/07/2008 - 19:40
In any project you are creating something that must do something. Mostly we think about the "what" in our projects and too often we don't think of the "why". To be an excellent business analyst you must never stop asking WHY. There are some techniques that I use even if I have very specific requirements documents. The reason is that most of the time a user will tell you not just their requirement

Hey, aren't you Matt Moran... I Am A Star!

Data Management Blogs - Thu, 17/07/2008 - 13:12

I am preparing to head out to Chicago but I just remembered a story I have to relay.

Jess and I were at the gym about 2 weeks ago.  I was walking to the water fountain when a guy stopped me and said, "Hey, aren't you Matt Moran, the songwriter?"

"Yesss....," I replied, a little bit surprised.

"My wife and I have seen you play

ACID vs BASE, Part 2

Data Management Blogs - Wed, 16/07/2008 - 15:44

LewisC's An Expert's Guide To Oracle Technology

I know I promised to write about my ideal job, but I decided to continue the thought stream I started the other day in ACID vs BASE. I decided to do a little bit of research on the

RSU0806 is now available

Data Management Blogs - Wed, 16/07/2008 - 13:15
Testing is complete for a new RSU, and IBM's Consolidated Service Test (CST) Web site has been updated to the latest RSU service package, RSU0806.

Getting Ready for Chicago - House Concert this Saturday

Data Management Blogs - Wed, 16/07/2008 - 12:50
If you are in Chicago, consider dropping me a line and coming out on Saturday...

Recognizing the Peter Principle

Data Management Blogs - Wed, 16/07/2008 - 10:54
Describes how to identify the attributes of the Peter Principle.

I Freed Myself From E-Mail's Grip - Additional Commentary (Part II)

Data Management Blogs - Wed, 16/07/2008 - 09:08

Here is Part II of that extended commentary on "I Freed Myself From E-Mail's Grip", where I keep adding further arguments on why e-mail is perhaps not the best of the collaboration and knowledge sharing tools available out there ...
