In SQL Server 2012, we introduced the memory optimized columnstore index. This resulted in huge query performance improvements but columnstore indexes had one significant disadvantage – they couldn’t be updated directly but had to be rebuilt on a base table. In SQL Server 2012, anyone wanting to take advantage of a this new index and also wanting to update the data behind it had to either drop the index and update the data, do some partition switching or unioning queries to the columnstore index and a regular row store containing the changing data. These workarounds are no longer required in SQL Server 2014 as we have enhanced the columnstore to be a pure columnar store, so indexing is no longer required..
In this next version, we’re introducing enhancements to the In-Memory ColumnStore for data warehousing implemented as a clustered columnstore index (or CCI) on a table. The data in a CCI is grouped and stored for every column in the table. Unlike the columnstore index, the CCI is the data – there is no other underlying data structure.
(Short aside - this might be a little confusing especially to those that haven’t lived and breathed SQL Server for the last little while. SQL has had two traditional table structures - heaps and clustered tables. Heaps, like the name suggests, aren’t ordered by anything. Clustered tables are ordered by the clustered index – in fact, the clustered index is the table. So CCI is analogous to a clustered index in that it represents the table as well.)
Unlike a columnstore index, CCIs are updateable just like any other table. (The internal details of what happens under the covers when updating a CCI are interesting and we’ll get to that in a separate post.) Now that the DML problem associated with columnstore indexes have gone away with CCIs, we can now make apples to apples comparison between columnstore and row store.
Organizing the table by column provides substantial advantages. The first one is improved query performance. When data is organized and compressed by individual columns, scan operations are significantly faster. We typically expect data warehousing type queries (ie, queries with aggregates involving large scans) to be about 10x faster (no kidding!). I’ve seen some almost unbelievable performance numbers much better than that. So if you’re developing or managing a data warehousing application, you owe yourself a favor to check this out solely for the query performance.
Second thing to expect with CCI is about a 2x improvement in data compression compared to row store page compression. Mileage may vary based on variability of values in each column, but results in this regard have been pretty consistent in my experience.
There’s also a new compression format “COLUMNSTORE_ARCHIVE” that’s available for columnstore indexes only. With the new archival compression, we can further compress the data which results in even less disk space being used. Internal tests have shown compression rates roughly double with the combination of a columnstore index and archival compression.
In our next blog entries, we’ll go into more details and show examples on how the new in-memory columnstore functionality works.
SQL Server 2014 CTP1 is now available for download here.