This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.
Data warehouse technology options
In this section, we discuss options available for building a data warehouse: row-oriented databases, column-oriented databases, and massively parallel processing architectures.
Row-oriented databases
Row-oriented databases typically store whole rows in a physical block. High performance for read operations is achieved through secondary indexes. Databases such as Oracle Database Server, Microsoft SQL Server, MySQL, and PostgreSQL are row-oriented database systems. These systems have been traditionally used for data warehousing, but they are better suited for transactional processing (OLTP) than for analytics.
To optimize performance of a row-based system used as a data warehouse, developers use a number of techniques, including:
-
Building materialized views
-
Creating pre-aggregated rollup tables
-
Building indexes on every possible predicate combination
-
Implementing data partitioning to leverage partition pruning by query optimizer
-
Performing index-based joins
Traditional row-based data stores are limited by the resources
available on a single machine. Data marts alleviate the problem,
to an extent, by using functional
sharding
In a row-based data warehouse, every query has to read through all of the columns for all of the rows in the blocks that satisfy the query predicate, including columns you didn’t choose. This approach creates a significant performance bottleneck in data warehouses, where your tables have more columns, but your queries use only a few.
Column-oriented databases
Column-oriented databases organize each column in its own set of physical blocks instead of packing the whole rows into a block. This functionality allows them to be more input/output (I/O) efficient for read-only queries, because they have to read only those columns accessed by a query from disk (or from memory). This approach makes column-oriented databases a better choice than row-oriented databases for data warehousing.
Figure 3 illustrates the primary difference between row-oriented and column-oriented databases. Rows are packed into their own blocks in a row-oriented database, and columns are packed into their own blocks in a column-oriented database.

Row-oriented vs. column-oriented databases
After faster I/O, the next biggest benefit to using a column-oriented database is improved compression. Because every column is packed into its own set of blocks, every physical block contains the same data type. When all the data is the same data type, the database can use extremely efficient compression algorithms. As a result, you need less storage compared to a row-oriented database. This approach also results in significantly lesser I/O, because the same data is stored in fewer blocks.
Some column-oriented databases that are used for data warehousing
include Amazon Redshift,
Vertica
Massively Parallel Processing (MPP) architectures
An MPP architecture enables you to use all the resources available
in the cluster for processing data, which dramatically increases
performance of petabyte scale data warehouses. MPP data warehouses
allow you improve performance by simply adding more nodes to the
cluster. Amazon Redshift, Druid, Vertica, Greenplum, and Teradata
Aster are some of the data warehouses built on an MPP
architecture. Open-source frameworks such as
Hadoop