Byte-dictionary encoding - Amazon Redshift

Byte-dictionary encoding

In byte dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-byte values that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block.

This encoding is very effective when a column contains a limited number of unique values. This encoding is optimal when the data domain of a column is fewer than 256 unique values. Byte-dictionary encoding is especially space-efficient if a CHAR column holds long character strings.

Note

Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly recommend using a different encoding, such as LZO, for VARCHAR columns.

Suppose that a table has a COUNTRY column with a CHAR(30) data type. As data is loaded, Amazon Redshift creates the dictionary and populates the COUNTRY column with the index value. The dictionary contains the indexed unique values, and the table itself contains only the one-byte subscripts of the corresponding values.

Note

Trailing blanks are stored for fixed-length character columns. Therefore, in a CHAR(30) column, every compressed value saves 29 bytes of storage when you use the byte-dictionary encoding.

The following table represents the dictionary for the COUNTRY column.

Unique data value Dictionary index Size (fixed length, 30 bytes per value)
England 0 30
United States of America 1 30
Venezuela 2 30
Sri Lanka 3 30
Argentina 4 30
Japan 5 30
Total 180

The following table represents the values in the COUNTRY column.

Original data value Original size (fixed length, 30 bytes per value) Compressed value (index) New size (bytes)
England 30 0 1
England 30 0 1
United States of America 30 1 1
United States of America 30 1 1
Venezuela 30 2 1
Sri Lanka 30 3 1
Argentina 30 4 1
Japan 30 5 1
Sri Lanka 30 3 1
Argentina 30 4 1
Total 300 10

The total compressed size in this example is calculated as follows: 6 different entries are stored in the dictionary (6 * 30 = 180), and the table contains 10 1-byte compressed values, for a total of 190 bytes.