Byte-dictionary encoding
In byte-dictionary encoding, Amazon Redshift creates a separate dictionary of unique values for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary holds up to 256 unique values, each referenced by a one-byte index. If a single block contains more than 256 unique values, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block.
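The per-block behavior described above can be sketched in a few lines of Python. This is only a conceptual model, not Amazon Redshift's actual on-disk format: the function name `bytedict_encode_block` and the representation of a block as a Python list are assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

# A one-byte index can address at most 256 dictionary entries.
MAX_DICT_ENTRIES = 256


def bytedict_encode_block(
    values: List[str],
) -> Tuple[List[str], List[Union[int, str]]]:
    """Encode the values of one block (hypothetical helper, conceptual model only).

    Returns the block's dictionary and the encoded column, where each entry is
    either a dictionary index (int) or, once the dictionary is full, the raw
    uncompressed value (str).
    """
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    encoded: List[Union[int, str]] = []

    for value in values:
        if value in index_of:
            # Value already has a dictionary entry: store its one-byte index.
            encoded.append(index_of[value])
        elif len(dictionary) < MAX_DICT_ENTRIES:
            # New unique value and the dictionary still has room: add it.
            index_of[value] = len(dictionary)
            dictionary.append(value)
            encoded.append(index_of[value])
        else:
            # Dictionary is full: the value is kept in raw form.
            encoded.append(value)

    return dictionary, encoded
```

For example, `bytedict_encode_block(["England", "England", "Japan"])` yields the dictionary `["England", "Japan"]` and the encoded column `[0, 0, 1]`. Because each block gets its own dictionary, the function would be called once per block.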
This encoding is very effective when a column contains a limited number of unique values, and it is optimal when the column's data domain has fewer than 256 unique values. Byte-dictionary encoding is especially space-efficient if a CHAR column holds long character strings.
Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly recommend using a different encoding, such as LZO, for VARCHAR columns.
Suppose that a table has a COUNTRY column with a CHAR(30) data type. As data is loaded, Amazon Redshift creates the dictionary and populates the COUNTRY column with the index value. The dictionary contains the indexed unique values, and the table itself contains only the one-byte subscripts of the corresponding values.
Trailing blanks are stored for fixed-length character columns. Therefore, in a CHAR(30) column, every compressed value saves 29 bytes of storage when you use the byte-dictionary encoding.
The following table represents the dictionary for the COUNTRY column.
Unique data value | Dictionary index | Size (fixed length, 30 bytes per value) |
---|---|---|
England | 0 | 30 |
United States of America | 1 | 30 |
Venezuela | 2 | 30 |
Sri Lanka | 3 | 30 |
Argentina | 4 | 30 |
Japan | 5 | 30 |
Total | | 180 |
The following table represents the values in the COUNTRY column.
Original data value | Original size (fixed length, 30 bytes per value) | Compressed value (index) | New size (bytes) |
---|---|---|---|
England | 30 | 0 | 1 |
England | 30 | 0 | 1 |
United States of America | 30 | 1 | 1 |
United States of America | 30 | 1 | 1 |
Venezuela | 30 | 2 | 1 |
Sri Lanka | 30 | 3 | 1 |
Argentina | 30 | 4 | 1 |
Japan | 30 | 5 | 1 |
Sri Lanka | 30 | 3 | 1 |
Argentina | 30 | 4 | 1 |
Total | 300 | | 10 |
The total compressed size in this example is calculated as follows: the dictionary stores 6 unique entries (6 * 30 = 180 bytes), and the table contains 10 one-byte compressed values (10 bytes), for a total of 190 bytes, compared with 300 bytes for the uncompressed column.
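The arithmetic for the COUNTRY example can be checked with a short Python sketch. The helper name `bytedict_sizes` is hypothetical, and the calculation counts only the dictionary entries and the one-byte indexes, as the tables do. It ignores any block-level overhead.

```python
CHAR_WIDTH = 30   # CHAR(30): every value occupies 30 bytes, trailing blanks included
INDEX_WIDTH = 1   # each compressed value is a one-byte dictionary index

country_column = [
    "England", "England",
    "United States of America", "United States of America",
    "Venezuela", "Sri Lanka", "Argentina", "Japan",
    "Sri Lanka", "Argentina",
]


def bytedict_sizes(values, char_width=CHAR_WIDTH, index_width=INDEX_WIDTH):
    """Return (original, dictionary, indexes, total) sizes in bytes."""
    unique_values = list(dict.fromkeys(values))  # unique values, first-seen order
    original = len(values) * char_width          # uncompressed column size
    dictionary = len(unique_values) * char_width # one fixed-width entry per unique value
    indexes = len(values) * index_width          # one index per row
    return original, dictionary, indexes, dictionary + indexes


original, dictionary, indexes, total = bytedict_sizes(country_column)
print(original, dictionary, indexes, total)  # 300 180 10 190
```

With only 10 rows the dictionary dominates the compressed size. The saving grows as more rows share the same few values, because each additional row costs 1 byte instead of 30.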