Mostly encoding - Amazon Redshift

Mostly encoding

Mostly encodings are useful when the data type for a column is larger than most of the stored values require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column, to 8-bit storage.

In general, the mostly encodings work with the following data types:

  • SMALLINT/INT2 (16-bit)

  • INTEGER/INT (32-bit)

  • BIGINT/INT8 (64-bit)

  • DECIMAL/NUMERIC (64-bit)

Choose the appropriate variation of the mostly encoding to suit the size of the data type for the column. For example, apply MOSTLY8 to a column that is defined as a 16-bit integer column. Applying MOSTLY16 to a column with a 16-bit data type or MOSTLY32 to a column with a 32-bit data type is disallowed.

Mostly encodings might be less effective than no compression when a relatively high number of the values in the column can't be compressed. Before applying one of these encodings to a column, perform a check. Most of the values that you are going to load now (and are likely to load in the future) should fit into the ranges shown in the following table.

Encoding Compressed storage size Range of values that can be compressed (values outside the range are stored raw)
MOSTLY8 1 byte (8 bits) -128 to 127
MOSTLY16 2 bytes (16 bits) -32768 to 32767
MOSTLY32 4 bytes (32 bits) -2147483648 to +2147483647

For decimal values, ignore the decimal point to determine whether the value fits into the range. For example, 1,234.56 is treated as 123,456 and can be compressed in a MOSTLY32 column.

For example, the VENUEID column in the VENUE table is defined as a raw integer column, which means that its values consume 4 bytes of storage. However, the current range of values in the column is 0 to 309. Therefore, recreating and reloading this table with MOSTLY16 encoding for VENUEID would reduce the storage of every value in that column to 2 bytes.

If the VENUEID values referenced in another table were mostly in the range of 0 to 127, it might make sense to encode that foreign-key column as MOSTLY8. Before making the choice, run several queries against the referencing table data to find out whether the values mostly fall into the 8-bit, 16-bit, or 32-bit range.

The following table shows compressed sizes for specific numeric values when the MOSTLY8, MOSTLY16, and MOSTLY32 encodings are used:

Original value Original INT or BIGINT size (bytes) MOSTLY8 compressed size (bytes) MOSTLY16 compressed size (bytes) MOSTLY32 compressed size (bytes)
1 4 1 2 4
10 4 1 2 4
100 4 1 2 4
1000 4 Same as raw data size 2 4
10000 4 2 4
20000 4 2 4
40000 8 Same as raw data size 4
100000 8 4
2000000000 8 4