
Feature encoding in Neptune ML

Property values come in different formats and data types. To achieve good performance in machine learning, it is essential to convert those values to numerical encodings known as features.

Neptune ML performs feature extraction and encoding as part of the data-export and data-processing steps, using feature-encoding techniques described here.

Note

If you plan to implement your own feature encoding in a custom model implementation, you can disable the automatic feature encoding in the data preprocessing stage by selecting none as the feature encoding type. No feature encoding then occurs on that node or edge property, and instead the raw property values are parsed and saved in a dictionary. Data preprocessing still creates the DGL graph from the exported dataset, but the constructed DGL graph doesn't have the pre-processed features for training.

You should use this option only if you plan to perform your custom feature encoding as part of custom model training. See Custom models in Neptune ML for details.

Categorical features in Neptune ML

A property that can take one or more distinct values from a fixed list of possible values is a categorical feature. In Neptune ML, categorical features are encoded using one-hot encoding. The following example shows how the property name of different foods is one-hot encoded according to its category:

Food       Veg.   Meat   Fruit   Encoding
---------  -----  -----  ------  --------
Apple       0      0      1       001
Chicken     0      1      0       010
Broccoli    1      0      0       100

Note

The maximum number of categories in any categorical feature is 100. If a property has more than 100 distinct values, only the 99 most common of them are placed in distinct categories, and the rest are placed in a special category named OTHER.
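The one-hot scheme and the category cap described above can be sketched as follows. This helper is illustrative only, not Neptune ML's implementation; the cap of 100 categories (99 distinct plus OTHER) follows the note above:

```python
from collections import Counter

def one_hot_encode(values, max_categories=100):
    # keep the (max_categories - 1) most common values as distinct
    # categories; everything else maps to the special OTHER category
    counts = Counter(values)
    top = [v for v, _ in counts.most_common(max_categories - 1)]
    categories = top + ["OTHER"]
    index = {c: i for i, c in enumerate(categories)}

    def encode(value):
        vec = [0] * len(categories)
        vec[index.get(value, index["OTHER"])] = 1
        return vec

    return categories, encode

categories, encode = one_hot_encode(["Apple", "Chicken", "Broccoli", "Apple"])
print(categories)       # ['Apple', 'Chicken', 'Broccoli', 'OTHER']
print(encode("Apple"))  # [1, 0, 0, 0]
```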

Numerical features in Neptune ML

Any property whose values are real numbers can be encoded as a numerical feature in Neptune ML. Numerical features are encoded using floating-point numbers.

You can specify a data-normalization method to use when encoding numerical features, like this: "norm": "normalization technique". The following normalization techniques are supported:

  • "none"   –   Don't normalize the numerical values during encoding.

  • "min-max"   –   Normalize each value by subtracting the minimum value from it and then dividing it by the difference between the maximum value and the minimum.

  • "standard"   –   Normalize each value by dividing it by the sum of all the values.

Bucket-numerical features in Neptune ML

Rather than representing a numerical property using raw numbers, you can condense numerical values into categories. For example, you could divide people's ages into categories such as kids (0-20), young adults (20-40), middle-aged people (40-60) and elders (from 60 on). Using these numerical buckets, you would be transforming a numerical property into a kind of categorical feature.

In Neptune ML, to have a numerical property encoded as a bucket-numerical feature, you must provide two things:

  • A numerical range in the form "range": [a, b], where a and b are integers.

  • A bucket count in the form "bucket_cnt": c, where c is the number of buckets, also an integer.

Neptune ML then calculates the size of each bucket as ( b - a ) / c, and encodes each numeric value as the number of the bucket it falls into. Any value less than a is considered to belong in the first bucket, and any value greater than b is considered to belong in the last bucket.

You can also, optionally, make numeric values fall into more than one bucket, by specifying a slide-window size, like this: "slide_window_size": s , where s is a number. Neptune ML then transforms each numeric value v of the property into a range from v - s/2 through v + s/2 , and assigns the value v to every bucket that the range covers.
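Putting the range, bucket count, and slide window together, the bucket assignment can be sketched like this. The helper below is hypothetical, not Neptune ML code; it follows the rules above (bucket size (b - a) / c, clamping below a and above b, and a window of v - s/2 through v + s/2):

```python
def bucket_indices(v, a, b, bucket_cnt, slide_window_size=0.0):
    size = (b - a) / bucket_cnt  # bucket size is (b - a) / c

    def clamp(x):
        # values below a land in the first bucket, values above b in the last
        return min(max(int((x - a) // size), 0), bucket_cnt - 1)

    # a slide window of size s spreads v over [v - s/2, v + s/2]
    lo = clamp(v - slide_window_size / 2)
    hi = clamp(v + slide_window_size / 2)
    return list(range(lo, hi + 1))

# ages 0-80 split into 4 buckets of size 20
print(bucket_indices(25, 0, 80, 4))                        # [1]
print(bucket_indices(95, 0, 80, 4))                        # [3]
print(bucket_indices(22, 0, 80, 4, slide_window_size=10))  # [0, 1]
```

In the last call, the window 17 through 27 overlaps both the first and second buckets, so the value 22 is assigned to both.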

Finally, you can also optionally provide a way of filling in missing values for numerical and bucket-numerical features. You do this using "imputer": "imputation technique", where the imputation technique is one of "mean", "median", or "most-frequent". If you don't specify an imputer, a missing value can cause processing to halt.
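As an illustration of what the three imputation techniques do, the sketch below fills missing (None) entries; it is a stand-in for the behavior described, not Neptune ML's implementation:

```python
from collections import Counter
from statistics import mean, median

def impute(values, technique="mean"):
    present = [v for v in values if v is not None]
    if technique == "mean":
        fill = mean(present)
    elif technique == "median":
        fill = median(present)
    elif technique == "most-frequent":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown imputer: {technique}")
    return [fill if v is None else v for v in values]

print(impute([1, None, 3]))                      # [1, 2, 3]
print(impute([5, None, 5, 9], "most-frequent"))  # [5, 5, 5, 9]
```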

Text Word2Vec features in Neptune ML

Neptune ML can convert a string property value consisting of a sequence of tokens into a text_word2vec feature. This encodes the tokens in the string as a dense vector using one of the spaCy trained models (Neptune ML currently only supports the English en_core_web_lg model).
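The core idea is that each token maps to a dense vector, and the token vectors combine into one fixed-size feature vector for the string. As a rough illustration only, the toy sketch below averages per-token vectors from a made-up two-dimensional table (spaCy's real en_core_web_lg vectors are 300-dimensional, and this is not Neptune ML's actual pipeline):

```python
# made-up 2-dimensional embedding table standing in for the
# 300-dimensional spaCy en_core_web_lg vectors
embeddings = {"kiss": [1.0, 0.0], "bang": [0.0, 1.0]}

def text_to_vector(text, table):
    # average the per-token dense vectors into one fixed-size vector
    vectors = [table[t] for t in text.split() if t in table]
    dim = len(next(iter(table.values())))
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(text_to_vector("kiss kiss bang bang", embeddings))  # [0.5, 0.5]
```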

Text TF-IDF features in Neptune ML

TF-IDF (term frequency – inverse document frequency) is a numerical value intended to measure how important a word is in a document set. It is calculated by dividing the number of times a word appears in a given property value by the total number of such property values that it appears in.

For example, if the word "kiss" appears twice in a given movie title (say, "kiss kiss bang bang"), and "kiss" appears in the title of 4 movies in all, then the TF-IDF value of "kiss" in the "kiss kiss bang bang" title would be 2 / 4 .
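That calculation can be written out directly. The sketch below implements the simplified form described above (occurrences in one property value divided by the number of property values containing the word); the extra titles in the corpus are illustrative:

```python
def tfidf_value(word, title, corpus):
    # occurrences of the word in this property value...
    tf = title.split().count(word)
    # ...divided by the number of property values it appears in
    df = sum(1 for t in corpus if word in t.split())
    return tf / df

titles = ["kiss kiss bang bang", "kiss of death",
          "a kiss before dying", "first kiss"]
print(tfidf_value("kiss", "kiss kiss bang bang", titles))  # 0.5
```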

Neptune ML can encode sentences or other free-form text property values as text_tfidf features. This encoding converts the sequence of words in the text into a numeric vector using a TF-IDF vectorizer, followed by a dimensionality-reduction operation.

The vector that is initially created has d dimensions, where d is the number of unique terms in all property values of that type. The dimensionality-reduction operation uses a random sparse projection to reduce that number to a maximum of 100. The vocabulary of a graph is then generated by merging all the text_tfidf features in it.

You can control the TF-IDF vectorizer in several ways:

  • max_features   –   Using the max_features parameter, you can limit the number of terms in text_tfidf features to the most common ones. For example, if you set max_features to 100, only the top 100 most commonly used terms are included. The default value for max_features if you don't explicitly set it is 5,000.

  • min_df   –   Using the min_df parameter, you can limit the number of terms in text_tfidf features to ones having at least a specified document frequency. For example, if you set min_df to 5, only terms that appear in at least 5 different property values are used. The default value for min_df if you don't explicitly set it is 2.

  • ngram_range   –   The ngram_range parameter determines what combinations of words are treated as terms. For example, if you set ngram_range to [2, 4], the following 6 terms would be found in the "kiss kiss bang bang" title:

    • 2-word terms:  "kiss kiss", "kiss bang", and "bang bang".

    • 3-word terms:  "kiss kiss bang" and "kiss bang bang".

    • 4-word terms:  "kiss kiss bang bang".

    The default setting for ngram_range is [1, 1].
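The term enumeration that an ngram_range setting implies can be sketched as follows; this helper only illustrates which word combinations become terms, not the full vectorizer:

```python
def ngrams(text, ngram_range):
    lo, hi = ngram_range
    words = text.split()
    terms = []
    # for each n in [lo, hi], collect every run of n consecutive words
    for n in range(lo, hi + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms

print(ngrams("kiss kiss bang bang", [2, 4]))
# ['kiss kiss', 'kiss bang', 'bang bang',
#  'kiss kiss bang', 'kiss bang bang', 'kiss kiss bang bang']
```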

Datetime features in Neptune ML

Neptune ML can convert parts of datetime property values into categorical features by encoding them as one-hot arrays. Use the datetime_parts parameter to specify one or more of the following parts to encode: ["year", "month", "weekday", "hour"]. If you don't set datetime_parts, by default all four parts are encoded.

For example, if the range of datetime values spans the years 2010 through 2012, the four parts of the datetime entry 2011-04-22 01:16:34 are as follows:

  • year   –   [0, 1, 0].

    Since there are only 3 years in the span (2010, 2011, and 2012), the one-hot array has three entries, one for each year.

  • month   –   [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0].

    Here, the one-hot array has an entry for each month of the year.

  • weekday   –   [0, 0, 0, 0, 1, 0, 0].

    The ISO 8601 standard states that Monday is the first day of the week, and since April 22, 2011 was a Friday, the corresponding one-hot weekday array is hot in the fifth position.

  • hour   –   [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

    The hour 1 AM is set in a 24-member one-hot array.

Day of the month, minute, and second are not encoded categorically.

If the total datetime range in question only includes dates within a single year, no year array is encoded.
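The worked example above can be reproduced with a short sketch. The helper below is illustrative, not Neptune ML code; it one-hot encodes the four parts, with the year array sized to the observed span:

```python
from datetime import datetime

def encode_datetime_parts(dt, first_year, last_year):
    def one_hot(index, size):
        vec = [0] * size
        vec[index] = 1
        return vec
    return {
        "year": one_hot(dt.year - first_year, last_year - first_year + 1),
        "month": one_hot(dt.month - 1, 12),
        "weekday": one_hot(dt.weekday(), 7),  # Monday is 0, per ISO 8601
        "hour": one_hot(dt.hour, 24),
    }

parts = encode_datetime_parts(datetime(2011, 4, 22, 1, 16, 34), 2010, 2012)
print(parts["year"])     # [0, 1, 0]
print(parts["weekday"])  # [0, 0, 0, 0, 1, 0, 0]  (a Friday)
```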

You can specify an imputation strategy to fill in missing datetime values, using the imputer parameter and one of the strategies available for numerical features.

Auto feature encoding in Neptune ML

Instead of manually specifying the feature encoding methods to use for the properties in your graph, you can set auto as a feature encoding method. Neptune ML then attempts to infer the best feature encoding for each property based on its underlying data type.

Here are some of the heuristics that Neptune ML uses in selecting the appropriate feature encodings:

  • If the property has only numeric values and can be cast into numeric data types, then Neptune ML generally encodes it as a numeric value. However, if the number of unique values for the property is less than 10% of the total number of values and there are fewer than 100 of them, then Neptune ML uses a categorical encoding instead.

  • If the property values can be cast to a datetime type, then Neptune ML encodes them as a datetime feature.

  • If the property values can be coerced to booleans (1/0 or True/False), then Neptune ML uses category encoding.

  • If the property is a string with more than 10% of its values unique, and the average number of tokens per value is greater than or equal to 3, then Neptune ML infers the property type to be text and uses the text_word2vec encoding.

  • If the property is a string that is not classified as a text feature, Neptune ML presumes it to be a categorical feature and uses category encoding.

  • If each node has its own unique value for a property that is inferred to be a category feature, Neptune ML drops the property from the training graph because it is probably an ID that would not be informative for learning.

  • If the property is known to contain valid Neptune separators such as semicolons (";"), then Neptune ML can only treat the property as MultiNumerical or MultiCategorical.

    • Neptune ML first tries to encode the values as numeric features. If this succeeds, Neptune ML uses numerical encoding to create numeric vector features.

    • Otherwise, Neptune ML encodes the values as multi-categorical.

  • If Neptune ML cannot infer the data type of a property's values, it drops the property from the training graph.
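A toy version of a few of these heuristics might look like the sketch below. The thresholds (10% unique values, fewer than 100 categories, 3 tokens on average) follow the list above, but the function is an assumption-laden illustration, not Neptune ML's actual inference logic:

```python
def infer_encoding(values, unique_ratio=0.10, min_tokens=3):
    unique = set(values)
    if all(isinstance(v, (int, float)) for v in values):
        # numeric, but low-cardinality numerics become categorical
        if len(unique) < unique_ratio * len(values) and len(unique) < 100:
            return "category"
        return "numerical"
    if all(isinstance(v, str) for v in values):
        avg_tokens = sum(len(v.split()) for v in values) / len(values)
        # mostly-unique, multi-token strings are treated as text
        if len(unique) > unique_ratio * len(values) and avg_tokens >= min_tokens:
            return "text_word2vec"
        return "category"
    return None  # unknown data type: the property would be dropped

print(infer_encoding([1.5, 2.0, 3.25, 4.0]))   # numerical
print(infer_encoding(["red", "blue", "red"]))  # category
```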