The features field in neptune_ml - Amazon Neptune

Property values and RDF literals come in different formats and data types. To achieve good performance in machine learning, it is essential to convert those values to numerical encodings known as features.

Neptune ML performs feature extraction and encoding as part of the data-export and data-processing steps, as described in Feature encoding in Neptune ML.

The features field contains a JSON array of node property features. Each object in the array can contain the fields described in the following sections.

The node field in features

The node field specifies a property-graph label of a feature vertex. For example:

"node": "Person"

If a vertex has multiple labels, use an array to contain them. For example:

"node": ["Admin", "Person"]

The property field in features

Use the property parameter to specify a property of the vertex identified by the node parameter. For example:

"property" : "age"

Possible values of the type field for features

The type parameter specifies the type of feature being defined. For example:

"type": "bucket_numerical"

Possible values of the type parameter

  • "auto"   –   Specifies that Neptune ML should automatically detect the property type and apply an appropriate feature encoding. An auto feature can also have an optional separator field.

    See Auto feature encoding in Neptune ML.

  • "category"   –   This feature encoding represents a property value as one of a number of categories. In other words, the feature can take one or more discrete values. A category feature can also have an optional separator field.

    See Categorical features in Neptune ML.

  • "numerical"   –   This feature encoding represents numerical property values as numbers in a continuous interval where "greater than" and "less than" have meaning.

    A numerical feature can also have optional norm, imputer, and separator fields.

    See Numerical features in Neptune ML.

  • "bucket_numerical"   –   This feature encoding divides numerical property values into a set of buckets or categories.

    For example, you could encode people's ages in 4 buckets: kids (0-20), young-adults (20-40), middle-aged (40-60), and elders (60 and up).

    A bucket_numerical feature requires a range and a bucket_cnt field, and can optionally also include an imputer and/or slide_window_size field.

    See Bucket-numerical features in Neptune ML.

  • "datetime"   –   This feature encoding represents a datetime property value as an array of these categorical features: year, month, weekday, and hour.

    One or more of these four categories can be eliminated using the datetime_parts parameter.

    See Datetime features in Neptune ML.

  • "text_tfidf"   –   This feature encoding converts property values that consist of sentences or free-form text into numeric vectors using a TF-IDF vectorizer.

    A text_tfidf feature can also have optional ngram_range, min_df, and max_features fields, each of which has a default value.

    See Text TF-IDF features in Neptune ML.

  • "text_word2vec"   –   This feature encoding converts property values that consist of sentences or free-form text into numeric vectors using Word2Vec models.

    A text_word2vec feature can also have an optional language field.

    See Text Word2Vec features in Neptune ML.

  • "none"   –   Using the none type causes no feature encoding to occur. The raw property values are parsed and saved instead.

    Use none only if you plan to perform your own custom feature encoding as part of custom model training.
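As a rough illustration of how the node, property, and type fields combine with the type-specific fields listed above, the following sketch builds a features array as a Python list and serializes it to JSON. The labels and property names ("Person", "age", "Movie", "genre", "signup_time") are hypothetical and are used only for illustration.

```python
import json

# Hypothetical feature definitions. The node labels and property names
# are illustrative only and do not come from a real graph.
features = [
    {
        "node": "Person",
        "property": "age",
        "type": "bucket_numerical",
        "range": [20, 100],
        "bucket_cnt": 10,
        "slide_window_size": 5,
        "imputer": "median",
    },
    {
        # A vertex with multiple labels uses an array for "node".
        "node": ["Admin", "Person"],
        "property": "signup_time",
        "type": "datetime",
        "datetime_parts": ["weekday", "hour"],
    },
    {
        "node": "Movie",
        "property": "genre",
        "type": "category",
        "separator": ";",
    },
]

# Serialize the list to the JSON array form used in the features field.
features_json = json.dumps(features, indent=2)
print(features_json)
```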

The norm field

This field is required for numerical features. It specifies a normalization method to use on numeric values:

"norm": "min-max"

The following normalization methods are supported:

  • "min-max"   –   Normalize each value by subtracting the minimum value from it and then dividing it by the difference between the maximum value and the minimum.

  • "standard"   –   Normalize each value by dividing it by the sum of all the values.

  • "none"   –   Don't normalize the numerical values during encoding.

See Numerical features in Neptune ML.
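The two normalization methods can be sketched directly from the descriptions above. This is a minimal illustration that follows the wording in this section, not Neptune ML's actual code. The function names and the handling of a constant column are assumptions.

```python
def min_max(values):
    """Compute (v - min) / (max - min), as described for "min-max"."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: map a constant column to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard(values):
    """Compute v / sum(values), following the description of "standard" above."""
    total = sum(values)
    # Assumption: the sum of the values is non-zero.
    return [v / total for v in values]


ages = [20, 30, 40, 60]
print(min_max(ages))   # [0.0, 0.25, 0.5, 1.0]
print(standard(ages))  # the results sum to 1
```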

The language field

This field is required for text_word2vec features. It specifies the name of the language model used to encode the property value string:

"language" : "en_core_web_lg"

Neptune ML currently supports only "en_core_web_lg" (for English). A language model works only with its target language, so the output embedding is not guaranteed to be valid if the property text is in a different language.

See Text Word2Vec features in Neptune ML.

The separator field

This field is used optionally with category, numerical and auto features. It specifies a character that can be used to split a property value into multiple categorical values or numerical values:

"separator": ";"

Only use the separator field when the property stores multiple delimited values in a single string, such as "Actor;Director" or "0.1;0.2".

See Categorical features in Neptune ML, Numerical features in Neptune ML, and Auto feature encoding in Neptune ML.
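The effect of a separator can be pictured with a small sketch. The helper name split_values is hypothetical, and trimming whitespace and dropping empty pieces are assumptions that this section does not specify.

```python
def split_values(raw, separator=None):
    """Split one stored string into multiple values on the separator."""
    if separator is None:
        # Without a separator, the whole string is a single value.
        return [raw]
    # Assumption: surrounding whitespace is trimmed and empty pieces dropped.
    return [part.strip() for part in raw.split(separator) if part.strip()]


roles = split_values("Actor;Director", ";")
scores = [float(p) for p in split_values("0.1;0.2", ";")]
print(roles)   # ['Actor', 'Director']
print(scores)  # [0.1, 0.2]
```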

The range field

This field is required for bucket_numerical features. It specifies the range of numerical values that are to be divided into buckets, in the format [lower-bound, upper-bound]:

"range" : [20, 100]

If a property value is smaller than the lower bound, it is assigned to the first bucket. If it is larger than the upper bound, it is assigned to the last bucket.

See Bucket-numerical features in Neptune ML.

The bucket_cnt field

This field is required for bucket_numerical features. It specifies the number of buckets that the numerical range defined by the range parameter should be divided into:

"bucket_cnt": 10

See Bucket-numerical features in Neptune ML.
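To make the interaction of range and bucket_cnt concrete, the following sketch assigns a value to a single bucket. It assumes equal-width buckets with zero-based indexes, which this section does not state explicitly. The function name bucket_index is hypothetical.

```python
def bucket_index(value, value_range, bucket_cnt):
    """Return the zero-based bucket for a value, clamping out-of-range values."""
    lower, upper = value_range
    if value <= lower:
        return 0                  # below the range: first bucket
    if value >= upper:
        return bucket_cnt - 1     # above the range: last bucket
    # Assumption: the range is divided into buckets of equal width.
    width = (upper - lower) / bucket_cnt
    return min(int((value - lower) // width), bucket_cnt - 1)


# With "range": [20, 100] and "bucket_cnt": 10, each bucket is 8 units wide.
print(bucket_index(5, [20, 100], 10))    # 0
print(bucket_index(60, [20, 100], 10))   # 5
print(bucket_index(150, [20, 100], 10))  # 9
```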

The slide_window_size field

This field is used optionally with bucket_numerical features to assign values to more than one bucket:

"slide_window_size": 5

Neptune ML takes the window size s and transforms each numeric value v of a property into a range from v - s/2 through v + s/2. The value is then assigned to every bucket that the range overlaps.

See Bucket-numerical features in Neptune ML.
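The slide-window rule can be sketched as follows. As before, equal-width buckets and zero-based indexes are assumptions, as is the treatment of a window edge that falls exactly on a bucket boundary. The function name buckets_for_value is hypothetical.

```python
def buckets_for_value(value, value_range, bucket_cnt, slide_window_size=0):
    """Return every bucket overlapped by [v - s/2, v + s/2]."""
    lower, upper = value_range
    # Assumption: the range is divided into buckets of equal width.
    width = (upper - lower) / bucket_cnt
    half = slide_window_size / 2

    def index(x):
        # Clamp to the range so out-of-range edges land in the end buckets.
        clamped = min(max(x, lower), upper)
        return min(int((clamped - lower) // width), bucket_cnt - 1)

    first = index(value - half)
    last = index(value + half)
    return list(range(first, last + 1))


# "range": [20, 100], "bucket_cnt": 10, "slide_window_size": 5
print(buckets_for_value(27, [20, 100], 10, 5))  # [0, 1]: window 24.5 to 29.5
print(buckets_for_value(40, [20, 100], 10, 5))  # [2]: window 37.5 to 42.5
```

A window size of 0 reduces this to ordinary single-bucket assignment.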

The imputer field

This field is used optionally with numerical and bucket_numerical features to provide an imputation technique for filling in missing values:

"imputer": "mean"

The supported imputation techniques are:

  • "mean"

  • "median"

  • "most-frequent"

If you don't include the imputer parameter, data preprocessing halts and exits when a missing value is encountered.

See Numerical features in Neptune ML and Bucket-numerical features in Neptune ML.
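The three imputation techniques can be illustrated with Python's statistics module. This is a sketch only: representing missing values as None and the function name impute are assumptions made for the example.

```python
import statistics


def impute(values, imputer):
    """Fill missing entries (None) using the named imputation technique."""
    present = [v for v in values if v is not None]
    if imputer == "mean":
        fill = statistics.mean(present)
    elif imputer == "median":
        fill = statistics.median(present)
    elif imputer == "most-frequent":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unsupported imputer: {imputer}")
    return [fill if v is None else v for v in values]


ages = [30, None, 40, 30, 60]
print(impute(ages, "mean"))           # missing value becomes 40
print(impute(ages, "median"))         # missing value becomes 35.0
print(impute(ages, "most-frequent"))  # missing value becomes 30
```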

The max_features field

This field is used optionally by text_tfidf features to specify the maximum number of terms to encode:

"max_features": 100

A setting of 100 causes the TF-IDF vectorizer to encode only the 100 most common terms. The default value if you don't include max_features is 5,000.

See Text TF-IDF features in Neptune ML.

The min_df field

This field is used optionally by text_tfidf features to specify the minimum document frequency of terms to encode:

"min_df": 5

A setting of 5 indicates that a term must appear in at least 5 different property values in order to be encoded.

The default value if you don't include the min_df parameter is 2.

See Text TF-IDF features in Neptune ML.

The ngram_range field

This field is used optionally by text_tfidf features to specify what size sequences of words or tokens should be considered as potential individual terms to encode:

"ngram_range": [2, 4]

The value [2, 4] specifies that sequences of 2, 3 and 4 words should be considered as potential individual terms.

The default if you don't explicitly set ngram_range is [1, 1], meaning that only single words or tokens are considered as terms to encode.

See Text TF-IDF features in Neptune ML.
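The way ngram_range, min_df, and max_features jointly limit the set of encoded terms can be sketched in plain Python. This covers only the term-selection step, not the TF-IDF weighting itself, and it is not the vectorizer that Neptune ML uses. Whitespace tokenization, lowercasing, and alphabetical tie-breaking are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, ngram_range):
    """Return all word sequences whose length falls within ngram_range."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_terms(documents, ngram_range=(1, 1), min_df=2, max_features=5000):
    """Pick the terms to encode: frequent enough, then the most common first."""
    doc_freq = Counter()    # number of property values containing the term
    total_freq = Counter()  # total occurrences across all property values
    for doc in documents:
        # Assumption: lowercase the text and split on whitespace.
        terms = ngrams(doc.lower().split(), ngram_range)
        total_freq.update(terms)
        doc_freq.update(set(terms))
    eligible = [t for t in total_freq if doc_freq[t] >= min_df]
    # Assumption: ties in frequency are broken alphabetically.
    eligible.sort(key=lambda t: (-total_freq[t], t))
    return eligible[:max_features]


docs = [
    "graph databases store connected data",
    "graph databases scale well",
    "machine learning on graph data",
]
print(select_terms(docs, ngram_range=(1, 2), min_df=2, max_features=3))
# ['graph', 'data', 'databases']
```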

The datetime_parts field

This field is used optionally by datetime features to specify which parts of the datetime value to encode categorically:

"datetime_parts": ["weekday", "hour"]

If you don't include datetime_parts, by default Neptune ML encodes the year, month, weekday and hour parts of the datetime value. The value ["weekday", "hour"] indicates that only the weekday and hour of datetime values should be encoded categorically in the feature.

If one of the parts does not have more than one unique value in the training set, it is not encoded.

See Datetime features in Neptune ML.
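The selection of datetime parts, including the rule that a part with only one unique value is not encoded, can be sketched as follows. The ISO 8601 input strings, the function name extract_parts, and Python's weekday numbering (Monday is 0) are assumptions made for this example.

```python
from datetime import datetime

DEFAULT_PARTS = ["year", "month", "weekday", "hour"]


def extract_parts(timestamps, datetime_parts=None):
    """Return the requested parts per timestamp, dropping constant parts."""
    parts = datetime_parts or DEFAULT_PARTS
    getters = {
        "year": lambda d: d.year,
        "month": lambda d: d.month,
        "weekday": lambda d: d.weekday(),  # Python convention: Monday is 0
        "hour": lambda d: d.hour,
    }
    # Assumption: timestamps are ISO 8601 strings.
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    columns = {p: [getters[p](d) for d in parsed] for p in parts}
    # A part with a single unique value carries no information, so skip it.
    return {p: vals for p, vals in columns.items() if len(set(vals)) > 1}


stamps = ["2021-03-01T09:30:00", "2021-03-02T17:00:00", "2021-03-06T09:15:00"]
print(extract_parts(stamps))
# {'weekday': [0, 1, 5], 'hour': [9, 17, 9]}
```

In this example, year and month are dropped even with the default parts because every timestamp falls in March 2021.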