TOKENIZATION
Splits text into smaller units, or tokens, such as individual words or terms.
Parameters
- sourceColumn – The name of an existing column.
- delimiter – A custom delimiter that appears between tokenized words. (The default behavior is to separate each token by a space.)
- expandContractions – If ENABLED, expands contracted words. For example: "don't" becomes "do not".
- stemmingMode – Stems each token, reducing a word to its root form (for example, "running" becomes "run"); tokens are produced as individual lowercase words or terms. Two stemming modes are available: PORTER | LANCASTER.
- stopWordRemovalMode – Removes common words such as a, an, and the.
- customStopWords – For stopWordRemovalMode, allows you to specify a custom list of stop words.
- targetColumn – The name of a column to contain the results.
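To make the effect of the delimiter, expandContractions, and stop-word parameters concrete, here is a minimal Python sketch of the same kind of processing. It is not the service's implementation: the function name `tokenize`, the tiny `CONTRACTIONS` and `STOP_WORDS` tables, and the order of the steps are all assumptions made for illustration, and stemming is left out.

```python
import re

# Illustrative tables only (assumptions); the real lists are not given in this reference.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"a", "an", "the"}


def tokenize(text, delimiter=" ", expand_contractions=False, stop_words=frozenset()):
    """Hypothetical helper: lowercase, optionally expand contractions,
    drop stop words, and join the remaining tokens with `delimiter`."""
    words = text.lower().split()
    if expand_contractions:
        # Replace each known contraction with its expanded words.
        words = [part for w in words for part in CONTRACTIONS.get(w, w).split()]
    tokens = []
    for w in words:
        # Keep runs of letters, digits, and apostrophes; drop other punctuation.
        tokens.extend(re.findall(r"[a-z0-9']+", w))
    tokens = [t for t in tokens if t not in stop_words]
    return delimiter.join(tokens)


result = tokenize(
    "Don't open the box",
    delimiter="- ",
    expand_contractions=True,
    stop_words=STOP_WORDS,
)
print(result)  # do- not- open- box
```

With the default arguments the function only lowercases the text and rejoins the tokens with a single space, which mirrors the default delimiter behavior described above.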
Example
{
    "Action": {
        "Operation": "TOKENIZATION",
        "Parameters": {
            "customStopWords": "[]",
            "delimiter": "- ",
            "expandContractions": "ENABLED",
            "sourceColumn": "dimensions",
            "stemmingMode": "PORTER",
            "stopWordRemovalMode": "DEFAULT",
            "targetColumn": "dimensions_tokenized"
        }
    }
}
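If recipe steps are generated by a script rather than written by hand, a step like the one above can be assembled programmatically. The following Python sketch uses a hypothetical helper, `tokenization_step`, that is not part of any SDK. It assumes that customStopWords is passed as a JSON-encoded list inside a string (as the "[]" value in the example suggests) and simply omits expandContractions when contraction expansion is not wanted.

```python
import json

# The two stemming modes named in the parameter list.
STEMMING_MODES = {"PORTER", "LANCASTER"}


def tokenization_step(source_column, target_column, delimiter=" ",
                      expand_contractions=True, stemming_mode="PORTER",
                      stop_word_removal_mode="DEFAULT", custom_stop_words=()):
    """Hypothetical helper that builds a TOKENIZATION step as a dict."""
    if stemming_mode not in STEMMING_MODES:
        raise ValueError(f"stemming_mode must be one of {sorted(STEMMING_MODES)}")
    parameters = {
        # Assumption: the stop-word list is serialized as a JSON string.
        "customStopWords": json.dumps(list(custom_stop_words)),
        "delimiter": delimiter,
        "sourceColumn": source_column,
        "stemmingMode": stemming_mode,
        "stopWordRemovalMode": stop_word_removal_mode,
        "targetColumn": target_column,
    }
    if expand_contractions:
        parameters["expandContractions"] = "ENABLED"
    return {"Action": {"Operation": "TOKENIZATION", "Parameters": parameters}}


step = tokenization_step("dimensions", "dimensions_tokenized", delimiter="- ")
print(json.dumps(step, indent=4, sort_keys=True))
```

Called with the arguments shown, the helper produces the same keys and values as the example step; all parameter values are strings, matching the example.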