Querying with user defined functions
User Defined Functions (UDF) in Amazon Athena allow you to create custom functions to process records or groups of records. A UDF accepts parameters, performs work, and then returns a result.
To use a UDF in Athena, you write a USING EXTERNAL FUNCTION
clause before a
SELECT
statement in a SQL query. The SELECT
statement
references the UDF and defines the variables that are passed to the UDF when the query runs.
The SQL query invokes a Lambda function using the Java runtime when it calls the UDF. UDFs
are defined within the Lambda function as methods in a Java deployment package. Multiple UDFs
can be defined in the same Java deployment package for a Lambda function. You also specify
the name of the Lambda function in the USING EXTERNAL FUNCTION
clause.
You have two options for deploying a Lambda function for Athena UDFs. You can deploy the function directly using Lambda, or you can use the AWS Serverless Application Repository. To find existing Lambda functions for UDFs, you can search the public AWS Serverless Application Repository or your private repository and then deploy to Lambda. You can also create or modify Java source code, package it into a JAR file, and deploy it using Lambda or the AWS Serverless Application Repository. For example Java source code and packages to get you started, see Creating and deploying a UDF using Lambda. For more information about Lambda, see AWS Lambda Developer Guide. For more information about AWS Serverless Application Repository, see the AWS Serverless Application Repository Developer Guide.
For an example that uses UDFs with Athena to translate and analyze
text, see the AWS Machine Learning Blog article Translate and analyze text using SQL functions with Amazon Athena, Amazon Translate, and
Amazon Comprehend
For an example of using UDFs to extend geospatial queries in Amazon Athena, see Extend geospatial queries in Amazon Athena with UDFs and AWS Lambda
Considerations and limitations
-
Built-in Athena functions – Built-in functions in Athena are designed to be highly performant. We recommend that you use built-in functions over UDFs when possible. For more information about built-in functions, see Functions in Amazon Athena.
-
Scalar UDFs only – Athena only supports scalar UDFs, which process one row at a time and return a single column value. Athena passes a batch of rows, potentially in parallel, to the UDF each time it invokes Lambda. When designing UDFs and queries, be mindful of the potential impact to network traffic of this processing.
-
UDF handler functions use abbreviated format – Use abbreviated format (not full format), for your UDF functions (for example,
package.Class
instead ofpackage.Class::method
). -
UDF methods must be lowercase – UDF methods must be in lowercase; camel case is not permitted.
-
UDF methods require parameters – UDF methods must have at least one input parameter. Attempting to invoke a UDF defined without input parameters causes a runtime exception. UDFs are meant to perform functions against data records, but a UDF without arguments takes in no data, so an exception occurs.
-
Java runtime support – Currently, Athena UDFs support the Java 8 and Java 11 runtimes for Lambda. For more information, see Building Lambda functions with Java in the AWS Lambda Developer Guide.
-
IAM permissions – To run and create UDF query statements in Athena, the IAM principal running the query must be allowed to perform actions in addition to Athena functions. For more information, see Example IAM permissions policies to allow Amazon Athena User Defined Functions (UDF).
-
Lambda quotas – Lambda quotas apply to UDFs. For more information, see Lambda quotas in the AWS Lambda Developer Guide.
-
Row-level filtering – Lake Formation row-level filtering is not supported for UDFs.
-
Views – You cannot use views with UDFs.
-
Known issues – For the most up-to-date list of known issues, see Limitations and issues
in the awslabs/aws-athena-query-federation section of GitHub.
Videos
Watch the following videos to learn more about using UDFs in Athena.
Video: Introducing User Defined Functions (UDFs) in Amazon Athena
The following video shows how you can use UDFs in Amazon Athena to redact sensitive information.
Note
The syntax in this video is prerelease, but the concepts are the same. Use Athena
without the AmazonAthenaPreviewFunctionality
workgroup.
Video: Translate, analyze, and redact text fields using SQL queries in Amazon Athena
The following video shows how you can use UDFs in Amazon Athena together with other AWS services to translate and analyze text.
Note
The syntax in this video is prerelease, but the concepts are the same. For the
correct syntax, see the related blog post Translate, redact, and analyze text using SQL functions with Amazon Athena, Amazon Translate,
and Amazon Comprehend
UDF query syntax
The USING EXTERNAL FUNCTION
clause specifies a UDF or multiple UDFs that
can be referenced by a subsequent SELECT
statement in the query. You need
the method name for the UDF and the name of the Lambda function that hosts the UDF. In
place of the Lambda function name, you can use the Lambda ARN. In cross-account scenarios,
the Lambda ARN is required.
Synopsis
USING EXTERNAL FUNCTION UDF_name
(variable1
data_type
[, variable2
data_type
][,...])
RETURNS data_type
LAMBDA 'lambda_function_name_or_ARN
'
[, EXTERNAL FUNCTION UDF_name2
(variable1
data_type
[, variable2
data_type
][,...])
RETURNS data_type
LAMBDA 'lambda_function_name_or_ARN
'[,...]]
SELECT [...] UDF_name
(expression
) [, UDF_name2
(expression
)] [...]
Parameters
- USING EXTERNAL FUNCTION
UDF_name
(variable1
data_type
[,variable2
data_type
][,...]) -
UDF_name
specifies the name of the UDF, which must correspond to a Java method within the referenced Lambda function. Eachvariable data_type
specifies a named variable and its corresponding data type that the UDF accepts as input. Thedata_type
must be one of the supported Athena data types listed in the following table and map to the corresponding Java data type.Athena data type Java data type TIMESTAMP
java.time.LocalDateTime (UTC)
DATE
java.time.LocalDate (UTC)
TINYINT
java.lang.Byte
SMALLINT
java.lang.Short
REAL
java.lang.Float
DOUBLE
java.lang.Double
DECIMAL (see
RETURNS
note)java.math.BigDecimal
BIGINT
java.lang.Long
INTEGER
java.lang.Int
VARCHAR
java.lang.String
VARBINARY
byte[]
BOOLEAN
java.lang.Boolean
ARRAY
java.util.List
ROW
java.util.Map<String, Object>
- RETURNS
data_type
-
data_type
specifies the SQL data type that the UDF returns as output. Athena data types listed in the table above are supported. For theDECIMAL
data type, use the syntaxRETURNS DECIMAL(
whereprecision
,scale
)precision
andscale
are integers. - LAMBDA '
lambda_function
' -
lambda_function
specifies the name of the Lambda function to be invoked when running the UDF. - SELECT [...]
UDF_name
(expression
) [...] -
The
SELECT
query that passes values to the UDF and returns a result.UDF_name
specifies the UDF to use, followed by anexpression
that is evaluated to pass values. Values that are passed and returned must match the corresponding data types specified for the UDF in theUSING EXTERNAL FUNCTION
clause.
Examples
For example queries based on the AthenaUDFHandler.java
Creating and deploying a UDF using Lambda
To create a custom UDF, you create a new Java class by extending the
UserDefinedFunctionHandler
class. The source code for the UserDefinedFunctionHandler.java
The steps in this section demonstrate writing and building a custom UDF Jar file using
Apache Maven
Steps to Create a Custom UDF for Athena Using Maven
Clone the SDK and prepare your development environment
Before you begin, make sure that git is installed on your system using sudo
yum install git -y
.
To install the AWS query federation SDK
-
Enter the following at the command line to clone the SDK repository. This repository includes the SDK, examples and a suite of data source connectors. For more information about data source connectors, see Using Amazon Athena Federated Query.
git clone https://github.com/awslabs/aws-athena-query-federation.git
To install prerequisites for this procedure
If you are working on a development machine that already has Apache Maven, the AWS CLI, and the AWS Serverless Application Model build tool installed, you can skip this step.
-
From the root of the
aws-athena-query-federation
directory that you created when you cloned, run the prepare_dev_env.shscript that prepares your development environment. -
Update your shell to source new variables created by the installation process or restart your terminal session.
source ~/.profile
Important
If you skip this step, you will get errors later about the AWS CLI or AWS SAM build tool not being able to publish your Lambda function.
Create your Maven project
Run the following command to create your Maven project. Replace
groupId
with the unique ID of your organization, and
replace my-athena-udf
with the name of your application
For more information, see How do I make my first Maven project?
mvn -B archetype:generate \ -DarchetypeGroupId=org.apache.maven.archetypes \ -DgroupId=
groupId
\ -DartifactId=my-athena-udfs
Add dependencies and plugins to your Maven project
Add the following configurations to your Maven project pom.xml
file.
For an example, see the pom.xml
<properties> <aws-athena-federation-sdk.version>2021.6.1</aws-athena-federation-sdk.version> </properties> <dependencies> <dependency> <groupId>com.amazonaws</groupId> <artifactId>aws-athena-federation-sdk</artifactId> <version>${aws-athena-federation-sdk.version}</version> </dependency> </dependencies> <build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <version>3.2.1</version> <configuration> <createDependencyReducedPom>false</createDependencyReducedPom> <filters> <filter> <artifact>*:*</artifact> <excludes> <exclude>META-INF/*.SF</exclude> <exclude>META-INF/*.DSA</exclude> <exclude>META-INF/*.RSA</exclude> </excludes> </filter> </filters> </configuration> <executions> <execution> <phase>package</phase> <goals> <goal>shade</goal> </goals> </execution> </executions> </plugin> </plugins> </build>
Write Java code for the UDFs
Create a new class by extending UserDefinedFunctionHandler.java
In the following example, two Java methods for UDFs, compress()
and
decompress()
, are created inside the class
MyUserDefinedFunctions
.
*package *com.mycompany.athena.udfs; public class MyUserDefinedFunctions extends UserDefinedFunctionHandler { private static final String SOURCE_TYPE = "MyCompany"; public MyUserDefinedFunctions() { super(SOURCE_TYPE); } /** * Compresses a valid UTF-8 String using the zlib compression library. * Encodes bytes with Base64 encoding scheme. * * @param input the String to be compressed * @return the compressed String */ public String compress(String input) { byte[] inputBytes = input.getBytes(StandardCharsets.UTF_8); // create compressor Deflater compressor = new Deflater(); compressor.setInput(inputBytes); compressor.finish(); // compress bytes to output stream byte[] buffer = new byte[4096]; ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(inputBytes.length); while (!compressor.finished()) { int bytes = compressor.deflate(buffer); byteArrayOutputStream.write(buffer, 0, bytes); } try { byteArrayOutputStream.close(); } catch (IOException e) { throw new RuntimeException("Failed to close ByteArrayOutputStream", e); } // return encoded string byte[] compressedBytes = byteArrayOutputStream.toByteArray(); return Base64.getEncoder().encodeToString(compressedBytes); } /** * Decompresses a valid String that has been compressed using the zlib compression library. * Decodes bytes with Base64 decoding scheme. * * @param input the String to be decompressed * @return the decompressed String */ public String decompress(String input) { byte[] inputBytes = Base64.getDecoder().decode((input)); // create decompressor Inflater decompressor = new Inflater(); decompressor.setInput(inputBytes, 0, inputBytes.length); // decompress bytes to output stream byte[] buffer = new byte[4096]; ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(inputBytes.length); try { while (!decompressor.finished()) { int bytes = decompressor.inflate(buffer); if (bytes == 0 && decompressor.needsInput()) { throw new DataFormatException("Input is truncated"); } byteArrayOutputStream.write(buffer, 0, bytes); } } catch (DataFormatException e) { throw new RuntimeException("Failed to decompress string", e); } try { byteArrayOutputStream.close(); } catch (IOException e) { throw new RuntimeException("Failed to close ByteArrayOutputStream", e); } // return decoded string byte[] decompressedBytes = byteArrayOutputStream.toByteArray(); return new String(decompressedBytes, StandardCharsets.UTF_8); } }
Build the JAR file
Run mvn clean install
to build your project. After it successfully
builds, a JAR file is created in the target
folder of your project
named
,
where artifactId
-version
.jarartifactId
is the name you provided in the Maven
project, for example, my-athena-udfs
.
Deploy the JAR to AWS Lambda
You have two options to deploy your code to Lambda:
-
Deploy Using AWS Serverless Application Repository (Recommended)
-
Create a Lambda Function from the JAR file
Option 1: Deploying to the AWS Serverless Application Repository
When you deploy your JAR file to the AWS Serverless Application Repository, you create an AWS SAM template YAML
file that represents the architecture of your application. You then specify this
YAML file and an Amazon S3 bucket where artifacts for your application are uploaded
and made available to the AWS Serverless Application Repository. The procedure below uses the publish.shathena-query-federation/tools
directory of the Athena Query
Federation SDK that you cloned earlier.
For more information and requirements, see Publishing applications in the AWS Serverless Application Repository Developer Guide, AWS SAM template concepts in the AWS Serverless Application Model Developer Guide, and Publishing serverless applications using the AWS SAM CLI.
The following example demonstrates parameters in a YAML file. Add similar
parameters to your YAML file and save it in your project directory. See athena-udf.yaml
Transform: 'AWS::Serverless-2016-10-31' Metadata: 'AWS::ServerlessRepo::Application': Name:
MyApplicationName
Description: 'The description I write for my application
' Author: 'Author Name
' Labels: - athena-federation SemanticVersion: 1.0.0 Parameters: LambdaFunctionName: Description: 'The name of the Lambda function that will contain your UDFs.
' Type: String LambdaTimeout: Description: 'Maximum Lambda invocation runtime in seconds. (min 1 - 900 max)' Default: 900 Type: Number LambdaMemory: Description: 'Lambda memory in MB (min 128 - 3008 max).' Default: 3008 Type: Number Resources: ConnectorConfig: Type: 'AWS::Serverless::Function' Properties: FunctionName: !Ref LambdaFunctionName Handler: "full.path.to.your.handler. For example, com.amazonaws.athena.connectors.udfs.MyUDFHandler
" CodeUri: "Relative path to your JAR file. For example, ./target/athena-udfs-1.0.jar
" Description: "My description of the UDFs that this Lambda function enables.
" Runtime: java8 Timeout: !Ref LambdaTimeout MemorySize: !Ref LambdaMemory
Copy the publish.sh
script to the project directory where you
saved your YAML file, and run the following command:
./publish.sh
MyS3Location
MyYamlFile
For example, if your bucket location is
s3://mybucket/mysarapps/athenaudf
and your YAML file was saved
as my-athena-udfs.yaml
:
./publish.sh mybucket/mysarapps/athenaudf my-athena-udfs
To create a Lambda function
-
Open the Lambda console at https://console.aws.amazon.com/lambda/
, choose Create function, and then choose Browse serverless app repository -
Choose Private applications, find your application in the list, or search for it using key words, and select it.
-
Review and provide application details, and then choose Deploy.
You can now use the method names defined in your Lambda function JAR file as UDFs in Athena.
Option 2: Creating a Lambda function directly
You can also create a Lambda function directly using the console or AWS CLI. The
following example demonstrates using the Lambda create-function
CLI
command.
aws lambda create-function \ --function-name
MyLambdaFunctionName
\ --runtime java8 \ --role arn:aws:iam::1234567890123:role/my_lambda_role
\ --handlercom.mycompany.athena.udfs.MyUserDefinedFunctions
\ --timeout 900 \ --zip-file fileb://./target/my-athena-udfs-1.0-SNAPSHOT.jar