RemoveDuplicates class - AWS Glue

RemoveDuplicates class

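The RemoveDuplicates transform deletes an entire row when the value in the selected source column duplicates a value found in another row, so that each distinct value in that column appears in only one row of the output.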

Example

    from pyspark.context import SparkContext
    from pyspark.sql import SparkSession
    from awsgluedi.transforms import *

    sc = SparkContext()
    spark = SparkSession(sc)

    # Sample data: source_column_1 contains the repeated values 13.12 and None.
    input_df = spark.createDataFrame(
        [
            (105.111, 13.12),
            (13.12, 13.12),
            (None, 13.12),
            (13.12, 13.12),
            (None, 13.12),
        ],
        ["source_column_1", "source_column_2"],
    )

    try:
        # Remove rows whose source_column_1 value is a duplicate.
        df_output = data_quality.RemoveDuplicates.apply(
            data_frame=input_df, spark_context=sc, source_column="source_column_1"
        )
    except:
        print("Unexpected Error happened ")
        raise

Output

The output is a PySpark DataFrame with duplicates removed based on the `source_column_1` column. The resulting `df_output` DataFrame contains the following rows:

```
+---------------+---------------+
|source_column_1|source_column_2|
+---------------+---------------+
|        105.111|          13.12|
|          13.12|          13.12|
|           null|          13.12|
+---------------+---------------+
```

Note that the rows with source_column_1 values of `13.12` and `null` appear only once in the output DataFrame, as the duplicates have been removed based on the source_column_1 column.
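The behavior described above can be illustrated without Spark. The following is a minimal sketch in plain Python, not the transform's actual implementation. The helper `remove_duplicates` is hypothetical, rows are modeled as dictionaries, and the sketch assumes that the first row seen for each value is the one retained, which this page does not state. As in the example output, `None` is treated as an ordinary value that can be duplicated.

```python
def remove_duplicates(rows, source_column):
    """Hypothetical helper: keep one row per distinct value of source_column."""
    seen = set()
    kept = []
    for row in rows:
        value = row[source_column]
        if value in seen:
            continue  # duplicate value: drop the entire row
        seen.add(value)
        kept.append(row)  # assumption: the first occurrence is the one kept
    return kept


rows = [
    {"source_column_1": 105.111, "source_column_2": 13.12},
    {"source_column_1": 13.12, "source_column_2": 13.12},
    {"source_column_1": None, "source_column_2": 13.12},
    {"source_column_1": 13.12, "source_column_2": 13.12},
    {"source_column_1": None, "source_column_2": 13.12},
]

result = remove_duplicates(rows, "source_column_1")
for row in result:
    print(row)
```

Three rows remain, one each for `105.111`, `13.12`, and `None`, matching the row count of the DataFrame output shown above.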

Methods

__call__(spark_context, data_frame, source_column)

Deletes an entire row from the DataFrame when the value in the selected source column duplicates a value found in another row.

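  • spark_context – The Spark context, passed as `sc` in the example above.
  • data_frame – The PySpark DataFrame to process, passed as `input_df` in the example above.
  • source_column – The name of an existing column.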

apply(cls, *args, **kwargs)

Inherited from GlueTransform apply.
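The example on this page calls `apply` on the class rather than creating an instance and calling it. A minimal sketch of that pattern follows, assuming `apply` creates an instance and forwards its arguments to `__call__`. `TransformBase` and `DropRepeats` are hypothetical stand-ins written for illustration, not the real GlueTransform or RemoveDuplicates classes, and they operate on a list of dictionaries rather than a DataFrame.

```python
class TransformBase:
    """Hypothetical stand-in for a transform base class."""

    @classmethod
    def apply(cls, *args, **kwargs):
        # Assumed behavior: build an instance, then invoke its __call__.
        return cls()(*args, **kwargs)


class DropRepeats(TransformBase):
    """Hypothetical transform: keep one row per value of source_column."""

    def __call__(self, rows, source_column):
        first_by_value = {}
        for row in rows:
            # setdefault stores a row only for values not seen before.
            first_by_value.setdefault(row[source_column], row)
        return list(first_by_value.values())


rows = [{"a": 1}, {"a": 1}, {"a": 2}]

via_apply = DropRepeats.apply(rows, source_column="a")
via_call = DropRepeats()(rows, source_column="a")
print(via_apply == via_call)
```

Under this assumption, the class-level `apply` call and a direct call on an instance are interchangeable, which is why the example never constructs a RemoveDuplicates object explicitly.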

name(cls)

Inherited from GlueTransform name.

describeArgs(cls)

Inherited from GlueTransform describeArgs.

describeReturn(cls)

Inherited from GlueTransform describeReturn.

describeTransform(cls)

Inherited from GlueTransform describeTransform.

describeErrors(cls)

Inherited from GlueTransform describeErrors.

describe(cls)

Inherited from GlueTransform describe.