Skip to main content

Deduplicate

Removes rows with duplicate values of specified columns.

Parameters

ParameterDescriptionRequired
DataframeInput dataframeTrue
Row to keep- Any: Keeps any one row among duplicates. Uses underlying dropDuplicates construct
- First: Keeps first occurrence of the duplicate row
- Last: Keeps last occurrence of the duplicate row
- Unique Only: Keeps rows that don't have duplicates
Default is Any
True
Deduplicate columnsColumns to consider while removing duplicate rowsTrue
Order columnsColumns to sort dataframe on before de-duping in case of First and Last rows to keepFalse

Examples


Rows to keep - Any

Example usage of Deduplicate

def dedup(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0.dropDuplicates(["tran_id"])

Rows to keep - First

Example usage of Deduplicate - First

def earliest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn(
"row_number",
row_number()\
.over(Window\
.partitionBy("customer_id")\
.orderBy(col("order_dt").asc())\
.rowsBetween(Window.unboundedPreceding, Window.currentRow))
)\
.filter(col("row_number") == lit(1))\
.drop("row_number")

Rows to keep - Last

Example usage of Deduplicate - Last

def latest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn(
"row_number",
row_number()\
.over(Window\
.partitionBy("customer_id")\
.orderBy(col("order_dt").asc())\
.rowsBetween(Window.unboundedPreceding, Window.currentRow))
)\
.withColumn(
"count",
count("*")\
.over(Window\
.partitionBy("customer_id")\
.orderBy(col("order_dt").asc())\
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
)\
.filter(col("row_number") == col("count"))\
.drop("row_number")\
.drop("count")

Rows to keep - Unique Only

Example usage of Deduplicate - Unique

def single_order_customers(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn(
"count",
count("*")\
.over(Window\
.partitionBy("customer_id")\
.orderBy(col("order_dt").asc())\
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
)\
.filter(col("count") == lit(1))\
.drop("count")