
Repartition

Repartitions or coalesces the input DataFrame based on the configuration.

The following types of configuration can be given:

Hash Repartitioning

Repartitions the data evenly across partitions based on a hash of the key column(s). Reshuffles the dataset.

Parameters

| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Overwrite default partitions flag | Flag to overwrite default partitions | False |
| Number of partitions | Integer value specifying the number of partitions | False |
| Repartition expression(s) | List of expressions to repartition by | True |

Generated Code

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

def hashRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5, col("customer_id"))
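To see what hash partitioning does, here is a minimal plain-Python sketch (no Spark required). Spark uses Murmur3 internally; `crc32` stands in as a deterministic hash. The function and variable names are illustrative, not part of the generated code.

```python
from zlib import crc32

def hash_partition(rows, key, num_partitions):
    # Each row goes to partition hash(key) % num_partitions, so rows with
    # the same key always land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        p = crc32(str(row[key]).encode()) % num_partitions
        partitions[p].append(row)
    return partitions

rows = [{"customer_id": i} for i in range(100)]
parts = hash_partition(rows, "customer_id", 5)
# Every row lands in exactly one of the 5 partitions, and the assignment
# is deterministic for a given key.
```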

Random Repartitioning

Repartitions the data without a partitioning expression, distributing rows round-robin across partitions. Reshuffles the dataset.

Parameters

| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Number of partitions | Integer value specifying the number of partitions | True |

Generated Code

from pyspark.sql import SparkSession, DataFrame

def randomRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5)
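A minimal sketch of round-robin distribution, which is the scheme Spark uses for `repartition(n)` with no columns. This is an illustration in plain Python, not Spark's implementation:

```python
def round_robin_partition(rows, num_partitions):
    # Rows are dealt out in turn, so partition sizes differ by at most one.
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

parts = round_robin_partition(list(range(13)), 5)
# → partition sizes [3, 3, 3, 2, 2]: as evenly balanced as 13 rows allow
```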

Range Repartitioning

Repartitions the data so that tuples with keys in the same range land in the same partition. Reshuffles the dataset.

Parameters

| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Overwrite default partitions flag | Flag to overwrite default partitions | False |
| Number of partitions | Integer value specifying the number of partitions | False |
| Repartition expression(s) with sorting | List of expressions to repartition by, with the corresponding sort order | True |

Generated Code

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

def RepartitionByRange(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartitionByRange(5, col("customer_id").asc())
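A minimal sketch of range partitioning in plain Python. Spark samples the data to choose the range boundaries; here they are fixed by hand to show the routing idea. The names are illustrative only:

```python
import bisect

def range_partition(rows, key, boundaries):
    # boundaries must be sorted; partition i holds keys in the i-th range,
    # so each partition covers a contiguous span of key values.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        p = bisect.bisect_right(boundaries, row[key])
        partitions[p].append(row)
    return partitions

rows = [{"customer_id": i} for i in (7, 42, 15, 3, 28)]
parts = range_partition(rows, "customer_id", [10, 20])
# → parts[0] holds keys <= 10, parts[1] keys 11..20, parts[2] keys > 20
```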

Coalesce

Reduces the number of partitions without shuffling the dataset.

Parameters

| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Number of partitions | Integer value specifying the number of partitions | True |

Generated Code

from pyspark.sql import SparkSession, DataFrame

def Coalesce(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.coalesce(5)
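A minimal sketch of why coalesce avoids a shuffle: instead of redistributing individual rows, it merges whole existing partitions into fewer ones (Spark groups parent partitions so data stays local to its executor). Plain-Python illustration with hypothetical names:

```python
def coalesce(partitions, target):
    # Contiguous groups of input partitions are concatenated; no row is
    # routed individually, unlike a full repartition.
    merged = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        merged[i * target // len(partitions)].extend(part)
    return merged

parts = [[1], [2], [3], [4], [5], [6]]
out = coalesce(parts, 2)
# → [[1, 2, 3], [4, 5, 6]]: six partitions merged into two
```

Note that coalesce can only reduce the partition count; increasing it requires a repartition (and hence a shuffle).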

Example