Repartition
Repartitions or coalesces the input dataframe based on the configuration. The following types of configuration can be given:
Hash Repartitioning
Repartitions the data evenly across partitions based on the hash of the key expression(s). Reshuffles the dataset.
Parameters
| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Overwrite default partitions flag | Whether to overwrite the default number of partitions | False |
| Number of partitions | Integer value specifying the number of partitions | False |
| Repartition expression(s) | List of expressions to repartition by | True |
Generated Code
- Python

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

def hashRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5, col("customer_id"))
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object hashRepartition {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartition(5, col("customer_id"))
}
```
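When the overwrite flag is disabled and no partition count is given, the generated call presumably omits the count, in which case Spark falls back to its `spark.sql.shuffle.partitions` setting (200 by default). A minimal sketch of that variant (the function name `hashRepartitionDefault` is illustrative, not tool output):

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

def hashRepartitionDefault(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # No explicit count: Spark uses spark.sql.shuffle.partitions (default 200)
    return in0.repartition(col("customer_id"))
```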
Random Repartitioning
Repartitions the data without a partitioning expression: rows are redistributed round-robin across the given number of partitions. Reshuffles the dataset.
Parameters
| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Number of partitions | Integer value specifying the number of partitions | True |
Generated Code
- Python

```python
from pyspark.sql import SparkSession, DataFrame

def randomRepartition(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartition(5)
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object randomRepartition {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartition(5)
}
```
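To confirm the effect, the resulting partition count can be inspected via the underlying RDD. A small sketch (the input DataFrame is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
df = spark.range(1000)  # illustrative input

print(df.rdd.getNumPartitions())                 # depends on cluster defaults
print(df.repartition(5).rdd.getNumPartitions())  # 5
```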
Range Repartitioning
Repartitions the data so that rows whose keys fall within the same range end up in the same partition. Reshuffles the dataset.
Parameters
| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Overwrite default partitions flag | Whether to overwrite the default number of partitions | False |
| Number of partitions | Integer value specifying the number of partitions | False |
| Repartition expression(s) with sorting | List of expressions to repartition by, each with a sort order | True |
Generated Code
- Python

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

def RepartitionByRange(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.repartitionByRange(5, col("customer_id").asc())
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RepartitionByRange {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.repartitionByRange(5, col("customer_id").asc)
}
```
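Range repartitioning also accepts multiple expressions with mixed sort orders; Spark samples the data internally to estimate the range boundaries. A minimal sketch, assuming illustrative column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("range-repartition-demo").getOrCreate()
df = spark.createDataFrame(
    [("US", 1), ("US", 2), ("DE", 3), ("DE", 4)],
    ["country", "customer_id"],
)

# Rows whose keys fall in the same range land in the same partition
out = df.repartitionByRange(2, col("country").asc(), col("customer_id").desc())
print(out.rdd.getNumPartitions())  # 2
```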
Coalesce
Reduces the number of partitions without shuffling the dataset.
Parameters
| Parameter | Description | Required |
|---|---|---|
| Dataframe | Input dataframe | True |
| Number of partitions | Integer value specifying the number of partitions | True |
Generated Code
- Python

```python
from pyspark.sql import SparkSession, DataFrame

def Coalesce(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.coalesce(5)
```

- Scala

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Coalesce {
  def apply(spark: SparkSession, in: DataFrame): DataFrame =
    in.coalesce(5)
}
```
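Because coalesce only merges existing partitions, it cannot increase the partition count: asking for more partitions than the input has leaves the count unchanged. A short sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()
df = spark.range(1000).repartition(10)

print(df.coalesce(5).rdd.getNumPartitions())   # 5: partitions merged, no shuffle
print(df.coalesce(50).rdd.getNumPartitions())  # 10: coalesce cannot add partitions
```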