pyspark.sql.table_arg.TableArg.partitionBy
- TableArg.partitionBy(*cols)
Partitions the data based on the specified columns.
This method partitions the table argument data by the specified columns. It must be called before orderBy(), and it cannot be used once withSinglePartition() has been called.
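For example, a minimal sketch of a valid call order, assuming a DataFrame df and a UDTF ProcessUDTF like those defined in the Examples below:
>>> # partitionBy() first, then orderBy() on the partitioned TableArg
>>> arg = df.asTable().partitionBy("key").orderBy("value")
>>> result = ProcessUDTF(arg)
>>> # Not allowed: partitionBy() after withSinglePartition()
>>> # df.asTable().withSinglePartition().partitionBy("key")  # expected to raise an error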
- Parameters
- cols : str, Column, or list
Column names or Column objects to partition by (a sketch using Column objects follows the Examples).
- Returns
TableArg
A new TableArg instance with partitioning applied.
Examples
>>> from pyspark.sql.functions import udtf
>>>
>>> @udtf(returnType="key: int, value: string")
... class ProcessUDTF:
...     def eval(self, row):
...         yield row["key"], row["value"]
...
>>> df = spark.createDataFrame(
...     [(1, "a"), (1, "b"), (2, "c"), (2, "d")], ["key", "value"]
... )
>>>
>>> # Partition by a single column
>>> result = ProcessUDTF(df.asTable().partitionBy("key"))
>>> result.show()
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  1|    b|
|  2|    c|
|  2|    d|
+---+-----+
>>>
>>> # Partition by multiple columns
>>> df2 = spark.createDataFrame(
...     [(1, "x", "10"), (1, "x", "20"), (2, "y", "30")],
...     ["key", "category", "value"]
... )
>>> result2 = ProcessUDTF(df2.asTable().partitionBy("key", "category"))
>>> result2.show()
+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  1|   20|
|  2|   30|
+---+-----+
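As noted in Parameters, Column objects are accepted as well as column names. As a rough sketch (the result3 name is illustrative), partitioning the df from the first example by a Column object is equivalent to partitioning by its name:
>>> from pyspark.sql.functions import col
>>> # Same partitioning as partitionBy("key"), expressed with a Column object
>>> result3 = ProcessUDTF(df.asTable().partitionBy(col("key")))
>>> # result3 contains the same rows as result above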