pyspark.sql.table_arg.TableArg.partitionBy

TableArg.partitionBy(*cols)

Partitions the data based on the specified columns.

This method partitions the table argument's data by the specified columns. It must be called before orderBy() and cannot be called after withSinglePartition().

Parameters
cols : str, Column, or list

Column names or Column objects to partition by.

Returns
TableArg

A new TableArg instance with partitioning applied.

Examples

>>> from pyspark.sql.functions import udtf
>>>
>>> @udtf(returnType="key: int, value: string")
... class ProcessUDTF:
...     def eval(self, row):
...         yield row["key"], row["value"]
...
>>> df = spark.createDataFrame(
...     [(1, "a"), (1, "b"), (2, "c"), (2, "d")], ["key", "value"]
... )
>>>
>>> # Partition by a single column
>>> result = ProcessUDTF(df.asTable().partitionBy("key"))
>>> result.show()
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  1|    b|
|  2|    c|
|  2|    d|
+---+-----+
>>>
>>> # Partition by multiple columns
>>> df2 = spark.createDataFrame(
...     [(1, "x", "v1"), (1, "x", "v2"), (2, "y", "v3")], ["key", "category", "value"]
... )
>>> result2 = ProcessUDTF(df2.asTable().partitionBy("key", "category"))
>>> result2.show()
+---+-----+
|key|value|
+---+-----+
|  1|   v1|
|  1|   v2|
|  2|   v3|
+---+-----+