pyspark.sql.table_arg.TableArg.partitionBy

TableArg.partitionBy(*cols)

Partitions the data based on the specified columns.

This method partitions the table argument's data by the specified columns. It must be called before orderBy() and cannot be called after withSinglePartition().

Parameters
cols : str, Column, or list

Column names or Column objects to partition by.

Returns
TableArg

A new TableArg instance with partitioning applied.

Examples

>>> from pyspark.sql.functions import udtf
>>>
>>> @udtf(returnType="key: int, value: string")
... class ProcessUDTF:
...     def eval(self, row):
...         yield row["key"], row["value"]
...
>>> df = spark.createDataFrame(
...     [(1, "a"), (1, "b"), (2, "c"), (2, "d")], ["key", "value"]
... )
>>>
>>> # Partition by a single column
>>> result = ProcessUDTF(df.asTable().partitionBy("key"))
>>> result.show()
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  1|    b|
|  2|    c|
|  2|    d|
+---+-----+
>>>
>>> # Partition by multiple columns
>>> df2 = spark.createDataFrame(
...     [(1, "x", "v1"), (1, "x", "v2"), (2, "y", "v3")], ["key", "category", "value"]
... )
>>> result2 = ProcessUDTF(df2.asTable().partitionBy("key", "category"))
>>> result2.show()
+---+-----+
|key|value|
+---+-----+
|  1|   v1|
|  1|   v2|
|  2|   v3|
+---+-----+