pyspark - Applying a User Function to an Entire Spark DataFrame Column
Spark DataFrame schema:

    In [177]: testtbl.printSchema()
    root
     |-- date: long (nullable = true)
     |-- close: double (nullable = true)
     |-- volume: double (nullable = true)
I wish to apply a scalar-valued function to a column of testtbl. Suppose, for example, I wish to calculate the average of the 'close' column. With an RDD I would do something like

    rdd.fold(0, lambda x,y: x+y)

but testtbl.close is not an RDD; it is a column object with limited functionality. The rows of testtbl are RDDs, the columns are not. How do I apply add, or a user function, to a single column? For concreteness, the RDD-style reduction I have in mind is sketched below.
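A minimal sketch of that reduction on a plain RDD of numbers (a hypothetical rdd, not the actual data from the post):

    rdd = sc.parallelize([1.0, 2.0, 3.0])
    total = rdd.fold(0.0, lambda x, y: x + y)   # sum via a user-supplied combining function
    average = total / rdd.count()               # 2.0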
If you want to apply a function to an entire column, you have to execute an aggregation operation on that column. For instance, imagine you want to compute the sum of the values in a column called values. Even though df does not hold aggregated data, it is valid to apply aggregate functions to DataFrames:
    from pyspark.sql.functions import *

    df = sc.parallelize([(1,), (2,), (3,)]).toDF(["values"])
    df.agg(sum("values").alias("sum")).show()
    +---+
    |sum|
    +---+
    |  6|
    +---+
You can find an example in PySpark's aggregation documentation.
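Applied back to the original question, the same pattern gives the average of the 'close' column; a minimal sketch, assuming testtbl from the question above:

    from pyspark.sql.functions import avg

    testtbl.agg(avg("close").alias("avg_close")).show()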
For the second part of your question: you can create a user-defined aggregate function, although, if I'm right, that is only applicable from Scala.
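If a built-in aggregate is not enough, one workaround in PySpark is to drop down to the RDD for that single column and fold a user function over it; a minimal sketch, again assuming testtbl from the question:

    # Select the single column and unwrap each Row into a plain value
    close_rdd = testtbl.select("close").rdd.map(lambda row: row[0])
    # Apply a user-supplied combining function
    total = close_rdd.fold(0.0, lambda x, y: x + y)
    average = total / close_rdd.count()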