pyspark - Applying a User Function to an Entire Spark DataFrame Column
Spark DataFrame schema:

    In [177]: testtbl.printSchema()
    root
     |-- date: long (nullable = true)
     |-- close: double (nullable = true)
     |-- volume: double (nullable = true)
I wish to apply a scalar-valued function to a column of testtbl. Suppose, for example, I wish to calculate the average of the 'close' column. With an RDD I would do something like

    rdd.fold(0, lambda x,y: x+y)

but testtbl.close is not an RDD; it is a column object with limited functionality. The rows of testtbl are RDDs, the columns are not. How do I apply add, or a user function, to a single column? For concreteness, the RDD-style reduction I have in mind is sketched below.
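A minimal sketch of that reduction on a plain RDD of numbers (a hypothetical rdd, not the actual data from the post):

    rdd = sc.parallelize([1.0, 2.0, 3.0])
    total = rdd.fold(0.0, lambda x, y: x + y)   # sum via a user-supplied combining function
    average = total / rdd.count()               # 2.0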
If you want to apply a function to an entire column, you have to execute an aggregation operation on that column. For instance, imagine you want to compute the sum of the values in a column called values. Even though df does not hold aggregated data, it is valid to apply aggregate functions to DataFrames:
    from pyspark.sql.functions import *

    df = sc.parallelize([(1,), (2,), (3,)]).toDF(["values"])
    df.agg(sum("values").alias("sum")).show()
    +---+
    |sum|
    +---+
    |  6|
    +---+
You can find an example in PySpark's aggregation documentation.
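Applied back to the original question, the same pattern gives the average of the 'close' column; a minimal sketch, assuming testtbl from the question above:

    from pyspark.sql.functions import avg

    testtbl.agg(avg("close").alias("avg_close")).show()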
For the second part of your question: you can create a user-defined aggregate function, although, if I'm right, that is only applicable from Scala.
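If a built-in aggregate is not enough, one workaround in PySpark is to drop down to the RDD for that single column and fold a user function over it; a minimal sketch, again assuming testtbl from the question:

    # Select the single column and unwrap each Row into a plain value
    close_rdd = testtbl.select("close").rdd.map(lambda row: row[0])
    # Apply a user-supplied combining function
    total = close_rdd.fold(0.0, lambda x, y: x + y)
    average = total / close_rdd.count()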