pyspark - Randomly drop rows in Spark


I'm testing a classifier on missing data and want to randomly delete rows in Spark.

For example, I want to drop every nth row, or delete 20 rows at random.

What is the best way to do this?

If random deletion is fine, you can use the sample method, which takes a random fraction of the DataFrame. However, if the idea is to split the data into training and validation sets, you can use randomSplit instead.

Another, less elegant option is to convert the DataFrame to an RDD and use zipWithIndex to filter on the index, maybe something like:

df.rdd.zipWithIndex().filter(lambda x: x[-1] % 20 != 0)
