pyspark - Spark randomly drop rows
I'm testing a classifier on missing data and want to randomly delete rows in Spark. Alternatively, I'd like to delete every nth row, for example one row out of every 20.
What is the best way to do this?
If it should be random, you can use the sample method, which lets you take an approximate fraction of the DataFrame. However, if the idea is to split your data into training and validation sets, you can use randomSplit.
Another, less elegant option is to convert the DataFrame to an RDD and use zipWithIndex to filter by index, maybe like this:
df.rdd.zipwithindex().filter(lambda x: x[-1] % 20 != 0)