hadoop - Apache Spark on EMR 10 node cluster for 150TB of data not completing


I have an S3 bucket from which I take data and save it to HDFS on a different EMR cluster. I then read these files stored on HDFS with Apache Spark, perform joins and data filtering, and save the resultant dataset of around 150 TB in CSV format back to HDFS. The operation is taking forever. I am using 64 executors, executor memory of 120 GB, and driver memory of 100 GB. I am using the Databricks spark-csv package to save the data as CSV.
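Roughly, the job looks like the sketch below; the HDFS paths, column names, and join key are placeholders, and the real job has more joins and filters:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JoinAndSave {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("JoinFilterSaveCsv")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Read the input files previously copied from S3 to HDFS.
        // "hdfs:///data/left" and "hdfs:///data/right" are placeholder paths.
        val left = sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("hdfs:///data/left")
        val right = sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("hdfs:///data/right")

        // Join and filter; "id" and "status" are placeholder column names.
        val result = left.join(right, left("id") === right("id"))
          .filter(left("status") === "ACTIVE")

        // Save the ~150 TB result back to HDFS as CSV via spark-csv.
        result.write.format("com.databricks.spark.csv")
          .option("header", "true")
          .save("hdfs:///output/result")
      }
    }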

The rest of the Hadoop settings of the EMR cluster running Spark are left at their defaults.
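For reference, the submit command is along these lines (the class name, jar name, and spark-csv version are placeholders; everything else is left at EMR defaults):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 64 \
      --executor-memory 120G \
      --driver-memory 100G \
      --packages com.databricks:spark-csv_2.10:1.5.0 \
      --class com.example.JoinAndSave \
      my-job.jar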

While running spark-submit, I get the following error:

ERROR YarnScheduler: Lost executor 5 on ip-xx-xx-xx.ec2.internal: Container marked as failed: container_14687884542720_0157_01_000006 on host: ip-xx-xx-xx.ec2.internal. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

Kindly point me in the right direction to fix this.

