Spark interview questions for 2 years of experience

  1. What is WholeStageCodeGen in Spark?
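     Hint: whole-stage code generation fuses the operators of a stage into a single generated Java function; the covered operators are marked with a star and a codegen id in the physical plan. A minimal PySpark sketch, assuming a SparkSession named spark:
     df = spark.range(100).filter("id > 50").selectExpr("id * 2 AS doubled")
     df.explain()   # physical plan shows *(1) Project / *(1) Filter inside a single WholeStageCodegen stage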
  2. If a Spark job has been running for a long time (X hours), what are the possible reasons?
  3. What is Lineage Graph?
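     Hint: the lineage graph is the chain of parent RDDs and transformations that produced an RDD, and it can be inspected with toDebugString. A small PySpark sketch, assuming a SparkContext named sc:
     rdd = sc.parallelize(range(100)).map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)
     print(rdd.toDebugString())   # prints the chain of parent RDDs Spark would use to recompute lost partitions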
  4. What is the difference between groupByKey and reduceByKey?
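     Hint: reduceByKey combines values on each partition before the shuffle (map-side combine), while groupByKey ships every value across the network. A hypothetical word-count sketch in PySpark:
     pairs = sc.parallelize(["a", "b", "a", "c", "a"]).map(lambda w: (w, 1))
     counts = pairs.reduceByKey(lambda a, b: a + b)     # only partial sums are shuffled
     counts2 = pairs.groupByKey().mapValues(sum)        # every (word, 1) pair is shuffled, then summed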
  5. How can you create an RDD for a text file? 
     Ans: Use SparkContext.textFile.
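     For example, in PySpark (the path is a placeholder):
     lines = sc.textFile("hdfs:///data/sample.txt")   # one RDD element per line of the file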
  6. How would you hint at a minimum number of partitions for a transformation?
    Ans: You can request a minimum number of partitions by using the second input parameter of many transformations.
    scala> sc.parallelize(1 to 100, 2), where 2 is the requested number of partitions
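    The same hint in PySpark (paths are placeholders):
    rdd = sc.textFile("hdfs:///data/sample.txt", minPartitions=4)   # ask for at least 4 partitions
    rdd2 = sc.parallelize(range(100), 4)                            # split the collection into 4 partitions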
  7. You have an RDD storage level defined as MEMORY_ONLY_2; what does _2 mean?
    Ans: The _2 suffix in the name denotes 2 replicas, i.e. each cached partition is replicated on two cluster nodes.
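    For example, in PySpark:
    from pyspark import StorageLevel
    rdd = sc.parallelize(range(1000))
    rdd.persist(StorageLevel.MEMORY_ONLY_2)   # kept in memory and replicated on two executors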
  8. What is checkpointing?  
    Ans: Checkpointing is the process of truncating an RDD's lineage graph and saving the actual intermediate RDD data to a reliable distributed file system (e.g., HDFS) or the local file system. You mark an RDD for checkpointing by calling RDD.checkpoint(); the RDD is then saved to a file inside the checkpoint directory and all references to its parent RDDs are removed. checkpoint() has to be called before any job has been executed on the RDD.
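    A minimal PySpark sketch (the checkpoint directory is a placeholder):
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")           # must be set before checkpointing
    rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
    rdd.checkpoint()                                         # mark the RDD before any job runs on it
    rdd.count()                                              # the action materializes the checkpoint and truncates the lineage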
  9. What is the DAGScheduler and how does it work?
    Ans: DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling: after an RDD action has been called, it becomes a job that is then transformed into a set of stages submitted as TaskSets for execution. DAGScheduler uses an event-queue architecture in which a thread can post DAGSchedulerEvent events (e.g., a new job or stage being submitted) that DAGScheduler reads and processes sequentially.
  10. What is Hive on Spark?
    Ans: Hive on Spark means running Apache Hive with Spark as its execution engine instead of MapReduce. Hive, a component of Hadoop distributions such as Hortonworks' Data Platform (HDP), provides an SQL-like interface to the stored data, and Hive users keep the complete set of Hive's rich features, including any new features Hive might introduce in the future, while their queries run on Spark.
    The main task in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plan produced by the semantic analyzer is translated into a task plan that Spark can execute. It also covers query execution, where the generated Spark plan is actually executed on the Spark cluster.

Hard Questions

1. How do you handle skewed data in Spark?

Ans: Data skew means a few key values account for most of the rows, so a handful of tasks do most of the work while the rest sit idle. Common remedies are salting the skewed key (see the next question), broadcasting the smaller table so the skewed side is not shuffled, and enabling Adaptive Query Execution's skew-join handling in Spark 3.x, as sketched below.
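
For instance, on Spark 3.x the adaptive skew-join handling can be switched on like this (a sketch, assuming a SparkSession named spark):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")   # lets Spark split oversized join partitions at runtime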

2. What is the salting technique in Spark?

Ans: Salting adds a random suffix (the "salt") to a skewed join or grouping key so that rows sharing a hot key are spread across several partitions. The other side of a join is exploded over all possible salt values, and the salt is dropped after the join or aggregation.
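
A rough PySpark sketch of salting an aggregation key (df, department_id and the bucket count are hypothetical):

from pyspark.sql import functions as F

SALT_BUCKETS = 10
salted_counts = (df
    .withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    .groupBy("department_id", "salt").agg(F.count("*").alias("partial_cnt"))   # partial counts per salt bucket
    .groupBy("department_id").agg(F.sum("partial_cnt").alias("cnt")))          # merge the buckets into the final count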

3. Optimize the given Spark code.

df = sparkSession.read.parquet("path of the file")
df1 = df.groupBy("department_id").count()
df1 = df1.filter("department_id='101'")
df1.write.save("output path")
df3 = df1.join(df2, ['department_id'], 'inner')
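
A possible optimized version (a sketch; df2 is assumed to be another DataFrame keyed by department_id, and the paths are placeholders taken from the question):

from pyspark.sql import functions as F

df = sparkSession.read.parquet("path of the file")

# Filter before aggregating: the predicate is pushed down to the Parquet scan,
# so only department 101 is read, grouped and shuffled.
df1 = (df.filter("department_id = '101'")
         .groupBy("department_id")
         .count())

df1.cache()                     # df1 is reused (write + join), so avoid recomputing it
df1.write.save("output path")

# df1 holds at most one row here, so broadcast it to turn the shuffle join into a broadcast join.
df3 = df2.join(F.broadcast(df1), ["department_id"], "inner")

The points an interviewer usually looks for are filtering before the wide aggregation, caching a result that is used more than once, and broadcasting the small side of the join.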
