Spark interview questions for 2 years of experience

  1. What is WholeStageCodeGen in Spark?
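     Hint: whole-stage code generation fuses the operators of a stage into a single generated Java function; the covered operators are marked with a star and a codegen id in the physical plan. A minimal PySpark sketch, assuming a SparkSession named spark:
     df = spark.range(100).filter("id > 50").selectExpr("id * 2 AS doubled")
     df.explain()   # physical plan shows *(1) Project / *(1) Filter inside a single WholeStageCodegen stage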
  2. If a Spark job has been running for a long time (X hours), what are the possible reasons?
  3. What is Lineage Graph?
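     Hint: the lineage graph is the chain of parent RDDs and transformations that produced an RDD, and it can be inspected with toDebugString. A small PySpark sketch, assuming a SparkContext named sc:
     rdd = sc.parallelize(range(100)).map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)
     print(rdd.toDebugString())   # prints the chain of parent RDDs Spark would use to recompute lost partitions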
  4. What is the difference between groupByKey and reduceByKey?
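     Hint: reduceByKey combines values on each partition before the shuffle (map-side combine), while groupByKey ships every value across the network. A hypothetical word-count sketch in PySpark:
     pairs = sc.parallelize(["a", "b", "a", "c", "a"]).map(lambda w: (w, 1))
     counts = pairs.reduceByKey(lambda a, b: a + b)     # only partial sums are shuffled
     counts2 = pairs.groupByKey().mapValues(sum)        # every (word, 1) pair is shuffled, then summed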
  5. How can you create an RDD for a text file? 
     Ans: Use SparkContext.textFile.
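     For example, in PySpark (the path is a placeholder):
     lines = sc.textFile("hdfs:///data/sample.txt")   # one RDD element per line of the file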
  6. How would you hint at a minimum number of partitions for a transformation?
    Ans: You can request a minimum number of partitions by using the second input parameter of many transformations.
    scala> sc.parallelize(1 to 100, 2), where 2 is the requested number of partitions
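    The same hint in PySpark (paths are placeholders):
    rdd = sc.textFile("hdfs:///data/sample.txt", minPartitions=4)   # ask for at least 4 partitions
    rdd2 = sc.parallelize(range(100), 4)                            # split the collection into 4 partitions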
  7. You have an RDD storage level defined as MEMORY_ONLY_2; what does _2 mean?
    Ans: The _2 suffix in the name denotes 2 replicas, i.e. each cached partition is replicated on two cluster nodes.
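    For example, in PySpark:
    from pyspark import StorageLevel
    rdd = sc.parallelize(range(1000))
    rdd.persist(StorageLevel.MEMORY_ONLY_2)   # kept in memory and replicated on two executors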
  8. What is checkpointing?  
    Ans: Checkpointing is the process of truncating an RDD's lineage graph and saving the actual intermediate RDD data to a reliable distributed file system (e.g., HDFS) or the local file system. You mark an RDD for checkpointing by calling RDD.checkpoint(); the RDD is then saved to a file inside the checkpoint directory and all references to its parent RDDs are removed. checkpoint() has to be called before any job has been executed on the RDD.
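    A minimal PySpark sketch (the checkpoint directory is a placeholder):
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")           # must be set before checkpointing
    rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
    rdd.checkpoint()                                         # mark the RDD before any job runs on it
    rdd.count()                                              # the action materializes the checkpoint and truncates the lineage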
  9. What is the DAGScheduler and how does it work?
    Ans: DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling: after an RDD action has been called, it becomes a job that is then transformed into a set of stages submitted as TaskSets for execution. DAGScheduler uses an event-queue architecture in which a thread can post DAGSchedulerEvent events (e.g., a new job or stage being submitted) that DAGScheduler reads and processes sequentially.
  10. What is Hive on Spark?
    Ans: Hive on Spark means running Apache Hive with Spark as its execution engine instead of MapReduce. Hive, a component of Hadoop distributions such as Hortonworks' Data Platform (HDP), provides an SQL-like interface to the stored data, and Hive users keep the complete set of Hive's rich features, including any new features Hive might introduce in the future, while their queries run on Spark.
    The main task in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plan produced by the semantic analyzer is translated into a task plan that Spark can execute. It also covers query execution, where the generated Spark plan is actually executed on the Spark cluster.

Hard Questions

1. How do you handle skewed data in Spark?

Ans: Data skew means a few key values account for most of the rows, so a handful of tasks do most of the work while the rest sit idle. Common remedies are salting the skewed key (see the next question), broadcasting the smaller table so the skewed side is not shuffled, and enabling Adaptive Query Execution's skew-join handling in Spark 3.x, as sketched below.
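
For instance, on Spark 3.x the adaptive skew-join handling can be switched on like this (a sketch, assuming a SparkSession named spark):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")   # lets Spark split oversized join partitions at runtime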

2. What is the salting technique in Spark?

Ans: Salting adds a random suffix (the "salt") to a skewed join or grouping key so that rows sharing a hot key are spread across several partitions. The other side of a join is exploded over all possible salt values, and the salt is dropped after the join or aggregation.
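
A rough PySpark sketch of salting an aggregation key (df, department_id and the bucket count are hypothetical):

from pyspark.sql import functions as F

SALT_BUCKETS = 10
salted_counts = (df
    .withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    .groupBy("department_id", "salt").agg(F.count("*").alias("partial_cnt"))   # partial counts per salt bucket
    .groupBy("department_id").agg(F.sum("partial_cnt").alias("cnt")))          # merge the buckets into the final count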

3. Optimize the given Spark code.

df = sparkSession.read.parquet("path of the file")
df1 = df.groupBy("department_id").count()
df1 = df1.filter("department_id='101'")
df1.write.save("output path")
df3 = df1.join(df2, ['department_id'], 'inner')
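
A possible optimized version (a sketch; df2 is assumed to be another DataFrame keyed by department_id, and the paths are placeholders taken from the question):

from pyspark.sql import functions as F

df = sparkSession.read.parquet("path of the file")

# Filter before aggregating: the predicate is pushed down to the Parquet scan,
# so only department 101 is read, grouped and shuffled.
df1 = (df.filter("department_id = '101'")
         .groupBy("department_id")
         .count())

df1.cache()                     # df1 is reused (write + join), so avoid recomputing it
df1.write.save("output path")

# df1 holds at most one row here, so broadcast it to turn the shuffle join into a broadcast join.
df3 = df2.join(F.broadcast(df1), ["department_id"], "inner")

The points an interviewer usually looks for are filtering before the wide aggregation, caching a result that is used more than once, and broadcasting the small side of the join.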
