Spark Join Types
Spark supports the following join types:
- Inner Join
- Left / Left Outer Join
- Right / Right Outer Join
- Outer / Full Join
- Cross Join
- Left Anti Join
- Left Semi Join
Handling nulls:
- Dropping nulls: we can remove rows containing null values using the drop method, for example df.dropna().
- Filling nulls: we can replace null values with a specified constant or a calculated value.
RDD, DataFrame and Dataset are the APIs, or abstractions, for working with data in Apache Spark. Each has different use cases, so the choice of one over another depends on the project.
Before running a Spark application, we first need to decide the deployment mode. There are three types:
1. Local mode: used for development and debugging…
2. Client mode: the driver runs on the machine that submits the application.
3. Cluster mode: the driver runs on a node inside the cluster.
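As a sketch, the mode is typically selected via spark-submit flags (the application file name is an assumption):

```shell
# Local mode: everything runs in a single JVM on this machine.
spark-submit --master "local[*]" app.py

# Client mode on YARN: driver runs where spark-submit is invoked.
spark-submit --master yarn --deploy-mode client app.py

# Cluster mode on YARN: driver runs inside the cluster.
spark-submit --master yarn --deploy-mode cluster app.py
```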
Apache Spark is an open-source, general-purpose, unified distributed data processing engine that is easily pluggable with different input and output sources. It is a fast, in-memory processing framework.
Common interview questions:
- What is the difference between RDD and DataFrame?
- What is the difference between DataFrame and Dataset?
- What is the difference between MapReduce and Spark? Or: how does Spark achieve up to 100x faster…
This program provides a basic understanding of how Spark works, and it is frequently asked in data engineer interviews. Using a Spark RDD (Python): from pyspark.sql import SparkSession; sparkSession = …
Common Spark optimization techniques:
- Broadcast small datasets
- Use caching
- Handle skewed data
- Minimize shuffles
- Use appropriate partitioning
- Use DataFrames over RDDs
- Avoid custom UDFs
- Use the right amount of resources
- Optimise joins
- Predicate pushdown
- Use the latest Spark version
The out-of-memory (OOM) error is a common error in Spark. It can occur on both the driver and the executor nodes. Out-of-memory error on the driver: on the driver node, an OOM error can occur when we are using…
What is the Catalyst optimiser?