Spark Join Types
Spark supports the following join types:
- Inner Join
- Left / Left Outer Join
- Right / Right Outer Join
- Outer / Full Join
- Cross Join
- Left Anti Join
- Left Semi Join
Handling nulls:
- Dropping nulls: we can remove rows containing null values using the drop method, for example df.dropna().
- Filling nulls: we can replace null values with a specified constant or a calculated value.
RDD, DataFrame and Dataset are the APIs, or abstractions, for working with data in Apache Spark. Each has different use cases, so the choice of one over another depends on the project.
Before running a Spark application, we first need to decide the deployment mode. There are three types:
1. Local mode: used for development and debugging…
2. Client mode: the driver runs on the machine that submits the application.
3. Cluster mode: the driver runs on a node inside the cluster.
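As a sketch, the mode is typically selected via spark-submit flags (the application file name is an assumption):

```shell
# Local mode: everything runs in a single JVM on this machine.
spark-submit --master "local[*]" app.py

# Client mode on YARN: driver runs where spark-submit is invoked.
spark-submit --master yarn --deploy-mode client app.py

# Cluster mode on YARN: driver runs inside the cluster.
spark-submit --master yarn --deploy-mode cluster app.py
```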
Apache Spark is an open-source, general-purpose, unified distributed data processing engine that is easily pluggable with different input and output sources. It is a fast, in-memory processing framework.
Common interview questions:
- What is the difference between RDD and DataFrame?
- What is the difference between DataFrame and Dataset?
- What is the difference between MapReduce and Spark? Or: how does Spark achieve up to 100x faster…
This program provides a basic understanding of how Spark works, and it is frequently asked in data engineer interviews. Using a Spark RDD (Python): from pyspark.sql import SparkSession; sparkSession = …
Common Spark optimization techniques:
- Broadcast small datasets
- Use caching
- Handle skewed data
- Minimize shuffles
- Use appropriate partitioning
- Use DataFrames over RDDs
- Avoid custom UDFs
- Use the right amount of resources
- Optimise joins
- Predicate pushdown
- Use the latest Spark version
The out-of-memory (OOM) error is a common error in Spark. It can occur on both the driver and the executor nodes. Out-of-memory error on the driver: on the driver node, an OOM error can occur when we are using…
What is the Catalyst optimiser?