What is Apache Spark?
Apache Spark is an open-source, general-purpose, unified distributed data processing engine that plugs easily into different input and output sources. It is a fast, in-memory processing framework which…
What is the difference between an RDD and a DataFrame? What is the difference between a DataFrame and a Dataset? What is the difference between MapReduce and Spark? OR How does Spark achieve 100x faster…
This program gives a basic understanding of how Spark works, and it is frequently asked about in data engineer interviews. Using Spark RDD in Python: from pyspark.sql import SparkSession sparkSession =…
Broadcast small datasets: Use cache: Handle skewed data: Minimize shuffle: Choose appropriate partitions: Use DataFrames over RDDs: Avoid custom UDFs: Use the right number of resources: Optimise joins: Predicate pushdown: Latest spark…
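As an illustration of the first tip, here is a minimal plain-Python sketch of the idea behind broadcasting a small dataset. It is not Spark code: the function `broadcast_hash_join` and the sample `orders` and `countries` data are hypothetical, made up for this example. It assumes an inner join in which every partition of the large side gets a full copy of the small side, so rows of the large side never have to move between partitions.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against a full copy of the small side.

    Hypothetical helper for illustration only; it is not part of any Spark API.
    """
    # Build the lookup table once. In a cluster, this is the copy that would be
    # shipped ("broadcast") to every worker.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    for partition in large_partitions:
        # Each partition is joined locally against the lookup table, so no rows
        # of the large side are exchanged between partitions (no shuffle).
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined


# Hypothetical sample data: the large side is already split into two partitions.
orders = [
    [(1, "order-a"), (2, "order-b")],
    [(1, "order-c"), (3, "order-d")],
]
countries = [(1, "IN"), (2, "US")]

result = broadcast_hash_join(orders, countries)
print(result)
```

In PySpark itself, the usual way to request this behaviour is to wrap the small DataFrame in `pyspark.sql.functions.broadcast` inside the join, for example `large_df.join(broadcast(small_df), "key")`, where `large_df`, `small_df`, and `"key"` are placeholder names.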
Out-of-memory (OOM) errors are common in Spark. They can occur on both the driver and the executor nodes. Out-of-memory error on the driver: On the driver node, an OOM error occurs when we are using…
What is the Catalyst optimiser?