Spark RDD vs DataFrame vs Dataset

RDD, DataFrame, and Dataset are the APIs, or abstractions, that Apache Spark provides for working with data. Each suits different use cases, which determine which one to pick in a project.

What is RDD?

RDD stands for Resilient Distributed Dataset and is the core data structure of Spark. It is a collection of partitions distributed across the nodes of the cluster, and data processing happens in parallel across all of those nodes.
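
To see partitioning concretely, here is a minimal sketch (the partition count of 4 is an arbitrary choice for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), numSlices=4)  # ask for 4 partitions
print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # elements grouped per partition: [[0, 1], [2, 3], [4, 5], [6, 7]]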

If any node fails during data processing, the RDD can recover the lost data by re-running its chain of transformations on another healthy node.
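
This recovery works because every RDD remembers its lineage, the chain of transformations that produced it. A minimal sketch of inspecting that lineage in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
squared = spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x * x)
# Prints the parent RDDs that Spark would replay to rebuild lost partitions
print(squared.toDebugString().decode())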

There are three ways to create an RDD:

1. Using an existing collection of data.

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Distribute the local list across the cluster as an RDD
num_rdd = sparkSession.sparkContext.parallelize(my_list)
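
To confirm the list is now a distributed dataset, run an action on it; for this input the results are:

print(num_rdd.count())    # 10
print(num_rdd.sum())      # 55
print(num_rdd.collect())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], brought back to the driver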

2. Reading a file from an external source.

sparkSession = SparkSession.builder.getOrCreate()
# Each line of the file becomes one element of the RDD
rdd_file = sparkSession.sparkContext.textFile("path of the file")
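
textFile() returns an RDD whose elements are the lines of the file. As a sketch (the path sample.txt below is hypothetical):

lines = sparkSession.sparkContext.textFile("sample.txt")  # "sample.txt" is a hypothetical path
print(lines.count())  # number of lines in the file
print(lines.take(2))  # the first two lines as Python strings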

3. Creating a new RDD from an existing RDD.

# Transformations such as map() produce a new RDD from an existing one
rdd1 = num_rdd.map(lambda x: x * x)
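
Transformations are lazy, so rdd1 is only computed when an action runs. Continuing with num_rdd from above (a sketch; outputs shown for the list 1 to 10):

evens = rdd1.filter(lambda x: x % 2 == 0)  # keep only the even squares
print(evens.collect())                     # [4, 16, 36, 64, 100]
print(rdd1.reduce(lambda a, b: a + b))     # 385, the sum of squares of 1..10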

What is DataFrame?
