Spark RDD vs DataFrame vs Dataset

RDD, DataFrame, and Dataset are the APIs, or abstractions, that Apache Spark provides for working with data. Each suits different use cases, which determine which one to pick in a project.

What is RDD?

RDD stands for Resilient Distributed Dataset and is the core data structure of Spark. It is a collection of partitions distributed across the nodes of the cluster, and data processing happens in parallel across all of those nodes.
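
To see partitioning concretely, here is a minimal sketch (the partition count of 4 is an arbitrary choice for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), numSlices=4)  # ask for 4 partitions
print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # elements grouped per partition: [[0, 1], [2, 3], [4, 5], [6, 7]]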

If any node fails during data processing, the RDD can recover the lost data by re-running its chain of transformations on another healthy node.
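
This recovery works because every RDD remembers its lineage, the chain of transformations that produced it. A minimal sketch of inspecting that lineage in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
squared = spark.sparkContext.parallelize([1, 2, 3]).map(lambda x: x * x)
# Prints the parent RDDs that Spark would replay to rebuild lost partitions
print(squared.toDebugString().decode())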

There are three ways to create an RDD:

1. Using an existing collection of data.

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Distribute the local list across the cluster as an RDD
num_rdd = sparkSession.sparkContext.parallelize(my_list)
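
To confirm the list is now a distributed dataset, run an action on it; for this input the results are:

print(num_rdd.count())    # 10
print(num_rdd.sum())      # 55
print(num_rdd.collect())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], brought back to the driver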

2. Reading a file from an external source.

sparkSession = SparkSession.builder.getOrCreate()
# Each line of the file becomes one element of the RDD
rdd_file = sparkSession.sparkContext.textFile("path of the file")
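
textFile() returns an RDD whose elements are the lines of the file. As a sketch (the path sample.txt below is hypothetical):

lines = sparkSession.sparkContext.textFile("sample.txt")  # "sample.txt" is a hypothetical path
print(lines.count())  # number of lines in the file
print(lines.take(2))  # the first two lines as Python strings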

3. Creating a new RDD from an existing RDD.

# Transformations such as map() produce a new RDD from an existing one
rdd1 = num_rdd.map(lambda x: x * x)
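
Transformations are lazy, so rdd1 is only computed when an action runs. Continuing with num_rdd from above (a sketch; outputs shown for the list 1 to 10):

evens = rdd1.filter(lambda x: x % 2 == 0)  # keep only the even squares
print(evens.collect())                     # [4, 16, 36, 64, 100]
print(rdd1.reduce(lambda a, b: a + b))     # 385, the sum of squares of 1..10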

What is DataFrame?
