In Spark, the SparkContext.parallelize function converts a list of objects into an RDD; the RDD can then be turned into a DataFrame through SparkSession. In PySpark, we call SparkContext.parallelize on a Python list and pass the resulting RDD, together with a schema, to SparkSession.createDataFrame.
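As a quick illustration of just the list-to-RDD step, here is a minimal sketch (the variable names and sample list are for illustration only; it assumes a SparkSession named spark like the one created in the full example below):

numbers = [1, 2, 3]
# Distribute the local Python list across Spark executors as an RDD
rdd = spark.sparkContext.parallelize(numbers)
print(rdd.collect())  # [1, 2, 3]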
The code snippet below prints the following output:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, DecimalType
from decimal import Decimal

appName = "PySpark Example - Python Array/List to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Source data: a Python list of tuples
data = [('Category A', Decimal(100), "This is category A"),
        ('Category B', Decimal(120), "This is category B"),
        ('Category C', Decimal(150), "This is category C")]

# Create a schema for the data frame
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', DecimalType(), True),
    StructField('Description', StringType(), True)
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)
print(df.schema)
df.show()
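Note that SparkSession.createDataFrame can also consume the Python list directly, so the intermediate RDD is optional. A minimal variant of the same example (df2 is an illustrative name):

# Build the data frame straight from the list, skipping parallelize
df2 = spark.createDataFrame(data, schema)
df2.show()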