Pandas DataFrame to Spark DataFrame
The following code snippet shows an example of converting a Pandas DataFrame to a Spark DataFrame:
import mysql.connector
import pandas as pd
from pyspark.sql import SparkSession

appName = "PySpark MySQL Example - via mysql.connector"
master = "local"
spark = SparkSession.builder.master(master).appName(appName).getOrCreate()

# Establish a connection to MySQL
conn = mysql.connector.connect(user='hive',
                               password='hive',
                               database='test_db',
                               host="localhost",
                               port=10101)
query = "SELECT id, value FROM test_table"

# Create a Pandas DataFrame from the query results
pdf = pd.read_sql(query, con=conn)
conn.close()

# Convert the Pandas DataFrame to a Spark DataFrame
df = spark.createDataFrame(pdf)
df.show()
In this code snippet, the SparkSession.createDataFrame API is called to convert the Pandas DataFrame to a Spark DataFrame. This function also accepts an optional parameter named schema, which can be used to specify the schema explicitly; if it is not provided, Spark infers the schema from the Pandas DataFrame's column types.
Spark DataFrame to Pandas DataFrame
The following code snippet converts a Spark DataFrame to a Pandas DataFrame:
pdf = df.toPandas()
Note: this action collects all records of the Spark DataFrame to the driver application, which may cause out-of-memory errors or performance issues for large datasets.
Performance improvement
To improve performance, Apache Arrow can be enabled in Spark to speed up these conversions. Refer to the following article for more details:
Improve PySpark Performance using Pandas UDF with Apache Arrow