Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Scala.
infoThis is the Scala version of article: Change DataFrame Column Names in PySpark
Construct a dataframe
The following code snippet creates a DataFrame from an array of Scala list. Spark SQL types are used to create the schema and then SparkSession.createDataFrame function is used to convert the array of list to a Spark DataFrame object.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val data =
Array(List("Category A", 100, "This is category A"),
List("Category B", 120, "This is category B"),
List("Category C", 150, "This is category C"))
// Create a schema for the dataframe
val schema =
StructType(
StructField("Category", StringType, true) ::
StructField("Count", IntegerType, true) ::
StructField("Description", StringType, true) :: Nil)
// Convert list to List of Row
val rows = data.map(t=>Row(t(0),t(1),t(2))).toList
// Create RDD
val rdd = spark.sparkContext.parallelize(rows)
// Create data frame
val df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()
The data frame looks like the following:
+----------+-----+------------------+
| Category|Count| Description|
+----------+-----+------------------+
|Category A| 100|This is category A|
|Category B| 120|This is category B|
|Category C| 150|This is category C|
+----------+-----+------------------+
Print out column names
DataFrame.columns can be used to print out column list of the data frame:
print(df.columns.toList)
Output:
List(Category, Count, Description)
Rename one column
We can use withColumnRenamedfunction to change column names.
val df2 = df.withColumnRenamed("Category", "category_new")
df2.show()
Output:
scala> df2.show()
+------------+-----+------------------+
|category_new|Count| Description|
+------------+-----+------------------+
| Category A| 100|This is category A|
| Category B| 120|This is category B|
| Category C| 150|This is category C|
+------------+-----+------------------+
Column Categoryis renamed to category_new.
Rename all columns
Function toDFcan be used to rename all column names. The following code snippet converts all column names to lower case and then append '_new' to each column name.
# Rename columns
val new_column_names=df.columns.map(c=>c.toLowerCase() + "_new")
val df3 = df.toDF(new_column_names:_*)
df3.show()
Output:
scala> df3.show()
+------------+---------+------------------+
|category_new|count_new| description_new|
+------------+---------+------------------+
| Category A| 100|This is category A|
| Category B| 120|This is category B|
| Category C| 150|This is category C|
+------------+---------+------------------+
You can use similar approach to remove spaces or special characters from column names.
infoIn Scala, _* is used to unpack a list or array. For this example, the parameter is String*.
Use Spark SQL
Of course, you can also use Spark SQL to rename columns like the following code snippet shows:
df.createOrReplaceTempView("df")
spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show()
The above code snippet first register the dataframe as a temp view. And then Spark SQL is used to change column names.
Output:
scala> spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show()
+------------+---------+------------------+
|category_new|count_new| description_new|
+------------+---------+------------------+
| Category A| 100|This is category A|
| Category B| 120|This is category B|
| Category C| 150|This is category C|
+------------+---------+------------------+
Run Spark code
You can easily run Spark code on your Windows or UNIX-alike (Linux, MacOS) systems. Follow these articles to setup your Spark environment if you don't have one yet: