Scala: Change Data Frame Column Names in Spark

Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Scala.

infoThis is the Scala version of article: Change DataFrame Column Names in PySpark

Construct a dataframe

The following code snippet creates a DataFrame from an array of Scala list. Spark SQL types are used to create the schema and then SparkSession.createDataFrame function is used to convert the array of list to a Spark DataFrame object.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val data = 
Array(List("Category A", 100, "This is category A"),
List("Category B", 120, "This is category B"),
List("Category C", 150, "This is category C"))

// Create a schema for the dataframe
val schema =
  StructType(
    StructField("Category", StringType, true) ::
    StructField("Count", IntegerType, true) ::
    StructField("Description", StringType, true) :: Nil)

// Convert list to List of Row
val rows = data.map(t=>Row(t(0),t(1),t(2))).toList

// Create RDD
val rdd = spark.sparkContext.parallelize(rows)

// Create data frame
val df = spark.createDataFrame(rdd,schema)
print(df.schema)
df.show()

The data frame looks like the following:

+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

Print out column names

DataFrame.columns can be used to print out column list of the data frame:

print(df.columns.toList)

Output:

List(Category, Count, Description)

Rename one column

We can use withColumnRenamedfunction to change column names.

val df2 = df.withColumnRenamed("Category", "category_new")
df2.show()

Output:

scala> df2.show()
+------------+-----+------------------+
|category_new|Count|       Description|
+------------+-----+------------------+
|  Category A|  100|This is category A|
|  Category B|  120|This is category B|
|  Category C|  150|This is category C|
+------------+-----+------------------+

Column Categoryis renamed to category_new.

Rename all columns

Function toDFcan be used to rename all column names. The following code snippet converts all column names to lower case and then append '_new' to each column name.

# Rename columns
val new_column_names=df.columns.map(c=>c.toLowerCase() + "_new")
val df3 = df.toDF(new_column_names:_*)
df3.show()

Output:

scala> df3.show()
+------------+---------+------------------+
|category_new|count_new|   description_new|
+------------+---------+------------------+
|  Category A|      100|This is category A|
|  Category B|      120|This is category B|
|  Category C|      150|This is category C|
+------------+---------+------------------+

You can use similar approach to remove spaces or special characters from column names.

infoIn Scala, _* is used to unpack a list or array. For this example, the parameter is String*.

Use Spark SQL

Of course, you can also use Spark SQL to rename columns like the following code snippet shows:

df.createOrReplaceTempView("df")
spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show()

The above code snippet first register the dataframe as a temp view. And then Spark SQL is used to change column names.

Output:

scala> spark.sql("select Category as category_new, Count as count_new, Description as description_new from df").show()
+------------+---------+------------------+
|category_new|count_new|   description_new|
+------------+---------+------------------+
|  Category A|      100|This is category A|
|  Category B|      120|This is category B|
|  Category C|      150|This is category C|
+------------+---------+------------------+

Run Spark code

You can easily run Spark code on your Windows or UNIX-alike (Linux, MacOS) systems. Follow these articles to setup your Spark environment if you don't have one yet: