Scala: Parse JSON String as Spark DataFrame

Raymond Tang, 12/17/2020

This article shows how to convert a JSON string to a Spark DataFrame using Scala. The approach is suited to processing small, in-memory JSON strings.

Sample JSON string

The following sample JSON string will be used. It is a simple JSON array with three items. Each item has two attributes, ID and ATTR1, holding an integer and a string respectively.

[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}
]

Read JSON string

In Spark, a DataFrameReader object can be used to read JSON. Its json method has an overload that accepts a Dataset[String]:

def json(jsonDataset: Dataset[String]): DataFrame

Refer to the following official documentation for more details about this function.

Spark 3.0.1 ScalaDoc - org.apache.spark.sql.DataFrameReader

*Note - this overload is available from Spark 2.2.0 onwards. Earlier versions accept an RDD[String] instead.

To create a DataFrame object, we first need to convert the JSON string to a Dataset[String].

import org.apache.spark.sql._
import org.apache.spark.sql.types._
// Required for toDS(); spark-shell imports this automatically.
import spark.implicits._

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

val jsonDataset = Seq(json).toDS()

The spark-shell output for jsonDataset looks like the following:

jsonDataset: org.apache.spark.sql.Dataset[String] = [value: string]
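The conversion above relies on the implicits that spark-shell imports automatically. As a minimal sketch of an alternative, the encoder can be passed explicitly with createDataset. The local SparkSession and the application name below are assumptions made for illustration only:

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Assumption: a local session is sufficient for this illustration.
// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-string-demo") // hypothetical application name
  .getOrCreate()

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// Pass the String encoder explicitly instead of relying on spark.implicits._
val jsonDataset: Dataset[String] = spark.createDataset(Seq(json))(Encoders.STRING)

// The whole JSON array is held as a single string element, so this prints 1.
println(jsonDataset.count())
```

Either way, the dataset holds the raw text only. The JSON is not parsed until it is passed to the reader.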

Now we can use the read method of the SparkSession object to read directly from the above dataset:

val df = spark.read.json(jsonDataset)
df: org.apache.spark.sql.DataFrame = [ATTR1: string, ID: bigint]

Spark automatically inferred the schema from the JSON and mapped the values to Spark data types. Note that ID is inferred as bigint (LongType) rather than integer, and ATTR1 as string.

The content of the data frame looks like the following:

scala> df.show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
+-----+---+
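If the inferred types are not what is wanted, a schema can be supplied to the reader instead of letting Spark infer one. The following is a sketch only: the choice of IntegerType for ID, the local SparkSession and the name typedDf are assumptions for illustration, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-schema-demo") // hypothetical application name
  .getOrCreate()
import spark.implicits._

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// Hypothetical schema for this sketch: ID as a 32-bit integer
// rather than the bigint that inference produces.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("ATTR1", StringType, nullable = true)
))

// schema(...) disables inference; columns follow the order given in the schema.
val typedDf = spark.read.schema(schema).json(Seq(json).toDS())
typedDf.printSchema()
```

Supplying the schema also avoids the extra pass over the data that inference needs, which matters more for large inputs than for a small string like this one.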

Read from multiple JSON string variables

In the above example, we only read data from one JSON string object. We can use Seq to combine multiple JSON strings into a single dataset.

The following code snippet shows how to do that:

val json1 = """[
{"ID":4,"ATTR1":"123"},
{"ID":5,"ATTR1":"456"},
{"ID":6,"ATTR1":"789"}]"""

spark.read.json(Seq(json,json1).toDS()).show()

Output:

scala> spark.read.json(Seq(json,json1).toDS()).show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
|  123|  4|
|  456|  5|
|  789|  6|
+-----+---+

The schema of the DataFrame contains two fields, with data types StringType and LongType respectively:

scala> spark.read.json(Seq(json,json1).toDS()).schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(ATTR1,StringType,true), StructField(ID,LongType,true))
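Each element of the Dataset[String] is parsed as its own JSON document, so the strings do not have to be arrays. As a small sketch under that assumption, each element below is a single JSON object (the variable names records and objectsDf, and the local SparkSession, are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-objects-demo") // hypothetical application name
  .getOrCreate()
import spark.implicits._

// One JSON object per element instead of one array per element.
val records = Seq(
  """{"ID":1,"ATTR1":"ABC"}""",
  """{"ID":2,"ATTR1":"DEF"}"""
)

// Each string becomes one row; the schema is inferred across all elements.
val objectsDf = spark.read.json(records.toDS())
objectsDf.show()
```

This form is convenient when the records arrive one at a time, for example from a collection of individual messages.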

Summary

When reading data directly from a database or from structured files, similar read functions can be used to easily convert the input dataset to a Spark DataFrame.
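As one illustration of the file-based case, the same json method also accepts a path. The sketch below writes a small DataFrame to a temporary directory and reads it back. The temporary location, the variable names and the local SparkSession are assumptions made for this example:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-file-demo") // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Hypothetical output location inside a fresh temporary directory.
val outDir = Files.createTempDirectory("json-demo").resolve("out").toString

// Build a small DataFrame from an in-memory JSON string, as in the article.
val source = spark.read.json(
  Seq("""[{"ID":1,"ATTR1":"ABC"},{"ID":2,"ATTR1":"DEF"}]""").toDS()
)

// write.json produces files with one JSON object per line.
source.write.json(outDir)

// The path-based overload reads those files back into a DataFrame.
val fromFiles = spark.read.json(outDir)
fromFiles.show()
```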
