Write and Read Parquet Files in HDFS through Spark/Scala

Raymond Tang, 3/17/2018

In my previous post, I demonstrated how to write and read parquet files in Spark/Scala. In that example, the parquet file destination was a local folder.

Write and Read Parquet Files in Spark/Scala

On this page, I am going to demonstrate how to write and read parquet files in HDFS.

Sample code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetTest {
  def main(args: Array[String]): Unit = {
    // Run locally with two threads: local[2]
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ParquetTest")
    val sc: SparkContext = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    writeParquet(sc, sqlContext)
    readParquet(sqlContext)
    sc.stop()
  }

  def writeParquet(sc: SparkContext, sqlContext: SQLContext): Unit = {
    // Read the CSV file from HDFS. The csv reader already returns a DataFrame,
    // so no RDD-to-DataFrame conversion (toDF) is needed.
    val df: DataFrame = sqlContext.read.format("csv").option("header", "true").load("hdfs://0.0.0.0:19000/Sales.csv")
    // Write the DataFrame to HDFS in parquet format
    df.write.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
  }

  def readParquet(sqlContext: SQLContext): Unit = {
    // Read the parquet files back into a DataFrame
    val newDataDF = sqlContext.read.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
    // Show the contents
    newDataDF.show()
  }
}

The output should be similar to the previous example.
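Since Spark 2.0, SparkSession is the usual entry point in place of SQLContext. The following is only a rough sketch of the same write and read flow under that assumption. The object name ParquetSessionSketch and the helper hdfsPath are made up for illustration, the NameNode address is copied from the sample above, and the overwrite mode is my own addition so that the job can be re-run without failing because /Sales.parquet already exists.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ParquetSessionSketch {
  // NameNode address copied from the sample code; adjust it for your cluster.
  val hdfsRoot = "hdfs://0.0.0.0:19000"

  // Hypothetical helper: build a full HDFS URI from a file name.
  def hdfsPath(name: String): String = s"$hdfsRoot/$name"

  def main(args: Array[String]): Unit = {
    // Run locally with two threads, as in the sample code.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ParquetSessionSketch")
      .getOrCreate()

    // Read the CSV file from HDFS into a DataFrame.
    val salesDF: DataFrame = spark.read
      .format("csv")
      .option("header", "true")
      .load(hdfsPath("Sales.csv"))

    // Assumption: overwrite any existing output instead of failing on re-run.
    salesDF.write.mode(SaveMode.Overwrite).parquet(hdfsPath("Sales.parquet"))

    // Read the parquet files back and show the contents.
    spark.read.parquet(hdfsPath("Sales.parquet")).show()

    spark.stop()
  }
}
```

Without an explicit mode, the writer uses its default behaviour, which is to raise an error when the target path already exists.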

View the parquet files in HDFS

The following command lists the HDFS root directory, where Sales.parquet appears as a folder:

F:\DataAnalytics\hadoop-3.0.0\sbin>hdfs dfs -ls /
Found 4 items
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:42 /Sales.csv
drwxr-xr-x   - fahao supergroup          0 2018-03-17 15:44 /Sales.parquet
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:11 /Sales2.csv
-rw-r--r--   1 fahao supergroup          9 2018-02-19 22:18 /test.txt
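The contents of the Sales.parquet folder can also be listed from Scala through the Hadoop FileSystem API. This is a minimal sketch, assuming the Hadoop client libraries are on the classpath (they usually are in a Spark project) and that the NameNode address matches the one used in the sample code. The object name ListParquetFolder and the helper isDataFile are hypothetical and only serve the illustration.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListParquetFolder {
  // Hypothetical helper: Spark names its data files part-*.parquet;
  // other entries, such as the _SUCCESS marker, are not data files.
  def isDataFile(name: String): Boolean =
    name.startsWith("part-") && name.endsWith(".parquet")

  def main(args: Array[String]): Unit = {
    // NameNode address copied from the sample code; adjust it for your cluster.
    val fs = FileSystem.get(new URI("hdfs://0.0.0.0:19000"), new Configuration())
    try {
      // Print the size in bytes and the name of each entry in the folder.
      fs.listStatus(new Path("/Sales.parquet")).foreach { status =>
        val name = status.getPath.getName
        val kind = if (isDataFile(name)) "data" else "other"
        println(s"${status.getLen}\t$kind\t$name")
      }
    } finally {
      fs.close()
    }
  }
}
```

A folder written by Spark typically holds one or more part files plus a _SUCCESS marker, which is why Sales.parquet shows up as a directory in the listing rather than as a single file.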

You can also browse the files through the HDFS web portal:

https://api.kontext.tech/resource/dac462d6-78ba-50b9-95f4-868376690aad

Navigate into the parquet folder:

https://api.kontext.tech/resource/7af8d53e-4424-5039-8aa6-3e7940228bd7
