Use Hadoop File System Task in SSIS to Write File into HDFS

Raymond Tang | 2/25/2018

Context

SQL Server Integration Services (SSIS) provides several tasks to perform operations against Hadoop, for example:

  • Hadoop File System Task
  • Hadoop Hive Task
  • Hadoop Pig Task

In a Data Flow Task, you can also use:

  • Hadoop HDFS Source
  • Hadoop HDFS Destination

On this page, I'm going to demonstrate how to write a file into HDFS through the SSIS Hadoop File System Task.

References

https://docs.microsoft.com/en-us/sql/integration-services/control-flow/hadoop-file-system-task

Prerequisites

Hadoop

Refer to the following page to install Hadoop if you don't have an instance to play with.

Install Hadoop 3.0.0 in Windows (Single Node)

SSIS

SSIS can be installed via SQL Server Data Tools (SSDT). In this example, I am using SSDT version 15.1.

Create Hadoop connection manager

In your SSIS package, create a Hadoop Connection Manager:

https://api.kontext.tech/resource/83355751-b4ad-5ebd-b1fb-0f15f830dbbf

On the WebHDFS tab of the editor, specify the following details:

https://api.kontext.tech/resource/5f796ff0-df69-50f6-95b8-8993280ca455

Click the Test Connection button to ensure you can connect, and then click OK:

https://api.kontext.tech/resource/0e8e0859-29a3-5ee0-8770-f9dcbb498f77
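If the connection test fails, you can probe the WebHDFS endpoint directly from a command prompt. A minimal check, assuming the default Hadoop 3.x NameNode web port of 9870 on localhost (adjust the host and port to match your cluster):

curl -i "http://localhost:9870/webhdfs/v1/?op=LISTSTATUS"

A 200 response with a JSON FileStatuses body confirms WebHDFS is reachable before you involve SSIS.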

Create a file connection manager

Create a local CSV file

Create a local CSV file named F:\DataAnalytics\Sales.csv with the following content:

Month,Amount
1/01/2017,30022
1/02/2017,12334
1/03/2017,33455
1/04/2017,50000
1/05/2017,33333
1/06/2017,11344
1/07/2017,12344
1/08/2017,24556
1/09/2017,46667

Create a file connection manager

Create a file connection manager named Sales.csv that points to the file created above.

https://api.kontext.tech/resource/36ff826b-47d0-5b09-97aa-4e1c904bca9a

Create Hadoop File System Task

Use the two connection managers created above to create a Hadoop File System Task:

https://api.kontext.tech/resource/40260127-3d99-5698-8f4e-a92ccdeec201

With the above settings, the task uploads the local Sales.csv file to /Sales.csv in HDFS.
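Behind the scenes, the task talks to HDFS through the WebHDFS endpoint configured in the connection manager. For troubleshooting, the same upload can be sketched as raw WebHDFS calls; the two-step CREATE below assumes a default localhost:9870 NameNode:

curl -i -X PUT "http://localhost:9870/webhdfs/v1/Sales.csv?op=CREATE&overwrite=true"

The NameNode responds with a redirect; send the file content with a second PUT to the DataNode URL from the Location header:

curl -i -X PUT -T Sales.csv "<Location header URL from the previous response>"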

Run the package

Run the package or execute the task and confirm it completes successfully:

https://api.kontext.tech/resource/62df3c88-696c-573c-a1b2-1c8cb19b312c
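Besides running it inside SSDT, you can also execute a saved package from the command line with the dtexec utility (the package path below is illustrative):

dtexec /F "F:\DataAnalytics\HdfsUpload.dtsx"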

Verify the result via HDFS CLI

Use the following command to verify that the file was uploaded successfully:

hdfs dfs -ls /

https://api.kontext.tech/resource/6f52ad59-3bbe-517f-9f34-f0f67a1aa6ba

You can also print out the content via the following command:

hdfs dfs -cat /Sales.csv

https://api.kontext.tech/resource/552cd0a1-08a9-5eea-97c4-3d9bd441f84c

Verify the result through the NameNode web UI
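On a default Hadoop 3.x single-node installation, the NameNode web UI file browser is available at http://localhost:9870/explorer.html (adjust the host and port if your install differs). Browse to the root directory to see the uploaded file: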

https://api.kontext.tech/resource/dc95c309-0e8a-5384-b900-a86a32a83d87

https://api.kontext.tech/resource/1a12d2dd-9856-50e9-a865-1552a03ef7d0

WebHDFS REST API reference

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
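As an example, the uploaded file can be read back over raw WebHDFS with an OPEN call; the -L flag lets curl follow the NameNode's redirect to a DataNode (host and port assume a default local install):

curl -L "http://localhost:9870/webhdfs/v1/Sales.csv?op=OPEN"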

Summary

It is very easy to upload files into HDFS through SSIS. You can also upload a whole directory into HDFS through this task if you change the file connection manager to point to a folder.
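For comparison, the HDFS CLI can achieve the same single-file or whole-directory upload; the paths below are illustrative:

hdfs dfs -put F:\DataAnalytics\Sales.csv /Sales.csv
hdfs dfs -put F:\DataAnalytics /DataAnalytics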

If you have any questions, please let me know.
