Create ETL Project for Collecting Sales Data using SSIS

Raymond Tang Raymond Tang 0 8313 2.06 index 6/8/2014

Serial: An Introduction to SQL Server Features

Case Scenario

For this ETL project, the requirements are listed below:

  • Sales data will be pushed to specified shared folder regularly.
  • Data is stored in CSV files with columns: Sale Number, Product Name, Product Color, Sale Amount, Sale Area and date.
  • Minions of sales records are stored in each file.
  • The flat files need to be archived after processing.
  • The process can be scheduled to run automatically and periodically.

Preparation

In order to create dummy sales files, we can use C# to create one console application to generate them randomly.

    class Program
    {
        static int SaleNo = 10000000;
        static void Main(string[] args)
        {
            var folder = @"E:\Temp\Archive";

            var fileCount = 10;
            var saleCountInEachFile = 100000;

            for (int i = 0; i < fileCount; i++)
            {
                var fileName = Path.Combine(folder, string.Format("Sales_{0}.csv", i));
                //if (!File.Exists(fileName))
                //{
                //    File.CreateText(fileName);
                //}
                using (var writer = new StreamWriter(fileName, false, Encoding.UTF8))
                {
                    writer.WriteLine(string.Format("{0},{1},{2},{3},{4},{5}", "SaleNo", "ProductName", "ProductColor", "Amount", "AreaName", "SaleDate"));
                    for (int j = 0; j < saleCountInEachFile; j++)
                    {
                        writer.WriteLine(string.Format("{0},{1},{2},{3},{4},{5}", GetSaleNo(), GetRandomProduct(), GetRandomColor(), GetRandomPrice(), GetRandomArea(), GetRandomDay()));
                    }
                }
                Console.WriteLine("{0} is generated.", fileName);
            }
            Console.ReadKey();
        }

        static int GetSaleNo()
        {
            return SaleNo++;
        }

        static string GetRandomColor()
        {
            var colors = new string[] { "R", "Y" };
            Random gen = new Random();
            return colors[gen.Next(1000) % colors.Length];
        }

        static string GetRandomArea()
        {
            var areas = new string[] { "Earth", "Mars", "Saturn" };
            Random gen = new Random();
            return areas[gen.Next(1000) % areas.Length];
        }

        static int GetRandomPrice()
        {
            var minPrice = 888;
            var maxPrice = 5999;

            return new Random().Next(minPrice, maxPrice);
        }

        static string GetRandomProduct()
        {
            var products = new string[] { "myPhone", "yourPhone", "ourPhone" };
            Random gen = new Random();
            return products[gen.Next(1000) % products.Length];
        }

        static DateTime GetRandomDay()
        {
            DateTime start = new DateTime(2011, 1, 1);
            Random gen = new Random();
            int range = (DateTime.Today - start).Days;
            return start.AddDays(gen.Next(range));
        }
    }

Run the program and 10 CSV files will be created.

Now move these files into parent folder. In next step, we are going to create one SSIS package to load these files into table dbo.Sales and move them to Archive folder.

Create SSIS Package to Load Files

SSIS (SQL Server Integration Service) provides components/tasks to implement complex ETL projects. The following steps will illustrate how to use it build package rapidly.

  1. Create SSIS Project ‘ETL-Sample’

  2. Add package ‘Package-HiSqlServer-LoadSales.dtsx’

  3. Add two variables

https://api.kontext.tech/resource/e50ef09e-8f63-5362-9b66-5345480376f7

  1. Create one connection manager ‘RAYMOND-PC\MSSQL2012.HiSqlServer’ connecting to the database using SQL account.

https://api.kontext.tech/resource/371fde3a-ed14-55ea-ad00-90c5c55ea982

  1. Create another flat file connection manager ‘Flat File Connection Manager - Sales File’ connecting to one of the sales file we created. Use expression to bind the connection to variable FileName. Settings are as following:

https://api.kontext.tech/resource/e634eab8-ce52-5ea0-aae9-89deddf916a0

https://api.kontext.tech/resource/d6623d3d-1a46-541c-ac49-e8811b6187ea

  1. Create Flat File Connection Manager ‘Sales_0.csv’.

https://api.kontext.tech/resource/6e33394c-71ce-5c98-978c-eab53783af02

Set the file path using expression:

https://api.kontext.tech/resource/b0089d77-91f3-512d-a835-798be8242bea

  1. Create Foreach Loop Container ‘Foreach Loop Container - Sales File’ to process each file one bye one. The settings are listed below:

https://api.kontext.tech/resource/1c305d65-7372-57d8-8bc2-5f43c192fa8e

https://api.kontext.tech/resource/aeab99d3-803b-5b84-a1fa-5814859bfeed

(* Set the variable FileName with value from the file path of the current loop)

  1. Create tasks in the container to import data, set variable values and move file to archive folder once processed.

https://api.kontext.tech/resource/4d132f96-cfbe-5d27-9b33-2deea4f01604

The data flow task is going to load data from the file source (Foreach Loop Container - Sales File) and then data is typed and transformed to join dbo.Areas and dbo.Products in database ‘HiSqlServer-Sample’ through connection ‘RAYMOND-PC\MSSQL2012.HiSqlServer’. Finally, data is loaded into the OLE DB destination (table dbo.Sales).

https://api.kontext.tech/resource/d5f29228-4009-56ee-ba63-c14d59112b6b

The script task changes the variable value for the following step.

https://api.kontext.tech/resource/e069617f-4098-5376-a686-33e5dbb9e4f9

The scripts are:

public void Main()
        {
            // TODO: Add your code here
            var filepath = Dts.Variables["User::FileName"].Value.ToString();
            var dir = Path.GetDirectoryName(filepath);
            var newPath = Path.Combine(dir, "Archive");
            Dts.Variables["User::ArchiveFileName"].Value = newPath;

            Dts.TaskResult = (int)ScriptResults.Success;
        }

The final task is a File System Task, which moves the processed file in the loop to the archive folder. Use ‘Sales_0.csv’ as destination connection. The file path of this connection is already changed by the previous script task.

https://api.kontext.tech/resource/c07b0a2c-9d22-5cfd-a2ad-bae377e49061

  1. Execute the package and all the data will be loaded into the database.

https://api.kontext.tech/resource/f9a81431-4ef8-52ad-84ff-7517205c771d

(* Files are being loaded one by one.)

Using the following query, you can find 1,000,000 records are loaded into the database.

SELECT COUNT(*) FROM dbo.Sales

Schedule

Use SQL Server Agent Job, you can set up schedules for running the above package. Please reference <http://technet.microsoft.com/en-us/library/ms139805(v=SQL.90).aspx>.


Join the Discussion

View or add your thoughts below

Comments