Tuesday, April 22, 2008

One more thing about getting a file from the web/SharePoint and using it as a source in SSIS: If you need to authenticate just change the xml.open command to:

xml.open "GET", URL, false, "user", "password"

where user and password are the user & password that has permissions to the desired file. Note that it is VERY recommended to have an application user, so the password won't be changed in the future. If you don't have such user and you must change your password in the future, do not forget to change it in the script. My tip: add a reminder in your calendar to change the password in the script.

In this point I don't know if you can authenticate using SSL or stronger protocols using VB script.

Tuesday, April 22, 2008 10:20:05 PM (Jerusalem Daylight Time, UTC+03:00)
 Saturday, March 15, 2008
It figures that using simple excel file as a source in SSIS is not so trivial, especially if your source is in the web or in your SharePoint portal. At first you'll think it's easy - just declare the excel source as a url (the url of the excel file, for example) and it will succeed. The problem is that Microsoft let you think it's working. Click on the excel source and you'll see in its properties that the source path is the local temporary internet files, meaning that the source is a local copy which is not up-to-date, so it's worth nothing.
Here's what I tried to do and the final (and successful solution):

1. Use the File System task. It won't work because you can't declare an URI there.
2. In the MSDN forum (I can't find the link right now) they say to write a script, so I also tried this. Using the Script task, I written a code in VB.NET which using the System.IO libary of the .NET framework and copies the excel file (using its URI) to the desired location in the local computer. Running it, I got an error saying that the script can't use URIs...
After trying this I understood that every code or action running in the SSIS context won't work with URIs. I'm not sure I know why Microsoft developers built it that way (or maybe it's just another bug). Anyway, the next step is the solution.
3. Build an executable file that performs the desired copy task. You can't use regular batch (.bat) file because DOS/CmdExec does not know how to work with URIs. So, there are two ways to perform this:
a. Download this and use it as a copier from the web.
b. Use the following code and save it as a Visual Basic Script file (*.vbs):

'GetRemoteBinaryFile.vbs
TheFile = "myExcelFile.xls"
DestFolder = "C:\SSIS_Sources"
URL = "http://mySite/myFolder/myExcelFile.xls"
Set xml = CreateObject("Microsoft.XMLHTTP")
xml.Open "GET", URL, False
xml.Send
set oStream = CreateObject("Adodb.Stream")
Const adTypeBinary = 1
Const adSaveCreateOverWrite = 2
Const adSaveCreateNotExist = 1
oStream.type = adTypeBinary
oStream.open
oStream.write xml.responseBody
' Overwrite an existing file
oStream.savetofile DestFolder & TheFile, adSaveCreateOverWrite
oStream.close
set oStream = nothing
Set xml = Nothing

After you have your file (vbs or exe) you can use the Execute Process Task in order to make the copy. In the task, declare that you want it to run your exe or vbs.  After that, just use a normal Data Flow Task, where the source excel file is in the local computer (the file that was copied in the previous task) and the destination is your desired DB.

Note that:
1. Before executing you must have the excel file already placed in your local computer, meaning that you must make the first copy before the first time you run the package. This is because SSIS performs integrity check before running the package and it checks that the file exists.
2. Even if the copy process is long (because it's coming from the web), don't worry. SSIS work synchroniously, meaning that the Data Flow task will not start until the Execute Process task which copies the file will end.

Enjoy.

Update: I added a post about authentication.

Sunday, March 16, 2008 6:47:26 AM (Jerusalem Standard Time, UTC+02:00)
 Friday, February 01, 2008
This is a good one: When you build a flat file connection to a csv file, you can preview the data. There, there's an option to skip some rows (Data rows to skip). If you'll leave it with a number greater than zero - the process itself will skip these rows!! I still wonder if this bug is By Design or not. If you wish, you can track this bug in Microsoft Connect.

Friday, February 01, 2008 2:00:16 PM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 15, 2007
I really think that the time dimension is the most complex dimension in 90% of the DWHs. The complexity is in two places: In the DWH design and also in Analysis Services (or any other BI tool).
First of all - why we didn't take the already-made Server Time Dimension which exists in SSAS 2005? For two reasons: The first is that the Project Real guys do not recommend using it (you can find their SSAS article here). The second is that we wanted to have some features that are not available in the server time dimension, such as Hebrew date. In a matter of fact, even if we didn't have such feature we still would build the time dimension ourselves because it's giving you much more control over the dimension. For example, you can always add some new attributes which Microsoft developers didn't think about.
I started myself to build the time dimension in excel. I figured out that this mission is little more complex that I thought it would. Most of the functions I wrote were simple, but there were some complicated ones. So here are some tips for you if you want to build your time dimension using Excel:
  • If you want to week number for every date, do not write the function yourself... Excel has function called weeknum. If you don't have it just add the function toolbox which has it (I can't recall its name right now. check in excel help).
  • If you want to have records for every level in your hierarchy (not only for days), put every level in its own excel file (not excel tab). It will help you later when you will transfer it to your DB.
  • Check yourself. Pick randomly some dates and check that all of its record has correct data.
After building the excel files I needed to transfer it to my Oracle server. I used SSIS because I didn't want to wait for my DBA to copy these files into the Informatica server (it can't use my the local files, it has to be in its server. SSIS can use local files). This also was a little tricky. First of all, close excel when running the SSIS packages, otherwise it will fail. Second, when moving the non-leaf levels, go into the columns section in the destination box and erase the irrelevant columns. It will reduce the chance for errors. Finally, click on the source box and click on "Show advanced editor". Enter the source's output columns options and define properly the columns' data types. This also will reduce the chance for errors.

I had a little bit of an argument with my DBA about how should the time dimension be. I think that the time dimension does not have to be processes at all. My time dimension is from 1960 until 2020, so no daily ETL is required. She says that all the logic has to be in Informatica so I need to develop a mapping for this dimension. I think that we both are right and that's because that in ideal world she is right. In every developers team, all the BL has to be in one place. But we don't have much time (the deadline is very close) so I won't spend the time building more mapping in Informatica when I have the time dimension already made in excel.

Maybe someday I will have the time to do this. Maybe not.

Monday, October 15, 2007 8:18:38 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, October 14, 2007
I guess that this will not be my last post on this subject, but I want to start sharing some thoughts and tips from my experience when designing and building DWH. In this post I will focus on the fact & dimensions tables relationship in terms of data completeness (if you wonder what it is, read on).

Before you start to design the DWH, sit and talk with the people who built the systems which you take your data from, including the DBA. For every table, ask them what is the primary key (it's NOT always defined properly in the DB), then ask them again and then ask them if they are sure. It happened to me that I discovered that the systems guys were wrong about their DB's primary keys.
The same thing is about Foreign keys and here you should be even more careful. Even if they claim so, check yourself that every foreign key in the fact table is placed correctly in the dimension table, especially when the fact table has far history records. Sometimes system developers or even worse - system DBAs delete records from the dimension tables that are not relevant. This will cause that these keys will still be in the fact's history records but will not be found in the dimension table, causing uncomplete relationship between the fact and the dimension table.

So far is about the part when you talk and "investigate" the system developers (the DWH design). What to do when you actually developing the DWH? First, develop the tables of your dimensions tables. Do not forget to add the primary keys in the dimensions tables and the primary and foreign keys in the fact table. Then develop the ETL processes and go for the dimensions first.  If you know that the dimension has completeness problems with the fact table that you will develop later (you talked with the system developers, remember?), add UNDEFINED (UD key) record for the dimension table. Later, when developing the fact table's ETL process, make Join with the dimension table and check that the records' foreign key exists there. If not - change the key to UD. In SSIS and Informatica (and I guess that also in other products I don't know, such as DataStage) you can use Lookup instead of Joiner if the dimension table is less that 1G records. That will optimize the ETL process. After you developed all your ETLs, run the dimension processes. After they finish (assuming everything went OK) run the fact table's ETL process. If it succeeded you can go and have a drink. If not - check what went wrong. If you want to know which keys didn't showed up in the dimension table and causes the incompleteness problem, you can disable (not delete) the foreign key from the fact table and run the process again. Then, with a simple SQL query, check which foreign keys don't exist in the dimension table. Go back to your ETL design and check what you did wrong. As I pointed before, in this step you might be very angry at the system developers...

That is all for now. As I said, I assume that more ideas will come on in the future.

Sunday, October 14, 2007 7:11:49 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, September 24, 2007
My friend, Ilya, had a problem in SSIS. He had a .csv file with too many commas. The meaning is that strings that started and ended with inverted commas (") and had commas inside it were recognized by SSIS as new column. For example, the row:
"My name, is Miky", 200, 10 was recognised by SSIS as four columns instead of three. Ilya wrote down a code for SSIS (in VB) that run before the package begin its work. Here it is, hope it will help who ever seen this.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic.FileIO
Public Class ScriptMain
Public Sub Main()
  Dim csvFileFullPath As String
  Dim tabFileFullPath As String
  csvFileFullPath = Dts.Connections("Your CSV Connnection").ConnectionString
  tabFileFullPath = Dts.Connections("Your Table Connection").ConnectionString
  Using tabStreamWriter As New StreamWriter(tabFileFullPath, False, System.Text.Encoding.GetEncoding(1255))
  Using csvFileReader As New StreamReader(csvFileFullPath, System.Text.Encoding.GetEncoding(1255),True)
  Dim currentRow As String

  currentRow = csvFileReader.ReadLine()
  tabStreamWriter.WriteLine(currentRow)
  While Not csvFileReader.EndOfStream
    Dim outputRow As New Text.StringBuilder()
    Dim tmp, tmp1 as String
    Dim offset as Int32 = 1
    Dim beginS, endS As Int32

    beginS = 1
    currentRow = csvFileReader.ReadLine()
    beginS = InStr(offset, currentRow, """")
    While Not beginS = 0 Or offset > Len(currentRow)
      endS = InStr(beginS+1, currentRow, """")
      tmp = Mid(currentRow, beginS, endS - beginS)
      tmp1 = Replace(tmp, ",", " ")
      currentRow = Replace(currentRow, tmp, tmp1)
      offset = endS + 1
      beginS = InStr(offset, currentRow, """"")
    End While
    outputRow.Append(currentRow)
    tabStreamWriter.WriteLine(outputRow.ToString())
  End While
End Using
End Using
Dts.TaskResult = Dts.Result.Success
End Sub
End Class

The solution here is to search for any comma (,) that is between two inverted commas (") and replace it by space.
Although it is a good solution, I would take another solution: Replace any comma by special string, such as &Miky&, convert the csv file into table, and after that go over that column(s) and replace any &Miky& by comma.

Monday, September 24, 2007 7:31:56 PM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, September 09, 2007
I was asked how to get to SSIS log to see how much time took for the package to run.
Well, that depends.
On Development:
When developing new package, after running the process (click on the green arrow or press F5) there's a new tab called Progress. Clicking it will show you everything about the package's execution, including the time it started and the time it finished.

On Production:
When developing the package, open the SSIS menu (Yes, there is a menu called as the product's name. Microsoft...) and click on Logging... There, you can define logs for your package. You can log in many ways: Writing to SQL sever, output file, XML file and more. I recommend logging into SQL server and logging only the big and "hard" parts in your data flow. In the Details tab, pick up only the exceptional events, such as onError, onTaskFailed and onWarning. If you wish to know how much time took for you package to run, also pick up onProgress.

Follow this link to read about every event in SSIS.

Monday, September 10, 2007 6:09:23 AM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, July 15, 2007
I never thought that I'll do a commercial to Microsoft, but Project Real is a great thing that they did and they should get the credit for it. This project is a full end-to-end BI solution, including ETLs (using SSIS), Analysis Services cubes and mining modules, Reporting Services reports, end-user Panorama views and more.

We work with Panorama as our main GUI tool to show our users the cube's data as tables, charts, dashboards, etc, so this project is really helping us to learn how to implement our project from the first ETL step all the way to the last Panorama step.

Recommended.

Sunday, July 15, 2007 7:05:54 AM (Jerusalem Daylight Time, UTC+03:00)
 Thursday, May 31, 2007

Few months ago we were given an assignment to copy/move all our DTSs that were running with SQL Server 2000 to the new SQL Server 2005 Integration Services (SSIS). My friend Michael did it and wrote some important notes that he discovered when building ETLs with Integration Services. I decided to list them here because they are important and useful, especially for those who haven't got the time to develop with SSIS so far.

  • One of the greatest improvements in SSIS is that between the source and the destination of the ETL process you can do many things, such as making new fields, sorting, converting data types, union all between different sources, implement your logic on a field, and much more. This is much easier than ever because all you need to do is to add a block to the data flow task and define it for your purposes.
  • SSIS ships with a tool for migrating SQL Server 2000 DTSs. Do not use this tool. Sometimes the result of the conversion is not good enough and in all cases you can't edit the new migrated data task.
  • When making a connection to a non-Microsoft DB, such as Oracle, use OLE DB client instead of the out-of-fashion ODBC.
  • When the destination field is shorter (string type) than the source, add a data conversion block and cut the string. Otherwise, there will be an annoying warning even if the truncation is wanted.
  • In many times (when working with non-Microsoft providers) the automatic recognition of the length & types of the source fields is not correct. Enter the source block and edit these properties by yourself.
  • When moving a Unicode field (data type DT_WSTR) to a non-Unicode destination field (DT_STR), a data conversion block is required.
  • SQL Server 2000 Stored Procedures will work in SSIS, but the Linked Servers definitions are problematic. Consider another options rather than using linked servers.
  • When the source/destination is a CSV file, use Flat File Connection. But if it is an Excel file (.xls), use a Microsoft Jet OLE DB connection and define the source as OLE DB Source (yes, it will work with Excel files).

Again, thanks for Michael for making and sharing these notes.

Friday, June 01, 2007 6:17:10 AM (Jerusalem Daylight Time, UTC+03:00)