Tuesday, June 22, 2010

This is the fourth post about Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the third part of the seminar which discussed SSIS.

  • Log everything that happens inside the package. Use the SSIS log providers and the event handlers. The effect on performance is not significant.
  • Analyize the log data. You can analize total execution time, SSAS partition processing time and much more.
  • Log detailed error information about rejected rows. You can even log the actual error row.
  • Monitor also the execution of the ETL jobs.
  • Data Profiling is a new feature in SSIS 2008. It's used for analysis of the data's quality in the DWH. Go and learn.
  • Configure everything you can in the package configuration, such as: tasks, containers, variables, connection managers and data flow components. You can use XML config file, config from parent package, registry configuration or even configuration from a table.
  • The Cache Transformation saves data in the cache. It's very useful in lookups.
  • Data types: Make them as "small" as possible (better performance). Be aware of precision issues when using money, float and decimal types.
  • Don't sort in SSIS unless it's absolutely necessary.
  • Sometimes using T-SQL will be faster then running it in SSIS.

Some of the tips are also relevant in Informatica. For example, I always let the DB make the sort whenever possible. Sometimes it's also works with joins. Running packages or procedures can be much more faster than running it in the ETL, but if you look at it from maintenance point of view you'll do it in the ETL. It depends on much factors, such as the time you can have to run the ETL, the knowledge in your team and more.

Tuesday, June 22, 2010 10:40:51 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, November 09, 2009

In both SSIS and Informatica, the union transformation doesn't delete duplicate rows. In SSIS it's clear because the transformation is called "Union all", which reminds us the union all statement in SQL which doesn't remove duplicate rows. The confusing part is in Informatica, where the transformation is called Union although it doesn't remove duplicates. One of my teammates got confused this week so I thought it's worth blogging about.

One more tip about ETLs with union transformations: When you'll investigate, maintain or fix ETL it will be much help if you'll know where each row came from. That's why I recommend adding a column named "src" or something like that and store there the source of the row before the union happened.

Monday, November 09, 2009 11:41:31 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, October 27, 2009

Lucky for me, I have the opportunity to work with more than one ETL tool in my daily work. In the far past, I worked much with DTS and PL/SQL packages (which is an ETL, but not exactly a tool). Nowadays I work both with SSIS and Informatica so I can compare them in several aspects. I don't think that one is better than the other. It's just that they have different approaches to the ETL mechanism. Anyway, I always expect that the data transformations will behave the same in every tool and I found one transformation that behaves differently between the tools - the Router.

The Informatica's Router, if you're not familiar with, receives a data row and routes it a specific data flow. Its parallel transformation in SSIS is the Conditional Split. As I said, I always thought that they behave the same. In both transformations you provide boolean expressions that determine where the row will be routed. The difference is that in SSIS, "each input row can be sent to only one output, that being the output for the first condition that evaluates to true" (from SSIS's documentation). In programmer's words, it behaves like switch-case statement with break in each condition. In Informatica, "If a row meets more than one group filter condition, the Integration Service passes this row multiple times" (from Informatica's documentation). Back to the programmer's language, it's like switch-case statement with no break inside the conditions.

If you know other transformation that behaves different between ETL tools, I'll be happy to know.

Tuesday, October 27, 2009 11:10:34 PM (Jerusalem Standard Time, UTC+02:00)
 Monday, September 22, 2008
My friends were stuck with a totally weird bug this week. After a day of frustration they called me for the rescue. It took me some time to figure it out and I think that every SSIS developer (and maybe every developer) can learn a thing or two from others' mistakes.

The mission: The data flow takes one table with duplicate rows and copies it to another table and makes sure that every row will appear only once. In the way, the data flow also adds some irrelevant fields. Among them is the Create_User and Create_Date fields which tells by who and when the package last ran.
How my friends did it: Again, it's a very simple flow. They only added Derived Column transformation to add the new fields and then they added an Aggregate transformation to make every row appear only once.

Note that this is not the real package. It's a sample I did on my machine to show it here.

The Bug: When I first seen this it seemed to me very simple flow and I asked myself how can it be that this is happening:

As you can see, it seems that the Aggregate transformation is not deterministic. Sometimes it outputs 99 rows, sometimes 198 and in some other times I get other results as well.
Investigating: I wanted to see what's the difference between the table that I got in the first time (99 rows) and the table I got in the second time (198 rows) so I changed the destination table and compared the two tables. I ran "select * from A where Column1+Column2+... not in (select Column1+Column2+... from B)"-style query but it was no use - it showed me that there were no rows that appeared only in one of the tables. In this step I really started to think (as my friends did) that maybe the Aggregate transformation has something wrong inside... Instead of blaming Microsoft, I decided to think. I needed to see what can make the flow non-deterministic. Then, it hit me.


The only non-deterministic component in the flow is the Derived Column because it has the getdate() function (it may be simple to see here, but in the original package the derived column transformation had many fields). The results of this function may differ in the milliseconds, especially for large tables. Then I looked in the Aggregate transformation and seen that the Create_Date column also was in the Group by operation, meaning that if two rows has different millisecond they will be placed twice in the destination table, although they are the same in every column. That's it, the bug was found. But still, one question remained: Why the query did not show me this? The answer is also simple but tricky to find: In the comparison query I concatenated all the columns in the tables in order to compare the results. When I did this, I casted the Create_Date to nvarchar which truncated the milliseconds.

Conclusions:
  • Pay attention to non-deterministic elements in what you do, whether it's code or ETL process.
  • When you do dummy stuff like checking all the checkboxes in a list - think what are the outcomes.
  • Call Miky when you're desperate.
Monday, September 22, 2008 8:10:48 AM (Jerusalem Daylight Time, UTC+03:00)