Sunday, October 14, 2007
I guess that this will not be my last post on this subject, but I want to start sharing some thoughts and tips from my experience when designing and building DWH. In this post I will focus on the fact & dimensions tables relationship in terms of data completeness (if you wonder what it is, read on).

Before you start to design the DWH, sit and talk with the people who built the systems which you take your data from, including the DBA. For every table, ask them what is the primary key (it's NOT always defined properly in the DB), then ask them again and then ask them if they are sure. It happened to me that I discovered that the systems guys were wrong about their DB's primary keys.
The same thing is about Foreign keys and here you should be even more careful. Even if they claim so, check yourself that every foreign key in the fact table is placed correctly in the dimension table, especially when the fact table has far history records. Sometimes system developers or even worse - system DBAs delete records from the dimension tables that are not relevant. This will cause that these keys will still be in the fact's history records but will not be found in the dimension table, causing uncomplete relationship between the fact and the dimension table.

So far is about the part when you talk and "investigate" the system developers (the DWH design). What to do when you actually developing the DWH? First, develop the tables of your dimensions tables. Do not forget to add the primary keys in the dimensions tables and the primary and foreign keys in the fact table. Then develop the ETL processes and go for the dimensions first.  If you know that the dimension has completeness problems with the fact table that you will develop later (you talked with the system developers, remember?), add UNDEFINED (UD key) record for the dimension table. Later, when developing the fact table's ETL process, make Join with the dimension table and check that the records' foreign key exists there. If not - change the key to UD. In SSIS and Informatica (and I guess that also in other products I don't know, such as DataStage) you can use Lookup instead of Joiner if the dimension table is less that 1G records. That will optimize the ETL process. After you developed all your ETLs, run the dimension processes. After they finish (assuming everything went OK) run the fact table's ETL process. If it succeeded you can go and have a drink. If not - check what went wrong. If you want to know which keys didn't showed up in the dimension table and causes the incompleteness problem, you can disable (not delete) the foreign key from the fact table and run the process again. Then, with a simple SQL query, check which foreign keys don't exist in the dimension table. Go back to your ETL design and check what you did wrong. As I pointed before, in this step you might be very angry at the system developers...

That is all for now. As I said, I assume that more ideas will come on in the future.