Wednesday, June 15, 2011

For the google search: Informatica service start and stops

Yeah, that's exactly what happened to me. Everything looked just fine until the last step of the installation, where the installer pings the new integration service, but there is no answer.
Looking at the services manager, I see that the InformaticaService is stopped. When I restart it, it stops after 5-10 seconds. I thought it's because of the service's account settings, but it's not. After some time of search and re-attempts to make it work, this is what I can say if you're stuck as I was:

  • While you're in the installation wizard, on the node configuration step, press the "Advances Settings" button. There, you'll see the min. and max. port numbers. Go back to the node configuration and change the port number so it will be in that range (why it wasn't in the first place? ...)
  • The Windows PATH environment variable must contain the informatica server binary folder: <INFORMATICA_INSTALL_DIR>\Server\bin
  • If you're still stuck, look after the service.log file in the informatica directory. Look at the error and google it.
Hope I saved somebody a few hours.

Thursday, June 16, 2011 6:57:43 AM (Jerusalem Daylight Time, UTC+03:00)
 Monday, September 06, 2010
There are two ways to search for texts in mappings. The first one is to use the "Find in workspace" tool inside Informatica (CTRL+F). The problem with this tool is that it's looking for ports and transformation names. What can you do when you need to know in which transformations you used your unconnected lookup? For this sort of questions, you can use the other way: Export the mapping to XML file (Repository -> Export Objects) and search inside the file. There, you can search inside expressions, source qualifiers overrides, lookup filters and so on.
Monday, September 06, 2010 6:53:04 PM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, September 05, 2010

After a long break from blogging which caused by a lot of work in my organization, it's time to come back. I learned a lot in the last months and I have a lot to write. This time I'll write about working with static parameters in Informatica (version 8.6.0 HotFix 3).

Let's start from the beginning: Why should you work with parameters?
The answer (at least the main answer) is simple: It prevents hardcoding. For example, let's assume you write the following formula in your mapping: IIF(MyPort > 5, TRUE, FALSE). As we know, in the real world everything can change and now the formula needs to have 6 instead of 5. Changing this will make us open the mapping, edit it and worst of all - install it in the production environment, and as we know it can always make a lot of trouble. If the number 5 was stored in parameter file, all we needed to do is to edit the parameters file (in the production, of course) and that's it! Another good reason is that sometimes you can use the same mapping several times, each time with different parameters and that can ease the development for you.

How can we make it? We'll do it step by step (by the way, in order to learn this thing create a small workflow just for training):

Step 1 - Using Parameters in the Mapping
In your mapping, go to the Mappings menu and choose "Parameters and Variables...". There, you can create and edit your parameters (only within this mapping, of course). The name of the parameter should start with $$. (For example: $$MyParam). These are the properties you can edit for every parameter:

  • Name - Like I mentioned, should start with $$. Make the name as much descriptive as you can.
  • Type - Parameter can't be changed during the session and variable can. In our scenario we'll use parameters which are taken from parameters file, so choose Parameter.
  • Datatype - There's no much to explain.
  • Precision - If you're using string, make the precision at least 100 chars. It will prevent troubles when your parameters contains expression.
  • Scale - comes along with precision, where it relevant.
  • Aggregation - Relevant when you use multiple partitions in the pipeline. We'll leave it for now as the default.
  • IsExprVar - Very important. It determines if the parameter is an expression or a static value. Practically, if you use it inside an expression transformation, it should be True. Otherwise, it should be false.
After you defined your parameters, you can use it in your mapping in several ways:
  • Source Qualifier - you can override the whole SQ or add a source filter from parameter.
  • Lookup - you can override the lookup's sql or just its filter.
  • Expressions - Using a parameter as a placeholder inside expression is quite difficult because Informatica adds "" before and after the expression. Coming back to the example above, that's why we can't store 5 in the parameter. What we're doing to solve that is to store the whole expression in the parameter (the whole IIF(...)). After doing that, we've find out that this makes the parameter file clearer.
Step 2 - Build your parameter file
The parameters file, as its name suggest, holds the values of the parameters for the whole workflow. There are three types of lines in the param file:
  • Comment - starts with #. Very helpful for describing the other lines and for making order in the file.
  • Section headers - The first section contains the global parameters which are relevant for all the params in the workflow. The section starts with line contains only: [Global] (this should be the first line in your param file). After this (first) line, write all the global parameters. The other sections are session-specific, meaning that the params in these section will only affect one session. These section headers will look like this: [MyInformaticaFolder.WF:MyWorkflow.WT:MyWorklet.WT:MyInnerWorklet.ST:MySession]. The parameters that will come after this header will only affect the session called MySession which is inside the the worklet MyInnerWorklet and so on.
  • Parameters - each line in the file will contain only one parameter. The syntax is: $$MyParam=value. Don't write space between the equality sign (=) and the value because it will enter into the parameter itself.

As you can see, it's quite simple. Now, let's end this with the last step.

Step 3 - Attaching the parameter file to the workflow
Very easy. In the workflow manager, open the workflow and go to Workflows -> Edit -> Properties. Insert the parameter filename (with the path, of course) and that's it.
Now, for some important tips that will save you a lot of time:
  • The integration service reads the param file only when you run the workflow. Re-running a session or worklet after changing the file won't affect the results.
  • Unline what they write here, you can pass parameters to the mapplet. In order to do so, this is the correct syntax in the param file: MyMappletName.$$MyParam=TheValue. When this line is under the session header, it will only affect the mapplet called MyMappletName which is inside the session/mapping.
  • Don't use initial values for parameters. You won't know when the param file is not correctly attached to the session.

There's much more to discuss - using Informatica, you can use dynamic parameters and variables. You can get a value out from a session and use it in other sessions. I'll leave it for now for future posts.
Sunday, September 05, 2010 9:13:29 PM (Jerusalem Daylight Time, UTC+03:00)
 Thursday, July 15, 2010

No-one in my organization knew what is the difference between stop and abort in the Informatica Workflow Monitor. I googled it, and here is the answer in short term: Abort is killing the process in the underlying operating system, meaning that the memory that the session took for itself won't be released. Most of the OS don't have good-enough garbage collectors, meaning that this memory allocation won't be free until you'll restart the whole server(!). The stop command will ask the session to quit whenever it can, meaning that this will take more time.

Use the abort command only if you must stop the session right now. If you see that over time your informatica server goes slower and slower, restart the server in the weekend.

You can see the full answer here.

Thursday, July 15, 2010 10:37:52 PM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, January 19, 2010

This tip can save you a lot of time because when you don't follow it or you're not aware of it, finding the bug will take you a lot of time.

When you build your workflow or worklet in Informatica, you usually define dependencies between the tasks, meaning that task A will start only after task B finishes. You do it using the "Link Tasks" button in the toolbar. The issue is when you define for example that task C will run only after both task A and task B will finish. In that case, linking the tasks is not enough. You need to edit task C and choose "AND" in the "Treat the input links as" box. The default is "OR", meaning that the task will run when either task A or task B finishes. This can cause you a lot of trouble if task C is using data from task A and B (otherwise, why there's dependency between them?).

Tuesday, January 19, 2010 11:40:13 PM (Jerusalem Standard Time, UTC+02:00)
 Monday, November 09, 2009

In both SSIS and Informatica, the union transformation doesn't delete duplicate rows. In SSIS it's clear because the transformation is called "Union all", which reminds us the union all statement in SQL which doesn't remove duplicate rows. The confusing part is in Informatica, where the transformation is called Union although it doesn't remove duplicates. One of my teammates got confused this week so I thought it's worth blogging about.

One more tip about ETLs with union transformations: When you'll investigate, maintain or fix ETL it will be much help if you'll know where each row came from. That's why I recommend adding a column named "src" or something like that and store there the source of the row before the union happened.

Monday, November 09, 2009 11:41:31 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, October 27, 2009

Lucky for me, I have the opportunity to work with more than one ETL tool in my daily work. In the far past, I worked much with DTS and PL/SQL packages (which is an ETL, but not exactly a tool). Nowadays I work both with SSIS and Informatica so I can compare them in several aspects. I don't think that one is better than the other. It's just that they have different approaches to the ETL mechanism. Anyway, I always expect that the data transformations will behave the same in every tool and I found one transformation that behaves differently between the tools - the Router.

The Informatica's Router, if you're not familiar with, receives a data row and routes it a specific data flow. Its parallel transformation in SSIS is the Conditional Split. As I said, I always thought that they behave the same. In both transformations you provide boolean expressions that determine where the row will be routed. The difference is that in SSIS, "each input row can be sent to only one output, that being the output for the first condition that evaluates to true" (from SSIS's documentation). In programmer's words, it behaves like switch-case statement with break in each condition. In Informatica, "If a row meets more than one group filter condition, the Integration Service passes this row multiple times" (from Informatica's documentation). Back to the programmer's language, it's like switch-case statement with no break inside the conditions.

If you know other transformation that behaves different between ETL tools, I'll be happy to know.

Tuesday, October 27, 2009 11:10:34 PM (Jerusalem Standard Time, UTC+02:00)
 Thursday, July 09, 2009

The error function in Informatica can be very useful when you want to know why you have failed rows without failing the whole process. The use of the function is ERROR(string). Calling this function will make the error appear in the session log and the row will be ignored. For example: IIF(my_date_field > sysdate, ERROR('Futuristic Date'), my_date_field). This will ignore futuristic dates and show them only in the session log. By the way, you can declare default value for the port and the process will replace the ERROR with the default output.

Thursday, July 09, 2009 9:19:26 PM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, March 22, 2009

There are some things you can only learn in the hard way. It didn't happenned to me personally but to my team friends, so I consider it as it is my bad.

We upgraded our ETL tool - Informatica, from version 8.5 to 8.6. We had to run some tests to see that the results are the same. So, what we did is to save the result table from 8.5 in Excel, save the result table from 8.6 in Excel and then compare them using Excel-built-in functions. The only problem is we didn't pay attention to the places where zero and null interchanged. This happened because the two versions act differently where null values take places in aggregation functions. For example, when there is a sum function and it aggregates only on null values. In one version the output is zero and in the other the output is null...

This can also happen in other tools and technology. For example, in OLAP cubes the difference between zero and null is the difference between seeing the member of the dimension on the screen and not knowing of its existence.

For conclusion, always be aware to this point and don't forget to check it.

Monday, March 23, 2009 6:09:59 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, January 07, 2008

In the previous parts (1, 2) I showed how to connect Informatica with MS-OLAP, meaning that a mapplet can process cube or dimension. The thing is that I focused on the side of MS-OLAP. In the second part I even wrote the T-SQL code itself. Now I want to close the loop by describing what's going on in the Informatica side. This part was made by my friend, Alex, who permitted me to post here about what he did.

First of all, there's a table which contains the parameters to call with to the MS-OLAP procedure (object id, type, user name, etc.). This table is the source (& source qualifier, of course) of the mapplet. Each row in this table calls the stored procedure in the MS-OLAP side (in fact, the procedure is part of the relational DB, but never mind now). The call to the SP is made with Informatica's Procedure block. The connection is a regular ODBC connection, as mentioned in the previous part. Now for the interesting part: In the mapplet, the result of the procedure (zero for success, one for failure) goes into a Java Transformation block. This java block will fail the mapplet if one or more procedure calls returned failure.

How to do this java block? Double click on it to enter its properties. Go to the "Java Code" tab. There you'll see tab for every event in this block's life cycle. Here is the code for every tab (only the relevant ones):

Helper Code:

static int errorCounter = 0;
static Object lock = new Object();

On Input Row:

if (returnValue != 0)
{
 synchronized(lock)
  {
   errorCounter++;
  }
}

On End of Data:

synchronized(lock)
{
 if (errorCounter > 0)
  {
   failSession("OLAP Objects failed");
  }
}

Note that:

  • I'm not sure that the lock mechanism is required here. sync, lock, semaphore, etc. mechanisms are often used when atomic write is needed in order to solve problems like deadlocks, mutual write, etc. Here I simply don't care. Even if two parallel threads will read the errorCounter as zero and they both will increase it to one (when in fact it needs to have the value of two) it won't be a bug because the session will fail anyway. Alex & I need to talk about this point...
  • failSession is a function which is part of Informatica's API. As you might guess, it will fail the whole mapplet.
  • Very important: Calling all the MS-OLAP objects at once will cause an error in the Analysis Services server and all the objects will be in the state of Unprocessed. The Informatica side has to call the dimensions first and only then the cubes. The cubes must not be called all at once if they have relationships between them. This will cause deadlock too.
Tuesday, January 08, 2008 5:35:17 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, December 30, 2007

Just got home. Most of my day (and my co-worker's also) went on with a big installation of the second block of our BI project. In the morning we really thought that maybe this time, yeah - just this time things will go better. After more than 12 hours I laughing at myself: How could I be so naive? Many things that could go bad just did but after it all ended (with a happy ending, otherwise I wouldn't be here, writing in my home sweet home) I can say that the big blame is on Informatica PowerCenter. We're using version 8 of the software. It's not new software that started its way yesterday: It's a very old and familiar software. So how can it be that when we copy mapplets (ETL processes, for those of you who don't know Informatica) from one repository to another, some lines are just deleted from the mappings? After that you check your dimensions in MS-OLAP and you don't understand what happened there. A whole level in a big dimension that has only one member - 0 ?? Zero member is null. Yeah, we were right - the line in the mapping just been deleted by our precious Informatica so the column is all null.

Well, I happy we're through with this. Good night.

P.S.
Tomorrow I'm taking a day off... :-)

P.S 2

Although many things went wrong in the installation, I really think we had a good block this time. This block contains many beautiful things in MS-OLAP, MDX and Informatica. You'll see it here in the next few days, after I'll calm down. :-)

Monday, December 31, 2007 6:23:16 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, December 02, 2007

In the last post, I explained the architecture of our BI project. The final part of the process is processing MS-OLAP object (cube/dimension) from Informatica mapplet. As explained earlier, the trick is to call Stored Procedure from the Informatica server. But first there is one more thing to do: How do you connect the Informatica server (Linux) with MS-OLAP (windows server)?

Informatica ships with number of drivers that can connect it with other servers. The drivers are called DataDirect and I'll discuss 4.20. You need to define this driver on the Informatica server (look in Informatica knowledge base for more information). This is an easy thing to do. Notice that you have to enter a full server name (including domain) and the password. Remember that if you'll change the password in the future the process will fail. You have to enable the protocol named "Named Pipes" in the MS-OLAP server. How to do this? Enter the Configuration Manager in the MS-OLAP server and in the section of MSSQLSERVER protocols enable the Named Pipes protocol. This will enable the connection from the Informatica server. On the Informatica server, make a regular ODBC connection.

Here is the code of the SP on the MS-OLAP side. This SP must be on the msdb Database on the Database engine.

ALTER PROCEDURE [dbo].[ProcessObject]
@databaseId varchar(100),
@objectType varchar(100),
@objectId varchar(100),
@login_name varchar(100),
@returnValue int output,
@errorMessage nvarchar(1024) output
AS
BEGIN
declare @jobName varchar(200)
declare @xmla varchar(1000)
declare @jobId binary(16)
declare @ReturnCode int
declare @stop int

--Set job name
set @jobName = 'Process' + @objectType + '_' + @objectId

--Delete the job if already exists
if exists (select * from msdb.dbo.sysjobs where name = @jobName)
exec msdb.dbo.sp_delete_job @job_name = @jobName

--Create the job
Exec msdb.dbo.sp_add_job @job_name=@jobName
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'process OLAP object',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=@login_name, @job_id=@jobId output

exec msdb.dbo.sp_add_jobserver @job_name=@jobName, @server_name=@@SERVERNAME

--Declare XMLA for OLAP object
if (@objectType = 'Cube')
set @xmla = '

' + @dataBaseId + '
' + @objectId + '

ProcessFull
'
else if (@objectType = 'Dim')
set @xmla =



' + @dataBaseId + '
' + @objectId + '

ProcessFull


'
else
Begin
set @returnValue = 0
return @returnValue
End

--Add the job step
Exec msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Process Object',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'ANALYSISCOMMAND',
@command=@xmla,
@server=@@SERVERNAME,
@database_name=N'master',
@flags=0

--Run the job
Execute sp_start_job @jobName

Waitfor delay '00:00:05'
set @returnValue = (select run_status from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)

-- Loop until the job ends and return its result
set @stop = 0
if @returnValue is null
while @stop <> 1
Begin
set @returnValue = (select run_status from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)

if @returnValue is not null
set @stop = 1

waitfor delay '00:00:10'
End

--Return error message (if exists)
If @returnValue = 0 --failed
set @errorMessage = (select [message] from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)
End

update: I see that the xmla code went bad in the post because it is not recognised html code. It doesn't matter, I believe you got the point...

Sunday, December 02, 2007 5:59:42 PM (Jerusalem Standard Time, UTC+02:00)
 Saturday, December 01, 2007

In the past I mentioned some fragments of the architecture of our end-to-end BI solution. Now I'll discuss how it is done. I will only write about the things that I done (I mean, developed) but I'll describe the whole picture.

Our architecture goes like this: Control M -> Informatica -> MS-OLAP (Analysis Services 2005).

In words: ControlM is the most common scheduler in big companies. We use it to schedule our ETL processes in Informatica. Our system team made it possible to start Informatica processes from ControlM. I don't know exactly how it is done. All I know is that ControlM raises a flag in a table, and Informatica scans the table every X seconds and start the process if it finds the flag that was raised by ControlM. Don't ask me about the technical details - it wasn't my job.

The more interesting thing (because I did it...) is how Informatica calls MS-OLAP and tells it to process a cube. In this part I'll describe the big picture and in the next one I'll give some of the code and discuss some technical views of the process. First, the Informatica mapping moves the data from the source to the target, which is the dimension or fact table just like it always does. After that, Informatica calls a Stored Procedure on the MS-OLAP server which process the cube. Informatica calls this SP with some parameters, including the object type to process (cube/dimension), its ID and some more parameters. In return, the MS-OLAP returns return code (in order to point whether the process succeeded) and message describing the error if it occurred.

How the SP process the cube/dimension? Unfortunately, there is no SP that can process OLAP object so I needed to use the following steps in my SP:

  1. Delete any existing job that does the same action (read on, you'll understand)
  2. Create an empty job
  3. Add a step to that job that will process the object. This step contains XMLA code that contains the parameters that were given from Informatica
  4. Run the job
  5. Loop until the job (or process) ends and send back the return code and the error message, if exists.

In the next part I'll write some of the code and discuss some technical issues.

 

Sunday, December 02, 2007 5:19:47 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 22, 2007
I believe that every BI developer seen this in many Data Warehouses: Boolean Dimensions. As you may guess, boolean dimension is a dimension with only two members and of course with no hierarchy. For example: cash/credit card in sales cube, exists/not exists in inventory cube, etc. If you haven't seen this phrase before - relax - I just invented it. :-)
Now, the question is what to do about these dimensions:
a. Include them in the ETL process or just leave it as is?
b. If you put it in the ETL - how would you implement it?

Here's what I did in my project. You may disagree with me and I would like to see other approaches too.
a. Yes, I included it for some reasons. As every Pragmatic Programmer knows, everything can be changed so do not assume anything as globally-static. This rule takes place in here: Boolean dimensions may grow and have more members. For example, in the sales cube I mentioned above, maybe there will be another way to pay such as exclusive card of the shop (There is a network here in Israel who has it). Even male/female boolean dimension may have an Unknown member. So never exclude these dimensions from your ETL process. Wait - one more thing. You may think: Why interrupt my ETL process with these silly dimensions? If they'll grow up I'll add them to the process. As an answer think about the timings: You can never know how much time the dimension's ETL will take (although it will be very small), so in order to stay away of surprises - include it in your ETL process. just for case.
b. I implemented it as two hard-coded expressions and sent them to union. The result of this union will enter directly to the target table. In Informatica, the mapplet can't start without source table so just put a dummy table with only one row and connect it to the expression items. Why only one row? If the table will contain more than two rows then the Informatica server will consider the process as failed one.

As I said, I'll be happy to read other approaches other than mine.

Monday, October 22, 2007 7:44:45 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 15, 2007
I really think that the time dimension is the most complex dimension in 90% of the DWHs. The complexity is in two places: In the DWH design and also in Analysis Services (or any other BI tool).
First of all - why we didn't take the already-made Server Time Dimension which exists in SSAS 2005? For two reasons: The first is that the Project Real guys do not recommend using it (you can find their SSAS article here). The second is that we wanted to have some features that are not available in the server time dimension, such as Hebrew date. In a matter of fact, even if we didn't have such feature we still would build the time dimension ourselves because it's giving you much more control over the dimension. For example, you can always add some new attributes which Microsoft developers didn't think about.
I started myself to build the time dimension in excel. I figured out that this mission is little more complex that I thought it would. Most of the functions I wrote were simple, but there were some complicated ones. So here are some tips for you if you want to build your time dimension using Excel:
  • If you want to week number for every date, do not write the function yourself... Excel has function called weeknum. If you don't have it just add the function toolbox which has it (I can't recall its name right now. check in excel help).
  • If you want to have records for every level in your hierarchy (not only for days), put every level in its own excel file (not excel tab). It will help you later when you will transfer it to your DB.
  • Check yourself. Pick randomly some dates and check that all of its record has correct data.
After building the excel files I needed to transfer it to my Oracle server. I used SSIS because I didn't want to wait for my DBA to copy these files into the Informatica server (it can't use my the local files, it has to be in its server. SSIS can use local files). This also was a little tricky. First of all, close excel when running the SSIS packages, otherwise it will fail. Second, when moving the non-leaf levels, go into the columns section in the destination box and erase the irrelevant columns. It will reduce the chance for errors. Finally, click on the source box and click on "Show advanced editor". Enter the source's output columns options and define properly the columns' data types. This also will reduce the chance for errors.

I had a little bit of an argument with my DBA about how should the time dimension be. I think that the time dimension does not have to be processes at all. My time dimension is from 1960 until 2020, so no daily ETL is required. She says that all the logic has to be in Informatica so I need to develop a mapping for this dimension. I think that we both are right and that's because that in ideal world she is right. In every developers team, all the BL has to be in one place. But we don't have much time (the deadline is very close) so I won't spend the time building more mapping in Informatica when I have the time dimension already made in excel.

Maybe someday I will have the time to do this. Maybe not.

Monday, October 15, 2007 8:18:38 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, October 14, 2007
I guess that this will not be my last post on this subject, but I want to start sharing some thoughts and tips from my experience when designing and building DWH. In this post I will focus on the fact & dimensions tables relationship in terms of data completeness (if you wonder what it is, read on).

Before you start to design the DWH, sit and talk with the people who built the systems which you take your data from, including the DBA. For every table, ask them what is the primary key (it's NOT always defined properly in the DB), then ask them again and then ask them if they are sure. It happened to me that I discovered that the systems guys were wrong about their DB's primary keys.
The same thing is about Foreign keys and here you should be even more careful. Even if they claim so, check yourself that every foreign key in the fact table is placed correctly in the dimension table, especially when the fact table has far history records. Sometimes system developers or even worse - system DBAs delete records from the dimension tables that are not relevant. This will cause that these keys will still be in the fact's history records but will not be found in the dimension table, causing uncomplete relationship between the fact and the dimension table.

So far is about the part when you talk and "investigate" the system developers (the DWH design). What to do when you actually developing the DWH? First, develop the tables of your dimensions tables. Do not forget to add the primary keys in the dimensions tables and the primary and foreign keys in the fact table. Then develop the ETL processes and go for the dimensions first.  If you know that the dimension has completeness problems with the fact table that you will develop later (you talked with the system developers, remember?), add UNDEFINED (UD key) record for the dimension table. Later, when developing the fact table's ETL process, make Join with the dimension table and check that the records' foreign key exists there. If not - change the key to UD. In SSIS and Informatica (and I guess that also in other products I don't know, such as DataStage) you can use Lookup instead of Joiner if the dimension table is less that 1G records. That will optimize the ETL process. After you developed all your ETLs, run the dimension processes. After they finish (assuming everything went OK) run the fact table's ETL process. If it succeeded you can go and have a drink. If not - check what went wrong. If you want to know which keys didn't showed up in the dimension table and causes the incompleteness problem, you can disable (not delete) the foreign key from the fact table and run the process again. Then, with a simple SQL query, check which foreign keys don't exist in the dimension table. Go back to your ETL design and check what you did wrong. As I pointed before, in this step you might be very angry at the system developers...

That is all for now. As I said, I assume that more ideas will come on in the future.

Sunday, October 14, 2007 7:11:49 AM (Jerusalem Standard Time, UTC+02:00)