Sunday, December 02, 2007

In the last post, I explained the architecture of our BI project. The final part of the process is processing MS-OLAP object (cube/dimension) from Informatica mapplet. As explained earlier, the trick is to call Stored Procedure from the Informatica server. But first there is one more thing to do: How do you connect the Informatica server (Linux) with MS-OLAP (windows server)?

Informatica ships with number of drivers that can connect it with other servers. The drivers are called DataDirect and I'll discuss 4.20. You need to define this driver on the Informatica server (look in Informatica knowledge base for more information). This is an easy thing to do. Notice that you have to enter a full server name (including domain) and the password. Remember that if you'll change the password in the future the process will fail. You have to enable the protocol named "Named Pipes" in the MS-OLAP server. How to do this? Enter the Configuration Manager in the MS-OLAP server and in the section of MSSQLSERVER protocols enable the Named Pipes protocol. This will enable the connection from the Informatica server. On the Informatica server, make a regular ODBC connection.

Here is the code of the SP on the MS-OLAP side. This SP must be on the msdb Database on the Database engine.

ALTER PROCEDURE [dbo].[ProcessObject]
@databaseId varchar(100),
@objectType varchar(100),
@objectId varchar(100),
@login_name varchar(100),
@returnValue int output,
@errorMessage nvarchar(1024) output
AS
BEGIN
declare @jobName varchar(200)
declare @xmla varchar(1000)
declare @jobId binary(16)
declare @ReturnCode int
declare @stop int

--Set job name
set @jobName = 'Process' + @objectType + '_' + @objectId

--Delete the job if already exists
if exists (select * from msdb.dbo.sysjobs where name = @jobName)
exec msdb.dbo.sp_delete_job @job_name = @jobName

--Create the job
Exec msdb.dbo.sp_add_job @job_name=@jobName
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'process OLAP object',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=@login_name, @job_id=@jobId output

exec msdb.dbo.sp_add_jobserver @job_name=@jobName, @server_name=@@SERVERNAME

--Declare XMLA for OLAP object
if (@objectType = 'Cube')
set @xmla = '

' + @dataBaseId + '
' + @objectId + '

ProcessFull
'
else if (@objectType = 'Dim')
set @xmla =



' + @dataBaseId + '
' + @objectId + '

ProcessFull


'
else
Begin
set @returnValue = 0
return @returnValue
End

--Add the job step
Exec msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Process Object',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'ANALYSISCOMMAND',
@command=@xmla,
@server=@@SERVERNAME,
@database_name=N'master',
@flags=0

--Run the job
Execute sp_start_job @jobName

Waitfor delay '00:00:05'
set @returnValue = (select run_status from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)

-- Loop until the job ends and return its result
set @stop = 0
if @returnValue is null
while @stop <> 1
Begin
set @returnValue = (select run_status from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)

if @returnValue is not null
set @stop = 1

waitfor delay '00:00:10'
End

--Return error message (if exists)
If @returnValue = 0 --failed
set @errorMessage = (select [message] from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)
End

update: I see that the xmla code went bad in the post because it is not recognised html code. It doesn't matter, I believe you got the point...

Sunday, December 02, 2007 5:59:42 PM (Jerusalem Standard Time, UTC+02:00)
 Saturday, December 01, 2007

In the past I mentioned some fragments of the architecture of our end-to-end BI solution. Now I'll discuss how it is done. I will only write about the things that I done (I mean, developed) but I'll describe the whole picture.

Our architecture goes like this: Control M -> Informatica -> MS-OLAP (Analysis Services 2005).

In words: ControlM is the most common scheduler in big companies. We use it to schedule our ETL processes in Informatica. Our system team made it possible to start Informatica processes from ControlM. I don't know exactly how it is done. All I know is that ControlM raises a flag in a table, and Informatica scans the table every X seconds and start the process if it finds the flag that was raised by ControlM. Don't ask me about the technical details - it wasn't my job.

The more interesting thing (because I did it...) is how Informatica calls MS-OLAP and tells it to process a cube. In this part I'll describe the big picture and in the next one I'll give some of the code and discuss some technical views of the process. First, the Informatica mapping moves the data from the source to the target, which is the dimension or fact table just like it always does. After that, Informatica calls a Stored Procedure on the MS-OLAP server which process the cube. Informatica calls this SP with some parameters, including the object type to process (cube/dimension), its ID and some more parameters. In return, the MS-OLAP returns return code (in order to point whether the process succeeded) and message describing the error if it occurred.

How the SP process the cube/dimension? Unfortunately, there is no SP that can process OLAP object so I needed to use the following steps in my SP:

  1. Delete any existing job that does the same action (read on, you'll understand)
  2. Create an empty job
  3. Add a step to that job that will process the object. This step contains XMLA code that contains the parameters that were given from Informatica
  4. Run the job
  5. Loop until the job (or process) ends and send back the return code and the error message, if exists.

In the next part I'll write some of the code and discuss some technical issues.

 

Sunday, December 02, 2007 5:19:47 AM (Jerusalem Standard Time, UTC+02:00)
 Wednesday, November 21, 2007

I never thought of it until one of my users said it. Sorting the KPI can be a very good idea. Instead of always having the same indicators (gauges, traffic lights, etc.) in the same position of the screen, sorting it can make the viewer expect that the most relevant indicator will be placed in the top-left corner of the screen, the second most-relevant will be placed after it, etc.

Sorting the KPI is a very easy thing. Every end user, even with no clue in MDX can do it by following this:

In the Define KPI wizard, go to the "Select Set" step. Copy the current set. For the example, let's say that the current set is [Products].Members and you want to sort it according to the Sales measure in descending order. Click on the advanced button on the right of the set (The button with the "..." on it) and enter the following MDX statement:

Order([Products].Members, [Measures].[Sales], DESC)

If you want to order in ascending order you can replace the DESC with ASC or not to mention it at all.

 |  | 
Thursday, November 22, 2007 6:24:44 AM (Jerusalem Standard Time, UTC+02:00)

If you see "No Data" after you entered a view with defined KPI, most chances that the reason has something to do with privileges, but today I've seen one more thing you can do that will make the KPI go crazy.

One of my users built a view and he removed all the measures but one. After that, when he defined the set of his KPI, he chose the set of the measures. That caused the "No Data" title when he wanted to see his KPI view.

 | 
Thursday, November 22, 2007 5:45:19 AM (Jerusalem Standard Time, UTC+02:00)

One of my users had a very weird problem today. When he entered a view with KPI gauges, he could see everything but the gauges. All was there: The titles, numbers, etc. but the gauges itself. The problem occurred in the Panorama Web Access site and also in the dashboard site. I checked with other users and they didn't have this problem (with the same views, of course).

After a few minutes I found the problem: The Explorer process in windows used too much memory and it caused visual problems in the browser. I ended the process, restarted it (Ctrl + Shift + Esc -> File -> New task -> explorer) and everything went back to normal.

 | 
Thursday, November 22, 2007 5:39:21 AM (Jerusalem Standard Time, UTC+02:00)

Important note: The user working with Panorama NovaView Desktop must have write privileges on the Panorama folder (The default is: C:\Program files\Panorama). The program saves its data there, so it will cause many troubles if it won't be able to save. For example, when you start the program and click on the globe (work on briefing book from the server) you'll have to enter the panorama's server name every time you start the program (if you don't have the mentioned privilege).

Don't worry: The user watching the views doesn't need to have any privilege on its computer. He only needs the right to see the view or the dashboard page.

 | 
Thursday, November 22, 2007 5:27:33 AM (Jerusalem Standard Time, UTC+02:00)

I had a little challenge back in work and I solved it (in a couple of hours). Here is the description of the problem and its solution:

Let's say that I have a Products dimension, time dimension and a fact table that describes all the faults which occurred in these products. The new requirement is that: Given a product-tree that describes all the parts of every product, I want to know how many faults occurred to every part. The problem is that the fact table points only to the products and the customer wants to know how many faults happened to the parts.

The two solutions I thought about are:

1. The trivial solution: Build a view above the fact table that takes every row and adds a row for each of its product's parts. That way, the fact will contain every fault that happened to every part. The problem with this solution is that the view is very long to compute.

2. The good solution: Build a parent-child dimension out of the parts table which will describe all the parts of every product. Notice that this dimension is not ragged, meaning that one member can have 2 children while another member in the same level can have 10 children. The next step is to add this MDX Script:

Calculate;


Scope([Measures].[Faults],
         Descendants([Products].[Products].[All],
                           1,
                           After));


   This = Ancestor([Products].[Products].CurrentMember,
                         [Products].[Products].[Level 02]);


End Scope;


Scope([Measures].[Faults],
         [Products].[Products].[Level 02].Members);


   This = ([Products].[Products].CurrentMember)
             -
             Sum([Products].[Products].CurrentMember.Children);


End Scope;

Explanation: The first level of the dimension is the [All] member. The second level is the products and the other levels contain the parts. The first block takes all the parts and inherits their amount of faults from their product ancestor. The second block solves the aggregation problem: Let's say that The bike product has 3 children. The bike had 4 faults in our slice of time, so according to the first block of the script every child has 4 faults. Now, the cube makes its aggregation and now the bikes has 16 faults - 4 of its own and more 4 for every of its parts. The second block decreases the sum of the product's children from the products and the result is that the products have their original number of faults.

 | 
Thursday, November 22, 2007 5:12:21 AM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, November 06, 2007
This very useful option in Panorama can prevent much pain. Often, you don't want your CEO(s) to play with the views you created in Panorama. The beautiful dashboard page you created is what you want them to see and that's it.
In the dashboards site, select the component holding the desired view and in its Toolbar options, check the "Disable Analysis" checkbox. This will prevent the user from slicing and dicing with your view.

 | 
Wednesday, November 07, 2007 5:10:19 AM (Jerusalem Standard Time, UTC+02:00)
After our first MS-OLAP first installation, I started asking myself some questions: The way we moved our cubes to the production server is by the Visual Studio. We just deployed the cubes into the production server. The problem was that we forgot that we had some changes done in the XMLA code that lies behind one of our cubes. That caused some troubles in the installation and the result was that we did some changes in the XMLA script in the production server after the installation...
So, how we should make the installation? Should we export the whole database and import it in the production server, or should we generate XMLA scripts and run it in the production server? Is there really a difference between these choices?
If someone has an answer I would be very happy to read it. Thanks...

Wednesday, November 07, 2007 4:17:55 AM (Jerusalem Standard Time, UTC+02:00)
Yesterday, my team installed our first end-to-end BI project which includes many familiar technologies (for us, of course) such as Informatica, Oracle and Control-M along with new technologies (again - for us) such as Analysis Services 2005 and Panorama. This is the first project we have with SSAS 2005. Cheers for us.

We had some failures along the way, so we sat down today with our DBA team in order to investigate the good and bad things we had in the installation. The most important conclusion we got to is that the installation documentation just wasn't good enough. We did it quick and quite dirty because we thought that its purpose is to keep up with the QA checks. We forgot some important issues and wrote down the other things in a shoddily way. My conclusion is this: First, write an installation document and keep attached to it when you're doing the installation even if you sure that you know the drill 'cause you did it a hundred times. Second, write a good installation document for you, your mates, and the future co-workers to come. Third, write everything you did in the installation if it wasn't in the document.

Don't ignore the installation document. It can make the difference between success and failure in the installation.

Wednesday, November 07, 2007 4:09:55 AM (Jerusalem Standard Time, UTC+02:00)
It took me a while (more than 15 minutes!) to find something so elementary. The thing is how to call a sleep function in T-SQL in SS2005. The statement is: Waitfor Delay <DelayLength>, where <DelayLength> can be in the format: '00:00:10' for ten seconds. the <DelayLength> can be also a parameter of the type char(8).

Wednesday, November 07, 2007 3:40:13 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 29, 2007

My brother-in-law got a new Wii console last week and I played with it for a while. After two days I just got enough of it. The Wii is a cool thing, but it's mostly a gimmick. The games are not as fun as usual PC games and don't have a long life-time. I'm planning to buy a new PC with a good VGA card in the next week, so I'll can play good games on my own. Don't waste your money on Wii.

Tuesday, October 30, 2007 4:09:04 AM (Jerusalem Standard Time, UTC+02:00)

As you can see from the title, we use VSS with AS 2005. The reason for this is that we want to have backup and source control in our AS projects. There are some interesting points when working that way which I would like to share:

  • When you process a cube in a VSS-contolled AS project, the compiler writes the exact time and hour of the last process in the .database file in the solution. This file is a small and simple XML file. This piece of data is written in the <LastProcessed> element. That is the reason why VSS will ask to check-out this file every time you will process a cube. You can ignore it and cancel it twice and the cube will be processed anyway. Of course, the last processing time will not be correct.
  • The .dwproj file is the binder of the solution (as in regular code project). That means that Visual Studio will update this file every time you add, delete or rename an object to the solutions. When two (or more) developers doing these king of actions together it will cause a disaster. Only the objects of the developer who checked-in last will be saved in the project. What you need to do is manually edit this file and add the object's elements yourself. That's not very hard, but you'll need to concentrate. As always, I recommend using Notepad++.
  • When editing the .cube file yourself (when merging manually after two developers worked on the same cube), make sure that the dimensions ID's in the .cube file match the ID in the .dim files.
  • AS projects can be easily destroyed because of an error in a VSS-related decision. Think twice when clicking in its annoying dialog boxes. Make a label every day after building, deploying and processing the project.
Tuesday, October 30, 2007 4:03:45 AM (Jerusalem Standard Time, UTC+02:00)

We use VSS with Analysis Services 2005 in order to have source control and backup (I think this deserves a post of its own). We've been shocked to find out that our VSS didn't save our labes and our versions. The defect we experienced is that after clicking on the history button and then on the OK button in the "choose date range" dialog box, nothing happened. We were sure that no history was saved and that is the reason that nothing happened. After a long search in the web, I found this thread in the MSDN. The problem is a bug in VSS that has something to do with dates representation in Microsoft's code. The workaround is easy, but it seems to work only in Windows 2000: Open the VSS admin program, then Options -> Tools -> TimeZone, and set the time zone as none. That's it.

By the way, notice that my second post in my blog was about date formats and how tricky and dangerous they can be.

Tuesday, October 30, 2007 3:43:05 AM (Jerusalem Standard Time, UTC+02:00)

This is the second bug I found when working with AS and Oracle as my database (you can read about the first one here). Some background about our Datawarehouse architecture before I begin to complain about Microsoft:

We cannot afford risking that our users will experience faults or crashes while we refresh our DWH, process our cubes, etc. What we do to solve this problem is called a switch: We have two schemas in our Oracle DB. While our users watch the first schema, we update the other one. Only after we finish all our load process, we switch our user's tools to see the data from the second schema. In order to implement this architecture we use Oracle's synonyms. Let's say that the users watch the fact table "sales". We have a synonym which is called sales_syn. While it's pointing to the first schema (schema_a.sales_fact), we're loading into the second schema (schema_b.sales_fact). After that, we switch the synonym so it will point to the second schema (schema_b). The users always look on views that rely on synonyms. The views never change, only the synonyms do.

The problem starts when we make Named Queries in our Data Source Views in AS. Apparently, AS looks inside the view that we enterd into the DSV, finds the target of the synonym and saves it. Even after we make the switch (the synonym points on the other schema), the named query will be pointing on the first one. All our efforts to edit the named query have failed. When we open it again, we will always find that the first schema is "burned" there and cannot be changed.

The solution we chose is simply avoid using named queries. If we need a simple calculation we can add a Named Calculation in the Data Source View and if we need a complex view over our fact table, we write it ourself in the DB. This solutions breaks a little bit the main point of the data source view (designated place for all the logic of the DWH), but it is the best solution we could think of right now.

Dear Microsoft developers - it seems that you tried to be smart and look into our Oracle objects in order to enhance the multidimensional database's performance. Next time, please think twice before you do.

Tuesday, October 30, 2007 3:24:36 AM (Jerusalem Standard Time, UTC+02:00)

It seems that four days before I wrote my post about Panorama Hidden Settings, Panorama entered all the registry keys into their knowledge base. You can find them here. Strage: I seeked it for a long time and now I see that it was always there, under my nose.

 | 
Tuesday, October 30, 2007 2:58:13 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 22, 2007
If you read my blog from my home page and not via RSS or RSS-based sites, you may see it in the right column of the web page. You can click there and I will have a call from you (if you have Skype installed). If you need any help or explanation about BI stuff, just click and ask. Oh, by the way, you need to have some cash in your Skype account in order to pay.
I'm not trying to be greedy. I'm just trying to earn a little bit of money from my knowledge. If you need a little assistance regarding to BI, SQL Server or Panorama - ask me. If the answer will be quick I will not charge you at all.

So, pick up the phone... ;-)

Monday, October 22, 2007 7:51:52 AM (Jerusalem Standard Time, UTC+02:00)
I believe that every BI developer seen this in many Data Warehouses: Boolean Dimensions. As you may guess, boolean dimension is a dimension with only two members and of course with no hierarchy. For example: cash/credit card in sales cube, exists/not exists in inventory cube, etc. If you haven't seen this phrase before - relax - I just invented it. :-)
Now, the question is what to do about these dimensions:
a. Include them in the ETL process or just leave it as is?
b. If you put it in the ETL - how would you implement it?

Here's what I did in my project. You may disagree with me and I would like to see other approaches too.
a. Yes, I included it for some reasons. As every Pragmatic Programmer knows, everything can be changed so do not assume anything as globally-static. This rule takes place in here: Boolean dimensions may grow and have more members. For example, in the sales cube I mentioned above, maybe there will be another way to pay such as exclusive card of the shop (There is a network here in Israel who has it). Even male/female boolean dimension may have an Unknown member. So never exclude these dimensions from your ETL process. Wait - one more thing. You may think: Why interrupt my ETL process with these silly dimensions? If they'll grow up I'll add them to the process. As an answer think about the timings: You can never know how much time the dimension's ETL will take (although it will be very small), so in order to stay away of surprises - include it in your ETL process. just for case.
b. I implemented it as two hard-coded expressions and sent them to union. The result of this union will enter directly to the target table. In Informatica, the mapplet can't start without source table so just put a dummy table with only one row and connect it to the expression items. Why only one row? If the table will contain more than two rows then the Informatica server will consider the process as failed one.

As I said, I'll be happy to read other approaches other than mine.

Monday, October 22, 2007 7:44:45 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 15, 2007
I really think that the time dimension is the most complex dimension in 90% of the DWHs. The complexity is in two places: In the DWH design and also in Analysis Services (or any other BI tool).
First of all - why we didn't take the already-made Server Time Dimension which exists in SSAS 2005? For two reasons: The first is that the Project Real guys do not recommend using it (you can find their SSAS article here). The second is that we wanted to have some features that are not available in the server time dimension, such as Hebrew date. In a matter of fact, even if we didn't have such feature we still would build the time dimension ourselves because it's giving you much more control over the dimension. For example, you can always add some new attributes which Microsoft developers didn't think about.
I started myself to build the time dimension in excel. I figured out that this mission is little more complex that I thought it would. Most of the functions I wrote were simple, but there were some complicated ones. So here are some tips for you if you want to build your time dimension using Excel:
  • If you want to week number for every date, do not write the function yourself... Excel has function called weeknum. If you don't have it just add the function toolbox which has it (I can't recall its name right now. check in excel help).
  • If you want to have records for every level in your hierarchy (not only for days), put every level in its own excel file (not excel tab). It will help you later when you will transfer it to your DB.
  • Check yourself. Pick randomly some dates and check that all of its record has correct data.
After building the excel files I needed to transfer it to my Oracle server. I used SSIS because I didn't want to wait for my DBA to copy these files into the Informatica server (it can't use my the local files, it has to be in its server. SSIS can use local files). This also was a little tricky. First of all, close excel when running the SSIS packages, otherwise it will fail. Second, when moving the non-leaf levels, go into the columns section in the destination box and erase the irrelevant columns. It will reduce the chance for errors. Finally, click on the source box and click on "Show advanced editor". Enter the source's output columns options and define properly the columns' data types. This also will reduce the chance for errors.

I had a little bit of an argument with my DBA about how should the time dimension be. I think that the time dimension does not have to be processes at all. My time dimension is from 1960 until 2020, so no daily ETL is required. She says that all the logic has to be in Informatica so I need to develop a mapping for this dimension. I think that we both are right and that's because that in ideal world she is right. In every developers team, all the BL has to be in one place. But we don't have much time (the deadline is very close) so I won't spend the time building more mapping in Informatica when I have the time dimension already made in excel.

Maybe someday I will have the time to do this. Maybe not.

Monday, October 15, 2007 8:18:38 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, October 14, 2007
I guess that this will not be my last post on this subject, but I want to start sharing some thoughts and tips from my experience when designing and building DWH. In this post I will focus on the fact & dimensions tables relationship in terms of data completeness (if you wonder what it is, read on).

Before you start to design the DWH, sit and talk with the people who built the systems which you take your data from, including the DBA. For every table, ask them what is the primary key (it's NOT always defined properly in the DB), then ask them again and then ask them if they are sure. It happened to me that I discovered that the systems guys were wrong about their DB's primary keys.
The same thing is about Foreign keys and here you should be even more careful. Even if they claim so, check yourself that every foreign key in the fact table is placed correctly in the dimension table, especially when the fact table has far history records. Sometimes system developers or even worse - system DBAs delete records from the dimension tables that are not relevant. This will cause that these keys will still be in the fact's history records but will not be found in the dimension table, causing uncomplete relationship between the fact and the dimension table.

So far is about the part when you talk and "investigate" the system developers (the DWH design). What to do when you actually developing the DWH? First, develop the tables of your dimensions tables. Do not forget to add the primary keys in the dimensions tables and the primary and foreign keys in the fact table. Then develop the ETL processes and go for the dimensions first.  If you know that the dimension has completeness problems with the fact table that you will develop later (you talked with the system developers, remember?), add UNDEFINED (UD key) record for the dimension table. Later, when developing the fact table's ETL process, make Join with the dimension table and check that the records' foreign key exists there. If not - change the key to UD. In SSIS and Informatica (and I guess that also in other products I don't know, such as DataStage) you can use Lookup instead of Joiner if the dimension table is less that 1G records. That will optimize the ETL process. After you developed all your ETLs, run the dimension processes. After they finish (assuming everything went OK) run the fact table's ETL process. If it succeeded you can go and have a drink. If not - check what went wrong. If you want to know which keys didn't showed up in the dimension table and causes the incompleteness problem, you can disable (not delete) the foreign key from the fact table and run the process again. Then, with a simple SQL query, check which foreign keys don't exist in the dimension table. Go back to your ETL design and check what you did wrong. As I pointed before, in this step you might be very angry at the system developers...

That is all for now. As I said, I assume that more ideas will come on in the future.

Sunday, October 14, 2007 7:11:49 AM (Jerusalem Standard Time, UTC+02:00)