Thursday, June 23, 2011

The problems: A. Storage costs money, much money. B. Today's storage is very slow and we want fast retrieve of data.

The Solution: Take the commonly-used data and put it in very fast and expensive storage. Take the less used data and put it on slower and cheaper storage. The overall budget should be much lower.

Link (pdf)

Friday, June 24, 2011 3:42:33 AM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, March 29, 2011

We're doing something very useful at our workplace and because I believe in collaboration I would like to share it with you, dear readers. I'm sure it will help a lot of system analysts and developers.

When we write down the design of the ETL process, especially regarding complex fact tables, we find it necessary to write it down as a flowchart. This creates a simple and common language between the designer/analyst and the developer and even with the customer. It's much easier for everyone to read a flowchart rather than reading a long and complex document. Therefore, we create diagrams such as these using Visio (this diagram was created using Google Docs, but it's preety much the same):

I only brought here the beginning of the diagram. The real one is 3 times bigger.
Note that every leaf in the tree has its own number. These numbers indicates the record's "case number". Now for the trick: In the target table (fact, dimension or anything else) we add a new column that contains the number of the case, according to the numbers shown in the diagram. This field is not exposed to the end-user. The pros of this trick are very significant:

  • Already in the early stages of the implementation we can check the statistics, meaning how many times each case happens and see if we got what we expected. This can help us find new things that we (and sometimes even the users) weren't aware of and find bugs in early stage of the development.
  • Each record can "tell its story" and tell us "why it was created". This saves us a lot of effort when we want to question the data and find the source of the problems.
  • During the tests we check two different things: Whether each record was "identified" correctly by the ETL (by checking the case number) and whether the values entered into the other fields are correct. While once we used to check only the values correctness and then find the problems, now we can detect the problems much faster.
After we came up with this idea, I thought of another one: Why won't we expose this new field to the users/analysts and let them "enjoy" this knowledge? Maybe these statistics can also help them with the data analysis? This requires us to build "cases" dimension and add a description for every case number. I don't think that this is a good idea in any circumstances, but you should know that it's an option.

Tuesday, March 29, 2011 11:20:39 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, March 07, 2011

This post is the fifth (and last, I think) in a series about HR BI systems. In the previous post I've shown how to make the BI system smarter by cleaning records and finding Super-Transactions. In this post I'll complete a few last points and introduce you to some interesting reports I built.

Finding Super Transactions

In the last post, I introduced you the concept of Super Transactions (ST). I explained how to make the ETL process find them in the raw data and how to model them in the DWH and let the end-users use them in their analysis. However, I didn't explain how to find them in the initial design phase.

Objective: Find the material processes that happens to many people.
The main problem: There are many people and we don't know where to start looking! We should think about a common denominator to all the processes in order to find them.
The key observation: People are boring. Meaning, in terms of their statuses or various trailers. Think about your personal diary (in terms of what we talked about in the previous posts), or the "log" of your life. Think how many times you changed your martial status, salary, workplace status, etc. Yes, you have no more than a few records. You change something only once in a couple of years (in the best case).
On the other hand, when an employee undergoes a material change (ST), there are many changes that happens (almost) at once in different appendixes (or subjects). When the student changes his position in the university, he changes in short time his position, salary, courses and more.
Let's show it in a simple graph. The X axis is the time line and the Y axis is the number of appendixes that changed:


This shows us that most of the time our life is static and suddenly you find there are "peaks" that indicates that something "big" happened. This is where we'll look for the STs.
How we'll find them? I'll take you back to the advanced modeling of the project. For every record in the diary we added end_effective_date column which has the date of the next record minus one day. That enable us to write between statements and make our life more simple. Back to our case: using analytic functions, I added a field that says how much time passed between the current and the previous record. As I said, the number are very large in most cases. The cases where there's some records with small numbers are the cases we are looking for. It says that something important happended. Using this method I found eight out of nine (!!) which implemented in the system.

Incremental Load

In an early stage of the project we realized that it will be very "unwise" to load all the people all the time. As I mentioned, most of the people don't change every week. The problem is that incremental load of the diary ("log") of people is not simple at all. The first reason is that we need to update the end_effective_date and ind_current of the last row, but that's not difficult. The second and main reason is that updating and adding one record is not enough: in case that the man was in a middle of a ST we need to re-calculate the ST. This is why we decided to completely delete the records of the people that changed (in one or more appendixes) and rebuild them from scratch. This is why we run the ETL only on for the people that changed. This solution is very simple and easy to implement. All we need to do is to delete these people's records from all three diaries and then run the ETL only on them. Thus, the ETL process is very short (half hour total) in every night and only the first run (on all people) is (very...) long.

Interesting Reports

I'll show here some interesting reports I built for the users in order to stimulate your thinking:

  • Organizational Pyramid - the ratio, over the years, between the number of different populations. For example, the ratio of senior lecturers, lecturers and students. The report is shown as stacked graph bar. The graph answers the basic question: Do we have enough students to make them enough teachers in the coming years? Do we have enough lecturers to make them enough senior lecturers in the coming future? In this picture I built a very similar graph in Excel that shows how many students of every year we have.

  • Students Rift - For every faculty, the gap between the number of students that the university wanted to have and the actual number of students. You can change the time parameter and see the screen as it was on another date. You can also drill-down on every faculty and see the raw numbers.
  • Number of Promotions & Retirements - For every year, how many promotions & retirements were made. You can drill-down the numbers and see the names and details of the people.
  • Grades Graph - Classic line graphs othat shows grades for a lecturer or faculty over the timeline. Excellent tool to discover interesting trends such as professors who write too difficult exams or decrease in the academic level of a faculty.

To conclude: I hope you enjoyed the posts about BI HR. I got some mails regarding these posts and I'm glad to hear that I'm helping people here in Israel and even over the world.

Monday, March 07, 2011 9:31:10 PM (Jerusalem Standard Time, UTC+02:00)
 Wednesday, January 05, 2011

This post is the fourth in a series about HR BI systems. In the previous post I've shown to advanced model of the solution, the ERD of the datawarehouse and I shown how to answer most of the business questions using that model. In this post I'll show a more complex issue which is smart records filtering and Super Transactions. I will introduce you to this subject and show you how I implemented it.

To better understand this post, I highly recommend you to read the previous post first. Let's summarize the last post: We reached to the conclusion that in order to answer most of the business question regarding HR, we need to build a very wide Fact table (meaning that it will have many fields) that contains all the relevant fields from the appendixes. I introduced you to the problem of irrelevance and I shown the way to solve it using the N/A value instead of null when the appendix hasn't started yet.

Actually, after doing what I described in the last post we finished 90% of the project. Now, only the other 90% remains... ;-) The thing is that another main requirement remains (although basically it was our idea). The reason for which I didn't mention it in the previous posts is that I didn't want to bring in this complex issue in the first steps. I wanted you, my reader, to stay focused on the main structure of the solution. The requirement is to clean and improve (from the data point-of-view) the personal diary records and identify personal processes. I will explain later what this means. Note that at this point the example of the university is not so relevant so I apologize for the weird examples.

Cleaning Records

For example, let's say that we have one lecturer (or more) that has swift changes in his salary: During the month his salary changes but in the end of the month it always goes back to the original salary, meaning that he always gets the same salary in his bank account. I know it sounds weird, but I've seen a lot more confusing cases in financial systems. If we'll bring the lecturer's data to the DWH as it is, we'll have a number of serious problems:

  1. In case that the user filter the report on the date where the lecturer has the temporary salary, the user will get a wrong result because he won't see the "real" salary ("real" is a problematic word in this case. We'll discuss it later).
  2. The rapid changes in the lecturer's salary appendix will cause great growth in its diary. This is problematic because of disk space, but the main problem is that the diary that will be shown to the end-user will be "junky" with a lot of unnecessary records which will prevent the end-user from seeing the true changes in the diary. In other words, we want a cleaner personal diary.
  3. On the other hand, there are other users (especially business analysts) that want to analyze exactly these cases so we can't just delete the records...

Super Transactions

Super Transaction, or ST, is a personal process or change in the man's life which reflects some appendixes and can happen on more than one day. For example, let's say that we have a tutor that got a promotion and becomes a lecturer. By doing that, he changes his Position (tutor -> lecturer), Courses (now he teaches other courses) and the most important of all - the Salary. These changes won't have to happen exactly on the same day. Its position can be changed when he got the announcement but the salary will change only in the end of the month. From the technical point of view, we can say that the man only had some changes in some of his appendixes. But from the business point of view, he has a serious career change aka he been over a process, or a ST.

The requirement is to identify and flood these STs and even let the users investigate and filter by them (You can read more about it in Kimball's book).

So, how we'll make it real? In particular, how we'll let the users see both the original and the cleaner data? First things first, so let's talk about the big picture - the model of the solution.

General Model

To fulfill these requirements we'll define three work modes. The user can switch between them at any time.

  1. Full Mode - In this mode, the user see exactly the data of the source systems. The records filtering and the ST detection won't take part here.
  2. Filtered Mode - In this mode the user will have cleaner records and he will see the ST.
  3. Integrated Mode - The user will see both the original data and the filtered records, so he'll can compare them and have a better understanding of how the system works.

Let's wear the hat of the DWH analyst. If we'll want to add a filter to the table according to the work mode that user has chosen, we'll have a major problem. Every record can be belonging to more than one mode. A record that won't be filtered belongs to every mode. A filtered record only belongs to the filtered and the integrated mode. ST record (we'll talk about it later) also belongs to the filtered and the integrated mode. The naive solution says that we'll add three indicators (boolean columns) that will say for which work mode this record belongs and we'll filter the table using these columns according to the work mode the user has chosen. The problem is it will cause great performance problems so we have to create indexes for these columns. But, creating indexes for boolean records is not very affective because the variance is extremely small.

The solution we found is to split the diary into three tables, one table for every work mode. When the user changes the work mode, the connection will change and the system will take the data from another table. The great benefit in this solution is the performance gain. The query is mapped on the appropriate table and we don't need to filter according to the work mode. In order to implement this, you need a reporting tool that can dynamically change the connection, but I believe that any reporting tool can do it after some workarounds. In Reporting Services, you can do it by defining a Stored Procedure as the Data Source. The SP will get the required schema/table as a parameter and use it in the query (you can see code here). In OBIEE, you can do it by using value-based fragmentation (instructions here). The disadvantage is that you take unnecessary disk space, but we prefer performance over space.

The ETL process now has three targets, one for each work mode. The filters inside the ETL process define in which table, or work mode every record will be loaded.

Cleaning Records

Let's start from the most important thing: It's very important to make the customers understand that our BI system is not a artificial intelligence system and it can't guess which records need to be filtered. We need to make clear that the system will work by predefined rules. This is why the preliminary step is the most difficult: We need to define the rules, test them, improve them, test again and so on. A good example is that: If a person changes its status (in almost any appendix) and then changes it back or change it into another status, this is a "dirty" record which needed to be cleaned. This simple rule can filter a lot of noise in the data. It happens a lot of times that a user of the source system changes something, see that he has mistaken and turning things back. The problem is that some of the system only add rows and don't update them is such cases. This is why it makes sense to filter these records.

After we make sure that we and the customers understand each other, we can start with the implementation. This can be very easy or very difficult, depends on the ETL tool you're using and especially on the complexity of the rules you defined. While the ETL tools usually look on one record at a time, we need to compare the record with its predecessor, the record after it and sometimes even more. In this step, I recommend you to stop, take a paper & pen and start to think how to implement it before you're opening your ETL tool. The most important thing is to check that the results satisfy your expectations. Maybe you'll have performance problems here, but you can come back and handle them in the final steps of the project (personally, I prefer dealing with data correctness in the first steps and performance in the last steps). As I said, after you decide (in the ETL process) whether it's a regular, filtered or ST record, you'll route it to the appropriate target(s).

Super Transactions (ST)

This is the most difficult part of the project. Take a deep breath.
I'll remind you what it is from the business view and this time also from the technical point of view. ST is a personal change, or process that happened to a man. This change will often be reflected in more than one appendix and can happen on more than one day. Remind yourself with the tutor that became lecturer. The salary change happened weeks after the position change. The requirement is to flood that a big change happened here and not only some changes in some appendixes.
Let's go on with the technical stuff. There are some basic questions regarding ST:

  1. What will happen with the records that are contained inside the ST? contained record is a record which is a part of the change (salary change or course change in our example). If all the changes happened on the same day, will we mark the record as a ST record or maybe we'll add a new ST record? If the changes happened on more than one day, will we add a new ST record or maybe we'll mark one of the records as the ST record? Which one? What will happen with the rest of the contained records?
  2. What will happen with records that occurred on the same time as the ST but don't belong to it? For example, let's say that when the tutor became lecturer he also became a father (Mazal Tov!). The diary will have a new record for this change, but it doesn't belong to the ST. What we'll do with this record?
  3. Will we run the ST detection before or after the record cleaning? It will change the results dramatically.

Enough with questions. let's start with the answers. We thought and got to the conclusion that there is a strong connection between the work modes and the ST.
We defined the full mode as the mode where the data is exactly the same as the source data. This is why we won't clean records here and we won't mark any ST records. We want to keep the balance with the source systems. Our greatest threat is that the users will say: "Hey, this is not what I see in my (source) system. This whole BI is crap!!!"

On the other hand, the filtered mode is where our system becomes smarter. Here we can add the record filtering and the ST detection. This is why here we'll mark ST records. Why mark and not add a new record? Because this will break our most basic rule - one man cannot have two records with the same date. This is not just a technical problem (we can add Is_ST column and add it to the table's primary key). This is a conceptual principal. We added a new column named ST_Name.(If it's null then the record is not a ST record).

Which record will we mark as the ST record? Very simple: As part of the design stage, we'll decide (along with the customers) for each ST which is the leading record and we'll mark it. Often, this is the record that will help us to detect the ST. Going back to our example, the ST record is the one where the tutor changed its position. In this record, we'll make ST_Name=Position_Change. Don't forget to mark the contained records as filtered and in the what_changed columns in the ST record, add the other contained appendixes. Back to the example, Write in the what_changed column that also the courses and salary changed as well. We'll also update these appendixes' column in the ST record with the new values.

What about not-contained records (such as the family status change)? Our solution completely covers that. The ST detection only handles with the appendixes that we defined as contained in the ST. Even if the change happened in the same date as the position change, it won't affect anything at all.
Regarding the order of the actions (ST detection or records filtering) - We thought about it and came to the conclusion that it is better to first clean the records because the ST detection process needs to see the best and most clear picture as it can for a better ST detection. We've been glad to see it's right in the tests we did.

Summary

The record cleaning and the ST detection that the BI system does is a quantum leap. The system does not only reflect the source data, but it also add an "artificial intelligence" layer which helps the user navigate the mixture of data. As a matter of fact, this makes the system turn data into knowledge. On the other hand, it doesn't forces itself on the end-user and it enables him to see the original source data. It's important to say that the design is a dynamic and "breathing" process that doesn't end on Day 1 of the system. As time goes by, things and business rules are getting clearer and better so we come back and sharp the rules.

In the next post (the last of the series in my opinion) I'll discuss even more interesting issues such as differential load, how to find ST in the design phase and some of the most interesting reports you can do with our HR DWH. Stay tuned.

Wednesday, January 05, 2011 6:29:09 PM (Jerusalem Standard Time, UTC+02:00)
 Sunday, November 14, 2010

This post is the third in a series about HR BI systems. In the previous post I've showed the basic design problems. In this post I'll show how we solved them. I recommend reading the previous post before reading this one.

I finished the previous post with a lot of questions and zero answers. I have shown that the classic star-schema won't work here because it can't answer the most basic questions in the HR business world. At this point we decided to stop and make the most reasonable move - use the wisdom of others which dealt with similar problems. We opened the classic book of Kimball, who's considered to be the father of DataWarehousing. I sat for a while and read the HR chapter. Between the lines, I recognized that Kimball had the same problems we did, which brought him to model the DWH as follows:

Instead of classic star-schema, we'll build here a very wide table which we'll call the Diary. This Diary will contain all the appendixes. For example, in our case (look at the previous post) the fields are: ID, start_effective_date, end_effective_date, first name, last name, more static fields (which don't change with time), department, more fields from the department appendix, position, more fields from the position appendix, course, more fields from the course appendix, payroll, more fields from the payroll appendix, IsCurrentRow. The basic logic of this idea is that each row was created because there's a record in one of this person's appendixes which was created in that date (or maybe in more than one appendix). We take the data from the record in the corresponding appendix and bring the relevant data for that date from the other appendixes. For a better understanding, I'll show a little example. Here are the four appendixes: (Note that for the sake of the example, a student/lecturer can have only one course at a time)

Department Appendix
012 01/01/2000 History
012 01/06/2005 Computers
Position Appendix
012 01/01/2000 Student
012 01/01/2006 Lecturer
Courses Appendix
012 01/01/2000 Middle Ages
012 01/06/2005 Computer Basics
Payroll Appendix
012 01/02/2006 Basic

After making the "merge" of the appendixes this is how the Diary will look like:

ID Start Date What Changed Name Payroll Department Position Courses End Date
012 01/01/2000 Department
Position
Courses
Miky Schreiber History Student Middle Ages 31/05/2005
012 01/06/2005 Department
Courses
Miky Schreiber Computers Student Computer Basics 31/12/2005
012 01/01/2006 Position Miky Schreiber Computers Lecturer Computer Basics 31/01/2006
012 01/02/2006 Payroll Miky Schreiber Basic Computers Lecturer Computer Basics 31/12/2999

So what have we done here?

We built a diary that combines all the other appendixes. For each record, or point in time, you can see what was the situation of the person from every appendix's perspective. That is, if on date X only the department appendix was changed, you can still see what was the situation in the position appendix. Hence, you can ask business questions that combines several appendixes even when these questions are aggregative. For example: how many students which have the lecturer position get paid basic salary? I invite the reader to read the business questions in the previous post and see how you can answer them using this model. You must understand that with this model you must always fetch only one row per person or otherwise you'll have wrong results. In order to do that you need to filter the table on a particular day (using expression like: where myDate between StartDate and EndDate) or use IsCurrentRow=1 (you can calculate this column in the ETL).

How did we do that?

I won't cover here the entire ETL design, but I'll explain the key principles: First, we build the avenue. The avenue is a small table with only two columns - the ID of the person and the StartDate. We make distinct query on all the appendixes and get, for every person, all the start date from all the appendixes and this is how we build the avenue. Once we have the avenue, we join it with every appendix, and this is how we bring the values from the appendixes themselves. Now, for each start date we have the data of the appendixes where this start date came from. In order to "complete" the data from the other appendixes, we "spread" the missing value from the previous row. For example, on 1/6/2005 the position status hasn't changed so after the join the Position field will be null, so we take the value from the previous row.

The problem of Irrelevance

After making the above, we had a problem that floated very quickly. As always, each column has a foreign key to its dimension table, so the values cannot be null. The problem is that sometimes the values must be null - it happens in the first rows of each person, when not all appendixes has their values yet. For example, look at the payroll column in the "diary" table above. It gets a value only in the last row. We can't have null in this field because we want to have foreign key but we don't have any value yet. What can we do? The solution we found is a "N/A" value which we added in such situations. Note that you need to add a new row with the "N/A" key in every dimension table.

Is that all? of course not! Using this "diary" table we can answer 90% of the business question we faced. This table is basically the whole project. But, there are more challenges we have to solve. I'll leave it for the future posts, where I'll show how we can make the HR BI system smarter.

Sunday, November 14, 2010 10:11:48 PM (Jerusalem Standard Time, UTC+02:00)
 Wednesday, October 13, 2010

This post is the second one in a series about HR BI systems. The previous post dealt with pricing and prioritization of the project. In this post I will describe the initial modeling process - familiarity with source systems, the main/major problems and their effect on the ERD model.

When starting with a new BI system design, the first question to be asked is: What are the business questions that interests the customer? It depends, of course, on who is the client, but that was covered in the previous post. Unfortunately I can't reveal the content world for whom I built the system and therefore to I'll use a University's content world as example. Almost all the questions that the customers wanted to get answered had totals that combine different subjects. For example:

  • What is the average salary in each department?
  • How many professors and students were in each department in each year in the last five years?
  • What is the ratio of students / lecturers in each department in the last years? (Graph)
  • How many transitions between departments (by students and lecturers) took place over the years? Which department is the most abandoned and which is the most popular?
  • Display events of lecturer / student from admission to the university until this day
This list is only a small part of what our customers wanted to know. Like always, there were no answers to these questions because it required integration between multiple sources. While getting familiar with the content world, I noticed several key points that will be treated later on in the project (and in future posts):
  • There's a very difficult problem of terminology in the organization ("one truth"). For example, the most basic terms - a student and lecturer are not consensus. For example, what happens when a student is also a lecturer? What is the lecturer's department if he lectures in more than one department? If a student moved to another department and returned after a semester is it considered to be one, two transitions or maybe none? Of course, different people had different answers to these questions.
  • All sources came in the same format - appendix. For those who not familiar, appendix is a table containing the ID, date and nature of change (kind of a diary on a given subject, "Nigreret" in Hebrew). That is, each record shows only the new / current situation (but the table is historical). When you want to ask a question relating to only one appendix (salary, for example) there is no problem. The problems arise when asking questions that combine several appendixes (salary and department, for example). There is a real need for carrying out the integration of the appendixes. Well, This is what Datawarehouse meant to do.
  • We did not find the Fact table. On one hand it seems that every appendix is a fact table because it describes changes / events that took place in the world, but ... Where are the measures? On the other hand, appendix looks like a Slowly Changing Dimension (SCD) but ... Where's the fact table? No doubt that this is a Factless Fact.

So we're got familiar with the main questions, the sources of information (same format - easier for us) and the unique attributes of this project. Now comes the hard question - how we'll design the data warehouse? Our initial solution looks like this:

In other words - classic Star Schema. The appendixes are SCD and the Fact table is in the center of the star. The fact will contain only one record per person and it will contain only static data of the person, such as ID, name and date of birth. This model takes us back to the questions about how the linkage between dimensions and fact will be made and what is the measure the fact. We tried to think how we answer the most basic questions required by the system (by the way, this is the way to see the quality of the ERD):

  1. For each lecturer, what is the salary, department and current position? - We'll take from every dimension/appendix the current record of the lecturer. The link between the dimensions will be carried out by the fact (lecturer's record) using the ID column
  2. What is the salary, department and position of the lecturer on date X? - From every dimension we'll take only the record which its starting date is from before X and nearest to X and again we'll link the dimensions through the fact table. Hence, we should calculate analytically (analytical functions) for each record its end date (the day before the start of the next record) and then it's easy to query with those kind of questions by using "between" statements.
  3. Average of the salaries in each department - we'll pull out the list of lecturers in each department (from the department appendix). We'll choose a date that interests us, and for each person we'll pull only the record which contains that date. Then we'll do group by. Ostensibly it's not trivial in every classical reporting tool, but let's think for a moment: what do we do here? Group by members of a particular dimension by another dimension counting, where dimensions are linked through the fact table. Looks fine. So - What's the problem? The problem is that we are trying to aggregate/sum by a dimension's attribute and not by a measure, which is not possible in every reporting tool. (a good solution will work with many tools and not just with dedicated tool for a particular ERD)
  4. Questions involving only one appendix - very easy and do not require datawarehouse at all
  5. Show events from the first day in the university until that day (of a certain person) - not possible in classic report tool. Let's say we want to see the "event log" of the lecturer Y, sorted by starting date in ascending order. This requires a union of all the appendixes / dimensions and sort of all records by starting date, where there may be several different appendix records that start in the same date (even very possible. change in the position usually leads to a raise)
At that point we realized that we needed to retreat. That looked like a good ERD at first but the last question is very basic and our model proved wrong. How do we show the event log which is a combination of the appendixes? How can we identify processes when they occurring, such as increase in the amount of courses and salary which leads to a better position? These are all questions that the system should give them an answer and a classic star schema can't help us here.

What have we done? Continued next post ...

Wednesday, October 13, 2010 11:39:33 PM (Jerusalem Standard Time, UTC+02:00)
 Sunday, September 12, 2010
My next posts will discuss the project I just finished (at least its first phase) which is about HR (human resources) BI. In this post, I'll describe the data dictionary feature - why and how we did it.

From the first day of this project I've seen that data dictionary is extremely necessary. Different people uses the same term but they mean different things. There is no consensus about most basic term in the HR world - employee. I asked different people the simplest question: "What is an employee" and I've got different answers - Only full time employee; only one that gets the payment from me, meaning no outsourcing or contractors and so on...

The solution I thought about is a simple but strong data dictionary (aka DD) which contains all the terms and their definitions. The DD is maintained only by one person who is the main customer of the BI system. This is because letting it be edited by more than one person will make the mess rise up again. Now, let's continue to implementation:

Step 1 - Creating the DD
I created a new list in MOSS/SharePoint. The list has only the following columns: Term, Definition. That's it! As I mentioned, only one person has the right to edit this list. Using the help of our infrastructure team, I made a small ETL process which takes the DD from MOSS and writes it into an Oracle table in the DWH. This DD-ETL runs in parallel with the BI ETL and won't fail the whole process if it fails.

Step 2 - OBIEE Administration Tool
In the OBIEE Administration program, import the DD table to the physical layer. Now we have a slight problem - the DD table is not connected to any other table in the model because it's not belonging to the star schema and OBIEE doesn't accept it. To solve it, we'll create a new view (new table -> properties -> Table Type: Select) in the physical layer which will only contains the terms from the DD table (select term from DD_Tabel). We'll call this table/view DD_Dummy and connect it (foreign key) to the real DD table using the term column. Take these two tables to the logical layer (aka Business Model) and only pull the real DD table to the presentation layer. Now we have a working model with the DD table in the presentation layer.

Step 3 - OBIEE Dashboard
If you had read my previous OBIEE posts, you should know by now how to add Javascript code to your dashboard. Add the code in the attached js file and add the attached css too (edit the file res/s_oracle10/b_mozilla_4/custom.css in the OBIEE server).  Create a new report which shows exactly the DD table. In every page you want to have the DD feature, do the following:
  • Add a new section and hide it (Properties -> Format Section -> Use Custom CSS Style -> display:none)
  • Add the DD report into the hidden section
  • Below it, add a text element and inside it write:
    Mark the "Contains HTML Markup" checkbox
  • Add a new hidden text element anywhere you want which will contain the script (you can use the same hidden section if you want) and put there the following script:

That's it. If you did everything right, the magic will start immediately. Moving the mouse cursor above any of the column headers in every table in your page will popup a little tooltip with the definition of the column header. Note that if you already have my old javascripts in your OBIEE you might have some functions duplicated. Delete the duplicated functions manually.

DataDictionary.css (.36 KB) DataDictionary.js (2.42 KB)
Sunday, September 12, 2010 8:32:16 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, May 10, 2010
From time to time I find myself writing a hierarchial query in Oracle DB. It helps me a lot when I need to understand hierarchial data, mostly resides in the source systems of the Datawarehouse. Yesterday I've seen that a lot of my colleagues haven't heard about these kind of queries, even the most experienced ones. So here is a tiny tutorial for this subject which you can go back to any time. The important thing here is that you'll know that these kind of queries exists even if you'll google it the next time you'll need it (search for "start with" or "connect by").
Let's say you have a table which looks like this:
Key Father Description
1 10 Renault
2 20 Jambo
3 20 Airbus
4 10 Hyundai
5 10 Toyota
6 10 Honda
7 20 F16
8 20 F15
9 10 Alpha Romeo
10 -1 Cars
20 -1 Airplanes
-1 -99 Transportation

You want to see the data in a tree form, meaning that you'll see Renault, its parent, its grandparent and so on. All you need to write is the following query:

select * from MyTable
start with KEY = 1
connect by prior FATHER = KEY

That's it. The result is:
Key Father Description
1 10 Renault
10 -1 Cars
-1 -99 Transportation

Note that the recursion stops only when it cannot find the next parent. In addition, many times you'll want only one row from the tree (the root, its son or some other criteria). In this case you can simply add a where clause to the query.
Unfortunately, doing the same with SQL Server is much more difficult (tell me if I'm wrong).

Enjoy.

Monday, May 10, 2010 10:57:55 PM (Jerusalem Daylight Time, UTC+03:00)
 Wednesday, March 03, 2010
One of the basic requirements of a modern BI system or Data Warehouse is to be non-volatile, meaning that the DWH will save the history. In this post I'll explain how to implement it in the DWH while assuming that the history can be changed and (the more interesting part) how to implement it in Analysis Services.

Before starting with history saving in the DWH, I need to say that this has nothing to do with Slowly Changing Dimensions. You can work with Type 1, 2 or 3. All you need to take care of is that the fact table will be consistent with the dimension tables. Assuming that the history can change (this is the scenario in my organization), we need to extract all the fact data every time. If you work with staging level/environment/layer you can use it to calculate which records has been changed or inserted and only upsert them to the DWH itself (it's more secure). In our case, we'll work without such staging level so we need to load all the fact data in every load to the DWH.

Loading the fact in the ETL is very simple - we'll save the datetime of the execution start, round it and put it in every row in the fact table. We can call this field CreateTime. You probably ask yourself why we need to round it. The answer is that we don't want the time dimension to have every possible value of datetime. It's huge and the users don't need that kind of accuracy. That's why I round it in my projects to half-hour, meaning that 14:27 becomes 14:00 and 14:50 becomes 14:30. Let's look at an example of such fact table:

RowNum

Dimension1

Dimension2

Measure1

Measure2

CreateTime

1

X

A

4

20

03/03/2010 15:00

2

X

B

56

19

03/03/2010 15:00

3

Y

B

2

3

03/03/2010 15:00

4

Y

B

1

10

03/03/2010 15:00

5

Y

C

7

5

03/03/2010 15:00

6

X

A

4

20

03/03/2010 15:30

7

X

B

56

40

03/03/2010 15:30

8

Y

B

2

3

03/03/2010 15:30

9

Y

B

10

10

03/03/2010 15:30

10

Y

C

7

5

03/03/2010 15:30

Here we can see two executions of the ETL: The first between 15:00 and 15:00, the second between 15:31 and 15:59. Note the differences in rows 2 vs. 7 and 4 vs. 9. This means that the source table(s) has changed between the executions.

In the DWH, we create foreign keys from Dimension1 and Dimension2 to their dimension tables. We'll also create FK between CreateTime and the time dimension. I won't get into the time dimension design here, you can read about it a past post. I assume that the time dimension has the hierarchy Year -> Month -> Day -> Hour -> Half-Hour. It seems that now we have a good fact table for use in Reporting Services (or any other reporting tool). Create a parameter called Extraction Time/ETL Time and assign it values from distinct on CreateTime. The user can pick a date and the report will filter on this date. Perfect.

The problematic part starts in Analysis Services. Only one extraction time can be the current member. There is no sense in summing or aggregating more than one extraction time. There are two ways to deal with it:

The wrong way - I'm writing this down so you'll learn from this design mistake. We created a calculated member which takes the last extraction time:

Create Member CurrentCube.[Extraction Time]. [Calendar].[All].[Last Data Time]
AS
Tail ([Extraction Time].[Calendar].[HALF HOUR].Members,1).Item(0),
Visible = 1;

In every Panorama view, we sliced the Extraction time on the [Last Data Time] member, which will show the user the data of the last ETL run. Our basic assumption was that the user wants to see the current data and when he'll want to see the historical data he will pick another member from the Extraction time dimension. This is the point we missed and it soon  enough bumped into our face.

We forgot that every system, even BI system which is built for analysts should be intuitive and idiot-proof. The users picked members from the day level in the Extraction time dimension. When we asked them why they did it they said that "we wanted to see the data from 03/03/2010 and suddenly the numbers got bigger!". That happened, of course, because the member 03/03/2010 is the sum of 03/03/2010 15:00 and 03/03/2010 15:30. The solution is that no matter which member the user picks in the Extraction time dimension, he should see only one leaf (half-hour level) member at a time. This takes us to

The right way - We set the [Last Data Time] member to be visible = 0 (I kept it for debug purposes) and I written MDX script which I call NonLastEmptyLeaf. This script makes sure that no matter which member in the Extraction time hierarchy will be picked, the user will be sliced only on the last non empty leaf. For example, if the user picks 03/03/2010 he will see the data of 03/03/2010 15:30. This is what will happen also when he'll slice on 03/2010, 2010 and even the [All] member. This is the MDX script:

Freeze([Extraction time].[Calendar].[All]);
Scope(
  Descendants (
   [Extraction time].[Calendar]. [All],
   0,
   SELF_AND_AFTER
  )
);
  this = Tail (
   Filter(
    Descendants(
     [Extraction time]. [Calendar].CurrentMember,
     [Extraction time].[Calendar].[HALF HOUR],
     LEAVES
    ),
    [Measures].CurrentMember > 0
   )
  ).Item(0);
End Scope;

Now, there is no meaning of [Last Data Time] member because slicing on the [All] member will show the current data, meaning that no slicing at all will show the current data because [All] is the default member.

To conclude, history browsing can be dangerous if we let the users the option to slice on more than one data time. The open question that I have is how can I prevent the user from picking more than one member in this hierarchy in Panorama views? Should I even try to do that or should I count on his minimal common sense?

Wednesday, March 03, 2010 9:34:48 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, October 13, 2009
Two weeks ago I did a talk about SCD. Here is the presentation.
I think that even if you're doing BI (design or development) for quite a while you'll find this presention resourceful and maybe you'll learn a thing or two...
Have a nice learning!
Tuesday, October 13, 2009 11:38:41 PM (Jerusalem Standard Time, UTC+02:00)
 Thursday, September 24, 2009

Last August, I've seen an Informatica webinar featuring Dr. Ralph Kimball (you can see it here). The topic of the webinar was "Reverse the flow of Data Warehouse Design to Ensure Success". In order to make long things short, what he says is that in the DWH's first steps you don't and can't know what are the full requirements. This is quite problematic, because the first steps of the DWH - its design - are the most crucial steps, where you take the most important decisions which will be hard to change later. His solution is revolutionary and sounds a little bit hallucinatory - Start with the website. After you have the front-end finished, whether it's a dashboard website or just a simple tabular report, and you have the customer's approval, you can start the DWH design and you know exactly what you should do and what you should not.

Thinking of that, we decided that he's right but it's too much. The cost of building a website with a data source behind it which will be changed 100 times in a short period is very high. Instead, we're working in a defferent method - We are working with Mockups.

Our mockups are PowerPoint presentations which mock the dashboard website that will be the project's front-end. We're using a lot of buttons and hyperlinks between the different slides in order to create a feeling of a real website with navigation and interactivity. When we show it to the customer and tell him that this is what his dashboard will look like, at first he's very happy and then the new requirement's phase begin: Can I have this button here? This one more slicer there? What about replacing this chart with this one? etc. In PowerPoint presentation, making these changes is very easy and after a couple of days we have a good understanding of how our front-end will look like. Threfore, we fully understand how our DWH will look like. This is a great method, I recommend you'll start using it in your next project. Try and see for yourself!

There are some more points you need to know before starting using it:
  • You're not have to use Power Point for this. Visio and even Paint Brush can assist you. Nevertheless, I recommend you to use a tool for this (remember that the mockup will be changing a lot). There are many free & open-source tools to do it.
  • You may consider to have the mockup to be your contract with the customer. On the other hand, the final result may look a little bit different. You can draw everything in PowerPoint but in the reality things may be hard to implement.
  • After finishing the design phase, don't go over and over to the mockup to keep it updated. Again - The reality is stronger than the first design. When the project will be finished, no-one will care about the mockup.
Thursday, September 24, 2009 9:49:44 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, July 20, 2009
The BI team and the source system team has a great bond between them. When the source system's team wants to change the system's logic or the physical tables, they must inform the BI team as soon as possible, because the BI team must change their system too. Otherwise, the ETL will fail and the users will see old or no data, or in the worse case the users will see incorrect data.

On the other hand, when the BI system shows data which is not the same as the data in the source system, the responsibility of showing that the BI system is OK is on the BI team's shoulders. It needs to investigate the problem and see if the error is on the BI side or in the source system side. Maybe there is no error at all and the difference is caused by difference in the logic.

Sometimes, we (the BI team, of course) investigates the data problems and find that the error resides in the source system itself. Then, we need to make sure that the error is fixed in the system, whether by the IT staff or by the power users. The important point here, and this is why I decided to write this post, is that the change must be done first in the development environment of the source system. Then, we'll run the ETL process and see that everything is fixed. This is important because sometimes we think that the change will make things right and it's not. That's why everything should be changed in the development environments first in both systems.

Only after we've seen that the data is fine, we can make the change in the production environments. Last tip: Make sure that the change is also made in the source system's production environment. Sometimes the time interval between these events is long and the source system's team just forgets all about it. Write it down as a task before going to production.

Monday, July 20, 2009 7:24:28 PM (Jerusalem Daylight Time, UTC+03:00)

Last Week we had a funny case which took us much time to understand.

In our development environment, we like to have the same source data for a long period of time. This way we're familiar with our data and we can expect the numeric results in the higher layers (OLAP, dashboard, etc.). When we have a stable environment and we think that our system gives the correct data, we ask our infrastructure team to refresh our sources, meaning that they bring new production data to the development environment.

The situation that we had last week is that we refreshed our sources and from that point the data didn't make any sense. We asked the infrastructure team if they're sure they refreshed the data and they said they are sure. We checked the system and the data many times and seen that the data is different from what we had but still it didn't make any sense.

After much frustration, one of us made a suggestion to write sql query in this form:
select max(some_date_field) from fact_table

Then we were all amazed. After much time of sisyphean work we realized that our data is not new at all. Moreover - our old data was old too. The maximum date in our main fact table was half a year ago. The reason was that we took our data from a legacy system which runs a job that exports their tables into files, and our infrastructure team imports these files into our source tables. The problem was that no-one ran the job so the files themselves were old, meaning that our refresh activity means nothing.

Conclusion: Never trust your sources. Always run a sql query like mentioned above to check that you have recent data and that the data you think you have is the data you have. You can think of more queries that validates that what you have is what you expect to have. Run them every morning to make sure you're ground is stable. It's only a matter of seconds and it's worth it.

May the source be with you.

Monday, July 20, 2009 5:50:02 PM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, April 07, 2009
Very easy. In Sql Server:
Select top n * from my_table
In Oracle:
Select * from my_table where rownum <= n

It can be very useful in many many cases. For example, you're designing DWH over a system and you're looking at a certain field in one of its tables. You want to know which values this field contains, but fetching "select distinct my_field from my_table" takes too much time. Instead, if you believe know that the data is well distributed, you can use "select distinct my_field from my_table where rownum <= n". Use 1000 for n in the first trial and add one zero in the end of the number n every time until you got a query that takes too much time than you want to wait. after you got the n you can live with, can use the values you have in your query result.

Wednesday, April 08, 2009 5:54:28 AM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, March 22, 2009

There are some things you can only learn in the hard way. It didn't happenned to me personally but to my team friends, so I consider it as it is my bad.

We upgraded our ETL tool - Informatica, from version 8.5 to 8.6. We had to run some tests to see that the results are the same. So, what we did is to save the result table from 8.5 in Excel, save the result table from 8.6 in Excel and then compare them using Excel-built-in functions. The only problem is we didn't pay attention to the places where zero and null interchanged. This happened because the two versions act differently where null values take places in aggregation functions. For example, when there is a sum function and it aggregates only on null values. In one version the output is zero and in the other the output is null...

This can also happen in other tools and technology. For example, in OLAP cubes the difference between zero and null is the difference between seeing the member of the dimension on the screen and not knowing of its existence.

For conclusion, always be aware to this point and don't forget to check it.

Monday, March 23, 2009 6:09:59 AM (Jerusalem Standard Time, UTC+02:00)
 Friday, October 17, 2008

This month we're really busy with a very important project and a short schedule. This made me think of ideas for agile development for BI, but I'll leave it for other time for now. In order to make us better BI developers, I decided to take one Pragmatic Programmer principle and use it. I took one of the most important (for my opinion) principles - DRY (Don't Repeat Yourself). The DRY principle says that "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system". In classic programming it's simple to use: Use methods and generic classes to implements logic that will repeat itself in the project. But how do you do it in BI development? Here are some ideas I thought and even implemented some of them in my environment. Every layer/step in the BI development has it's own bulletin. I'll be happy to hear more from you.

  • First of all - use functions in your DataWarehouse's database. Do it as much as you can. Do not repeat any logic twice or more, no matter if it's in procedures, views or even CLR functions.
  • We all have much logic that repeats itself in the ETL process. For example, we found ourselves doing over and over the next process: When we build a fact table, we take every cell that points to a dimension table by a foreign key and "looking up" if it's found in the dimension table. If it's not there we replace it with Undefined, UD or null. That makes us feel very bad because we feel that we're doing the same all the time and it gives us the feeling of machines rather than programmers. The solution for this problem (and many other) is to build our own tasks (in SSIS) or transformations (in SSIS & Informatica). Alberto Ferrari did a beautiful work in this field in SSIS. I'll add some transformations of my own once I'll have release-ready versions of them.
  • My co-workers just loves the Calculated Member feature in the Data Source View in SSAS. In enables them to make a new column without making a view and with no touch in the underlying database. The problem here is that after a while we have a LOT of calculated members, many of them repeats themselves and when you look for logic you lost, you can look for hours in the never-ending DSV. The solution here is not using calculated members at all. Put all your logic in the database (and as I said - in functions). The only place where you should use calculated members is where you must - when you have no write permission to the DataWarehouse or when you build your DSV over an operative database and you don't have write permissions.
  • The same is with Names Queries in the Data Source View in SSAS. Don't use it.
  • There's much logic that you can do only in MDX. Here, the problem is that MDX scripts are defined over cubes and not over dimensions, meaning that if a dimension has MDX logic you have to repeat it in every cube's MDX script. The solution is to add the MDX programically using AMO. Every time the ETL process ends, it should run a program that takes the MDX script from a single file and place it in every relevant cube. I know it sounds a little bit wacky and I even didn't do it myself, but for what I know, it's the only solutions for DRY in MDX.

As I said, I'll love to hear your ideas about this topic.

Friday, October 17, 2008 9:46:22 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, April 22, 2008
I'll start from the bottom line: If you create your Data Warehouse and you follow the DW rules, you're life will be easy (assuming you know the semantics and the way to build a good and correct DW).
In our case, if you have a dimension and you want to make a Parent-Child hierarchy, your life will be easy if you built the dimension's table in the right way.

For example, let's look at a simple Time dimension:

Time_KeyDayMonthQuarterYearLevel_Num
01012008
01
01
1
2008
1
30122008
30
12
4
2008
1
01/2008

01
1
2008
2
12/2008

12
4
2008
2
Q1/2008


1
2008
3
Q4/2008


4
2008
3
2008



2008
4

As you can see, this time dimension's table contains the days 01/01/2008 and 30/12/2008 and their parents in the levels: month, quarter and year.
Now, let's say that I need to take this dimension and make it a Parent-Child table. This is very simple. Just create a view with one new column which will be the parent column. This is the new column's code (in Pseudo-SQL):
if (level_num = 1) then Month + "/" + Year
else if (level_num = 2) then "Q" + Quarter + "/" + Year
else if (level_num = 3) then Year
else null

The result:

Time_KeyDayMonthQuarterYearLevel_Num
Parent
01012008
01
01
1
2008
1
01
30122008
30
12
4
2008
1
30
01/2008

01
1
2008
2
1
12/2008

12
4
2008
2
4
Q1/2008


1
2008
3
2008
Q4/2008


4
2008
3
2008
2008



2008
4
null

This is it. Now, in your OLAP DB, just configure the new column as the parent and you have a Parent-Child hierarchy. In Analysis Services you even don't have to create a view. In the Data Source View you can add a named calculation and put your code there.

See? When you create your DW according to its rules, the life is easier. In this example, if you created rows for every level in the hierarchy and created a descriptive key - everything is great.
This can help you in many scenarios. For example, when you find that your dimension is not balanced then you might want to make it a parent-child, so you won't have many pseudo-levels when the only relevant member is the leaf. Otherwise, this can be very annoying to your user. In SSAS, make sure you don't use it too much, because it is bad for performance.
Wednesday, April 23, 2008 4:06:32 AM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, January 15, 2008

Although SSAS willl let you use them, some other application such as Panorama won't function properly. I'll give one example: When you perform Drillthrough in Panorama, on the fly the engine gererates a web page that will take the user to the next view. The next view will be sliced as the current view, so this web page needs to pass the dimensions parameter (meaning - the current slices). That's why this web page contains this line:

AttachParameters("Slicers","%Slicers%")

A big problem will occur if one of the sliced dimensions will be sliced on a member with a name that contains inverted commas ("). The JavaScript will result an error because there are three inverted commas in the second parameter of the line. This is only one example of what can happen in a BI consumer program if you'll use special characters in member names. So - be careful not to pass these characters from the DW (build the ETL so it will drop these characters) or giving those names in SSAS, such as the All member name.

Tuesday, January 15, 2008 11:08:32 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, January 08, 2008

I feel like I don't have the right to write about it after so many bits of information were moving in the web about this subject but I would like to add my point of view (or in fact - my point of code). First of all, I must mention some of those who written about this subject before me. Mosha Pasumansky wrote a long post about it in last may. This post contains some ideas of how to come along with this problem, but none of them is perfect. In fact (as always) - there is no perfect solution for this problem. Another important source of knowledge can be found here in the MSDN forums, where Chris Webb, Thomas Pagel and others discussed it. Now, I would like to add my solution. Take it or leave it - your choice.

First of all, create a column in the time dimension that will be the current day indicator. Thanks to the Data Source View or the UDM approach of SSAS 2005 you don't have to change the relational table itself. Just add a named calculation in the DSV with expression that will be 1 for the row of the current day and null or zero for the others. The expression syntax itself depends on the underlying DB so I won't write it here, but it's very simple. Add this column as an attribute (let's call it CurrentDayInd) in the dimension structure and select its AttributeVisible propertiey to false. That's because we don't need such Attribute Hierarchy in our time dimension. After that, create a new User Hierarchy (you can call it Current Day), where the first level in it will be CurrentDayInd and after it place the day (key) attribute.

Now, what we have here? We have a hierarchy (Current Day) with two members - 1 and zero. The 1 member has only one child which is the current day. Link that member to your regular hierarchy (it's called YSQMD at my place) and here you have it. For example, you can use it that way in the MDX script:

Create Set [Last 30 Days] as
LinkMember([Time].[Current Day].[CurrentDayInd].&[1].Children.Item(0),[Time].[YSQMD])
:
LinkMember([Time].[Current Day].[CurrentDayInd].&[1].Children.Item(0),[Time].[YSQMD]).Lag(30);

I think that this solution is good and elegant. First of all, you don't have to use external functions such as Now(). The second pro is that we use the native OLAP mechanism which saves time and makes the queries run faster. Believe me, you'll feel the difference with big cubes. The last thing is that this solution is easy to understand (at least I think so) and it is easy for maintenance. The big con is that you have to process this dimension (and related cubes) every day. I don't think that it's so bad because most of the organizations do make process every day.


 

Wednesday, January 09, 2008 5:39:58 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 29, 2007

This is the second bug I found when working with AS and Oracle as my database (you can read about the first one here). Some background about our Datawarehouse architecture before I begin to complain about Microsoft:

We cannot afford risking that our users will experience faults or crashes while we refresh our DWH, process our cubes, etc. What we do to solve this problem is called a switch: We have two schemas in our Oracle DB. While our users watch the first schema, we update the other one. Only after we finish all our load process, we switch our user's tools to see the data from the second schema. In order to implement this architecture we use Oracle's synonyms. Let's say that the users watch the fact table "sales". We have a synonym which is called sales_syn. While it's pointing to the first schema (schema_a.sales_fact), we're loading into the second schema (schema_b.sales_fact). After that, we switch the synonym so it will point to the second schema (schema_b). The users always look on views that rely on synonyms. The views never change, only the synonyms do.

The problem starts when we make Named Queries in our Data Source Views in AS. Apparently, AS looks inside the view that we enterd into the DSV, finds the target of the synonym and saves it. Even after we make the switch (the synonym points on the other schema), the named query will be pointing on the first one. All our efforts to edit the named query have failed. When we open it again, we will always find that the first schema is "burned" there and cannot be changed.

The solution we chose is simply avoid using named queries. If we need a simple calculation we can add a Named Calculation in the Data Source View and if we need a complex view over our fact table, we write it ourself in the DB. This solutions breaks a little bit the main point of the data source view (designated place for all the logic of the DWH), but it is the best solution we could think of right now.

Dear Microsoft developers - it seems that you tried to be smart and look into our Oracle objects in order to enhance the multidimensional database's performance. Next time, please think twice before you do.

Tuesday, October 30, 2007 3:24:36 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 22, 2007
I believe that every BI developer seen this in many Data Warehouses: Boolean Dimensions. As you may guess, boolean dimension is a dimension with only two members and of course with no hierarchy. For example: cash/credit card in sales cube, exists/not exists in inventory cube, etc. If you haven't seen this phrase before - relax - I just invented it. :-)
Now, the question is what to do about these dimensions:
a. Include them in the ETL process or just leave it as is?
b. If you put it in the ETL - how would you implement it?

Here's what I did in my project. You may disagree with me and I would like to see other approaches too.
a. Yes, I included it for some reasons. As every Pragmatic Programmer knows, everything can be changed so do not assume anything as globally-static. This rule takes place in here: Boolean dimensions may grow and have more members. For example, in the sales cube I mentioned above, maybe there will be another way to pay such as exclusive card of the shop (There is a network here in Israel who has it). Even male/female boolean dimension may have an Unknown member. So never exclude these dimensions from your ETL process. Wait - one more thing. You may think: Why interrupt my ETL process with these silly dimensions? If they'll grow up I'll add them to the process. As an answer think about the timings: You can never know how much time the dimension's ETL will take (although it will be very small), so in order to stay away of surprises - include it in your ETL process. just for case.
b. I implemented it as two hard-coded expressions and sent them to union. The result of this union will enter directly to the target table. In Informatica, the mapplet can't start without source table so just put a dummy table with only one row and connect it to the expression items. Why only one row? If the table will contain more than two rows then the Informatica server will consider the process as failed one.

As I said, I'll be happy to read other approaches other than mine.

Monday, October 22, 2007 7:44:45 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 15, 2007
I really think that the time dimension is the most complex dimension in 90% of the DWHs. The complexity is in two places: In the DWH design and also in Analysis Services (or any other BI tool).
First of all - why we didn't take the already-made Server Time Dimension which exists in SSAS 2005? For two reasons: The first is that the Project Real guys do not recommend using it (you can find their SSAS article here). The second is that we wanted to have some features that are not available in the server time dimension, such as Hebrew date. In a matter of fact, even if we didn't have such feature we still would build the time dimension ourselves because it's giving you much more control over the dimension. For example, you can always add some new attributes which Microsoft developers didn't think about.
I started myself to build the time dimension in excel. I figured out that this mission is little more complex that I thought it would. Most of the functions I wrote were simple, but there were some complicated ones. So here are some tips for you if you want to build your time dimension using Excel:
  • If you want to week number for every date, do not write the function yourself... Excel has function called weeknum. If you don't have it just add the function toolbox which has it (I can't recall its name right now. check in excel help).
  • If you want to have records for every level in your hierarchy (not only for days), put every level in its own excel file (not excel tab). It will help you later when you will transfer it to your DB.
  • Check yourself. Pick randomly some dates and check that all of its record has correct data.
After building the excel files I needed to transfer it to my Oracle server. I used SSIS because I didn't want to wait for my DBA to copy these files into the Informatica server (it can't use my the local files, it has to be in its server. SSIS can use local files). This also was a little tricky. First of all, close excel when running the SSIS packages, otherwise it will fail. Second, when moving the non-leaf levels, go into the columns section in the destination box and erase the irrelevant columns. It will reduce the chance for errors. Finally, click on the source box and click on "Show advanced editor". Enter the source's output columns options and define properly the columns' data types. This also will reduce the chance for errors.

I had a little bit of an argument with my DBA about how should the time dimension be. I think that the time dimension does not have to be processes at all. My time dimension is from 1960 until 2020, so no daily ETL is required. She says that all the logic has to be in Informatica so I need to develop a mapping for this dimension. I think that we both are right and that's because that in ideal world she is right. In every developers team, all the BL has to be in one place. But we don't have much time (the deadline is very close) so I won't spend the time building more mapping in Informatica when I have the time dimension already made in excel.

Maybe someday I will have the time to do this. Maybe not.

Monday, October 15, 2007 8:18:38 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, October 14, 2007
I guess that this will not be my last post on this subject, but I want to start sharing some thoughts and tips from my experience when designing and building DWH. In this post I will focus on the fact & dimensions tables relationship in terms of data completeness (if you wonder what it is, read on).

Before you start to design the DWH, sit and talk with the people who built the systems which you take your data from, including the DBA. For every table, ask them what is the primary key (it's NOT always defined properly in the DB), then ask them again and then ask them if they are sure. It happened to me that I discovered that the systems guys were wrong about their DB's primary keys.
The same thing is about Foreign keys and here you should be even more careful. Even if they claim so, check yourself that every foreign key in the fact table is placed correctly in the dimension table, especially when the fact table has far history records. Sometimes system developers or even worse - system DBAs delete records from the dimension tables that are not relevant. This will cause that these keys will still be in the fact's history records but will not be found in the dimension table, causing uncomplete relationship between the fact and the dimension table.

So far is about the part when you talk and "investigate" the system developers (the DWH design). What to do when you actually developing the DWH? First, develop the tables of your dimensions tables. Do not forget to add the primary keys in the dimensions tables and the primary and foreign keys in the fact table. Then develop the ETL processes and go for the dimensions first.  If you know that the dimension has completeness problems with the fact table that you will develop later (you talked with the system developers, remember?), add UNDEFINED (UD key) record for the dimension table. Later, when developing the fact table's ETL process, make Join with the dimension table and check that the records' foreign key exists there. If not - change the key to UD. In SSIS and Informatica (and I guess that also in other products I don't know, such as DataStage) you can use Lookup instead of Joiner if the dimension table is less that 1G records. That will optimize the ETL process. After you developed all your ETLs, run the dimension processes. After they finish (assuming everything went OK) run the fact table's ETL process. If it succeeded you can go and have a drink. If not - check what went wrong. If you want to know which keys didn't showed up in the dimension table and causes the incompleteness problem, you can disable (not delete) the foreign key from the fact table and run the process again. Then, with a simple SQL query, check which foreign keys don't exist in the dimension table. Go back to your ETL design and check what you did wrong. As I pointed before, in this step you might be very angry at the system developers...

That is all for now. As I said, I assume that more ideas will come on in the future.

Sunday, October 14, 2007 7:11:49 AM (Jerusalem Standard Time, UTC+02:00)