Wednesday, October 13, 2010

This post is the second one in a series about HR BI systems. The previous post dealt with pricing and prioritization of the project. In this post I will describe the initial modeling process - familiarity with source systems, the main/major problems and their effect on the ERD model.

When starting with a new BI system design, the first question to be asked is: What are the business questions that interests the customer? It depends, of course, on who is the client, but that was covered in the previous post. Unfortunately I can't reveal the content world for whom I built the system and therefore to I'll use a University's content world as example. Almost all the questions that the customers wanted to get answered had totals that combine different subjects. For example:

  • What is the average salary in each department?
  • How many professors and students were in each department in each year in the last five years?
  • What is the ratio of students / lecturers in each department in the last years? (Graph)
  • How many transitions between departments (by students and lecturers) took place over the years? Which department is the most abandoned and which is the most popular?
  • Display events of lecturer / student from admission to the university until this day
This list is only a small part of what our customers wanted to know. Like always, there were no answers to these questions because it required integration between multiple sources. While getting familiar with the content world, I noticed several key points that will be treated later on in the project (and in future posts):
  • There's a very difficult problem of terminology in the organization ("one truth"). For example, the most basic terms - a student and lecturer are not consensus. For example, what happens when a student is also a lecturer? What is the lecturer's department if he lectures in more than one department? If a student moved to another department and returned after a semester is it considered to be one, two transitions or maybe none? Of course, different people had different answers to these questions.
  • All sources came in the same format - appendix. For those who not familiar, appendix is a table containing the ID, date and nature of change (kind of a diary on a given subject, "Nigreret" in Hebrew). That is, each record shows only the new / current situation (but the table is historical). When you want to ask a question relating to only one appendix (salary, for example) there is no problem. The problems arise when asking questions that combine several appendixes (salary and department, for example). There is a real need for carrying out the integration of the appendixes. Well, This is what Datawarehouse meant to do.
  • We did not find the Fact table. On one hand it seems that every appendix is a fact table because it describes changes / events that took place in the world, but ... Where are the measures? On the other hand, appendix looks like a Slowly Changing Dimension (SCD) but ... Where's the fact table? No doubt that this is a Factless Fact.

So we're got familiar with the main questions, the sources of information (same format - easier for us) and the unique attributes of this project. Now comes the hard question - how we'll design the data warehouse? Our initial solution looks like this:

In other words - classic Star Schema. The appendixes are SCD and the Fact table is in the center of the star. The fact will contain only one record per person and it will contain only static data of the person, such as ID, name and date of birth. This model takes us back to the questions about how the linkage between dimensions and fact will be made and what is the measure the fact. We tried to think how we answer the most basic questions required by the system (by the way, this is the way to see the quality of the ERD):

  1. For each lecturer, what is the salary, department and current position? - We'll take from every dimension/appendix the current record of the lecturer. The link between the dimensions will be carried out by the fact (lecturer's record) using the ID column
  2. What is the salary, department and position of the lecturer on date X? - From every dimension we'll take only the record which its starting date is from before X and nearest to X and again we'll link the dimensions through the fact table. Hence, we should calculate analytically (analytical functions) for each record its end date (the day before the start of the next record) and then it's easy to query with those kind of questions by using "between" statements.
  3. Average of the salaries in each department - we'll pull out the list of lecturers in each department (from the department appendix). We'll choose a date that interests us, and for each person we'll pull only the record which contains that date. Then we'll do group by. Ostensibly it's not trivial in every classical reporting tool, but let's think for a moment: what do we do here? Group by members of a particular dimension by another dimension counting, where dimensions are linked through the fact table. Looks fine. So - What's the problem? The problem is that we are trying to aggregate/sum by a dimension's attribute and not by a measure, which is not possible in every reporting tool. (a good solution will work with many tools and not just with dedicated tool for a particular ERD)
  4. Questions involving only one appendix - very easy and do not require datawarehouse at all
  5. Show events from the first day in the university until that day (of a certain person) - not possible in classic report tool. Let's say we want to see the "event log" of the lecturer Y, sorted by starting date in ascending order. This requires a union of all the appendixes / dimensions and sort of all records by starting date, where there may be several different appendix records that start in the same date (even very possible. change in the position usually leads to a raise)
At that point we realized that we needed to retreat. That looked like a good ERD at first but the last question is very basic and our model proved wrong. How do we show the event log which is a combination of the appendixes? How can we identify processes when they occurring, such as increase in the amount of courses and salary which leads to a better position? These are all questions that the system should give them an answer and a classic star schema can't help us here.

What have we done? Continued next post ...

Wednesday, October 13, 2010 11:39:33 PM (Jerusalem Standard Time, UTC+02:00)

Hey, look! A post about OBIEE with no Javascript!! ;-)

OBIEE lets you have either actual values or percent in the data labels in pie charts. I've found a trick to show both: add a new column (I called it concat) in the criteria tab which is the concatenation of the entity (shop name, for example) and its measure value. This is a string column, so we'll need to convert the measure into string. It'll look something like that: LOGICAL_TABLE.ENTITY_NAME || ' ' || cast(FACT_TABLE.MY_MEASURE as char). Edit the graph/pie view and make the concat column be the legend axis and the measure be the pie values axis. In the "Chart Type Special" menu (the button with the star) choose "Show values as Percentage of total". The result will be that the pie values will have both entity name, actual value (because now, for OBIEE, its part of the enitity name) and the percent.

Enjoy!

 |  | 
Wednesday, October 13, 2010 6:34:04 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, October 12, 2010

After two months of writing only about OBIEE, it's time to write about what I did besides that. The main reason I have not written on anything else is a load of work on the HR BI system as a manager. In the future posts I will discuss the BI system that have taken my precious time lately and I will review the solution bottom up, i.e. from the pricing stage, going through the data warehouse structure to the implementation of the ETL and reports, of course. In this post I'll write about pricing and prioritization .

Part I - Pricing

I'll start with several leading questions in order to stimulate thinking.
When we start estimating prices of the BI project, how do we do it? The intuition of most of us (so I believe) is to get as much initial information about the project's content world as we can: What are the main problems, what is the result the customer wants to get, etc. Then we estimate the cost of the project, just like that. This rises the question of how accurate the assessment we've got? Furthermore, frankly - what is it based on? Past experience? Who's? Do we know how to explain it to our colleagues? Can update it when there are material changes to the requirements? If yes - how?
The purpose of this post is formalizing, even a little bit, the process of pricing the project. In the second part we will also prioritize.
Pessimist: How can it help? After all, in every case it is estimation. There are a lot of parameters in the project and we can't put every single one into the equation and even if it was possible, most are unknown parameters in the initial phase of pricing!
Miky: That's right, but the approximation can be done in a formal way and based on experience of others. It is much better than just a wild guess of numbers.
So we realized what we need this for. Let's start.

We want to price the project in terms of time (time can also be translated into money, of course). To start, we need to know and be familiar with the various processes in the organization that we can implement. To do this we will have to sit down with key people of the organization to its various parts (departments, sections, you name it) and just sit and learn. After we learned the various processes (converting these "conversations" to a list of processes is the result of experience and talent in information systems and BI systems in particular, So you may want to get help if you are new in this field) open Excel and write a list of all processes in one column. Then, we add the following columns:

  • Content world - column that contains the name of the process or content world
  • Short description - so those who work after us will understand what we meant
  • Times (that are required) - whether the analysis of process is required in the past, present or future (of course there could be a combination of them). This column is critical because any change in this could change dramatically the length of the project
  • Priority in HQ - how much they interested in this process. Number from 1 to 5.
  • Priority in field - how much they (intermediate level managers to the production floor) are interested in the implementation of the process. 1 to 5.
  • Source Systems - which systems are required to bring the information required to analyzing the process
  • Complexity of bringing this data - based on previous column, how complex is to "put your hands on data." The difficulties can be technological (legacy systems, systems with no interface exists, unknown technologies) or political (the DBA of the system does not cooperate, there is already a "mini-BI system" and they don't want to replace it) 1 to 5
  • Complexity of the process - with no relation to the previous column, how much the process is complex. To simplify the matter, what will be the length of the design document? 1 to 5
  • Additional notes
In this simple excel sheet we've gathered the raw data with which we will analyze the estimated duration of the project. It is quite possible that at a later stage of the project we'll go back and update this data sheet.

Now let's take a moment and get out from our project's scope and think big.
Objective: Build a costs matrix for each part / module of the project. Let's see an example:

Easy Medium Difficult
Preliminary Design 2 5 10
Detailed Design 2 5 12
ERD creation 1 1 1
ETL - Dimension 1 3 7
ETL - Fact 1 6 15
Reporting Model 1 6 10
OLAP Project 1 3 7
Reports 0.3 0.5 1
OLAP Reports 0.3 0.5 1.5
Web Site 2 10 20
Security 3 6 10

The numbers are in terms of days (not hours). For example, the construction of report in medium difficulty level will take about half a day. It is important to emphasize that this is actual time and not including any buffers, but it is including unit tests. The buffers will be added later (those who read the book "Critical Chain" know how catastrophic it is to add the buffers now). Let's go on: we'll take the first process that we consider implementing and write its quantities matrix:

Process #1
Easy Medium Difficult
Preliminary Design 1
Detailed Design 1 1
ERD creation 1
ETL - Dimension 1
ETL - Fact 1 1
Reporting Model 1
OLAP Project 1
Reports 1
OLAP Reports 1
Web Site 1
Security 1

Pessimist: Wait a minute. I learned about the WBS methodology in which I take any part / module in the project and break it into its atomic elements. What's the difference?
Miky: Nothing at all and very much. We also break the big problem into smaller parts, but here we utilize the fact that the components are identical. Every project is composed of those parts - design, ETL, dimensions, cubes, reports etc.

After we multiplied the matrixes and summarized the result we got the estimated cost for the first process. In our case we get 53 (after rounding) days. Let's call this number M. Now we add a few more fixed costs of our process:

  • Developer Tests - 6
  • User tests - 2
  • Installation - 2
In addition to these costs there are also constants to be added to the equation:
  • Gross - 1.3 (vacations, illnesses, etc.)
  • Management - 1.15 (time spent on the project manager's job)
  • Buffer - 1.1
These fixed values are the result of experience and of course they will be updated along time and the various projects. We'll multiply M with the constants, and then we add the fixed costs. Now we got 97.16 days. Divide by 20 (work days per month) and we get 4.8 months. It's the cost for the implementation of the first process. It should be emphasized that this number is in terms of work, which means theoretically that if you assign two people to work on the project it will last only 2.4 months. It is theoretically and not exact for the simple reason that there are constraints and dependencies in the development process (let's not go into that now).
Doing this for every process, we'll get the estimated time for every process to implement.
Pessimist: Sorry to bother, but... Is it not too much work just to get the pricing? A skilled manager will guess pretty much the same numbers, right?
Miky: First of all - not sure. Second and more important - this method gives the manager and the entire team a common language for schedule and pricing. So it's important to write the matrix with the developers or at least review it with them. It is important to make sure the team knows the matrix and its implications for the schedule. In addition, this method also provides a common language with the customer. If the client wants to understand why the project will take so much time / money you can sit down and explain. If the client wants to reduce costs you can talk with him in terms of Dimensions and Facts, make a simulation and see how much the reducting of a one difficult dimension will spare the entire project cost. Nice, isn't it?
 
Part Two - Prioritization
 
As before, we'll start with the basic questions: how we decide which process we'll implement first? Will it be the process that the CEO wants to see implemented first? Maybe the process that will provide the maximum benefit for the entire organization? Perhaps the easiest and the most simple?
Now that we have the data in frost of our eyes, we repeat the question from which we started: how we decide which one we'll implement first? Here's the answer:
Explanation: The numbers in parentheses is the work in terms of months. Axis X is ease of implementation (not difficulty, for convenience) and it is obtained by weighting the data reception complexity from the source systems and complexity of the process (you can weight these numbers with your own custom formula). The Y axis is the priority, where I gave a factor of 0.8 to the HQ and 0.2 to the field's priority, but even here you can weigh as appropriate for your organization. This graph/map is definitely a product you can show your client or decision-makers and sit down together to decide where to start. Conventional wisdom says to start from the upper-right quadrant (most important and easiest) but of course there are additional considerations that do not appear here. For example, there is a tiny and unimportant consideration such as ...How much time do we have???
 
Is it the end of our work?
Of course not! The costs matrix requires maintenance, but over time it will become more and more accurate and stable. The project manager should monitor the progress schedule to know how long or rather how much work spent in each stage of the development. Sometimes we'll find that the way we written the quantities-matrix is not good and in that case we will come and update it. In addition, over time we'll find that the values in the pricing matrix are decreasing due to specialization and skills acquisition of the development team.
 
Thanks for Arkady Gurevich, my team master who guided me in implementing this idea and helped the project in general.
 
Now that we decided with which processes we'll start the project, we can move forward. See you in future posts where we start to plan the HR BI system.
 |  | 
Tuesday, October 12, 2010 11:46:45 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, October 03, 2010

I can see that many developers asked on the forums how to dynamically hide column in OBIEE answers. You can find some answers there but they're obviously not elegant. In this post I'll show my implementation to solve this problem. As always, I must mention that I use version 10.1.3.4 of OBIEE and that I strongly recommend first reading my previous posts regarding OBIEE, especially the post where I'm explaining how to add Javascript to your OBIEE answers.

Read carefully the attached JS file and understand it. You must edit the numberOfResultTablesInPage variable and add your custom logic which will define which columns need to be hidden. I added my example there. The code is very simple: every time our logic determined that a columns need to be hidden, call the function hideColumn(colNumber, resultTable). Pay attention to the order of the calls to this function. Always start from the higher-numbered column.

Enjoy.

dynamicColumnHiding.js (1.77 KB)
 |  | 
Sunday, October 03, 2010 9:16:49 PM (Jerusalem Standard Time, UTC+02:00)