Monday, March 07, 2011

This post is the fifth (and last, I think) in a series about HR BI systems. In the previous post I've shown how to make the BI system smarter by cleaning records and finding Super-Transactions. In this post I'll complete a few last points and introduce you to some interesting reports I built.

Finding Super Transactions

In the last post, I introduced you the concept of Super Transactions (ST). I explained how to make the ETL process find them in the raw data and how to model them in the DWH and let the end-users use them in their analysis. However, I didn't explain how to find them in the initial design phase.

Objective: Find the material processes that happens to many people.
The main problem: There are many people and we don't know where to start looking! We should think about a common denominator to all the processes in order to find them.
The key observation: People are boring. Meaning, in terms of their statuses or various trailers. Think about your personal diary (in terms of what we talked about in the previous posts), or the "log" of your life. Think how many times you changed your martial status, salary, workplace status, etc. Yes, you have no more than a few records. You change something only once in a couple of years (in the best case).
On the other hand, when an employee undergoes a material change (ST), there are many changes that happens (almost) at once in different appendixes (or subjects). When the student changes his position in the university, he changes in short time his position, salary, courses and more.
Let's show it in a simple graph. The X axis is the time line and the Y axis is the number of appendixes that changed:


This shows us that most of the time our life is static and suddenly you find there are "peaks" that indicates that something "big" happened. This is where we'll look for the STs.
How we'll find them? I'll take you back to the advanced modeling of the project. For every record in the diary we added end_effective_date column which has the date of the next record minus one day. That enable us to write between statements and make our life more simple. Back to our case: using analytic functions, I added a field that says how much time passed between the current and the previous record. As I said, the number are very large in most cases. The cases where there's some records with small numbers are the cases we are looking for. It says that something important happended. Using this method I found eight out of nine (!!) which implemented in the system.

Incremental Load

In an early stage of the project we realized that it will be very "unwise" to load all the people all the time. As I mentioned, most of the people don't change every week. The problem is that incremental load of the diary ("log") of people is not simple at all. The first reason is that we need to update the end_effective_date and ind_current of the last row, but that's not difficult. The second and main reason is that updating and adding one record is not enough: in case that the man was in a middle of a ST we need to re-calculate the ST. This is why we decided to completely delete the records of the people that changed (in one or more appendixes) and rebuild them from scratch. This is why we run the ETL only on for the people that changed. This solution is very simple and easy to implement. All we need to do is to delete these people's records from all three diaries and then run the ETL only on them. Thus, the ETL process is very short (half hour total) in every night and only the first run (on all people) is (very...) long.

Interesting Reports

I'll show here some interesting reports I built for the users in order to stimulate your thinking:

  • Organizational Pyramid - the ratio, over the years, between the number of different populations. For example, the ratio of senior lecturers, lecturers and students. The report is shown as stacked graph bar. The graph answers the basic question: Do we have enough students to make them enough teachers in the coming years? Do we have enough lecturers to make them enough senior lecturers in the coming future? In this picture I built a very similar graph in Excel that shows how many students of every year we have.

  • Students Rift - For every faculty, the gap between the number of students that the university wanted to have and the actual number of students. You can change the time parameter and see the screen as it was on another date. You can also drill-down on every faculty and see the raw numbers.
  • Number of Promotions & Retirements - For every year, how many promotions & retirements were made. You can drill-down the numbers and see the names and details of the people.
  • Grades Graph - Classic line graphs othat shows grades for a lecturer or faculty over the timeline. Excellent tool to discover interesting trends such as professors who write too difficult exams or decrease in the academic level of a faculty.

To conclude: I hope you enjoyed the posts about BI HR. I got some mails regarding these posts and I'm glad to hear that I'm helping people here in Israel and even over the world.

Monday, March 07, 2011 9:31:10 PM (Jerusalem Standard Time, UTC+02:00)
 Wednesday, January 05, 2011

This post is the fourth in a series about HR BI systems. In the previous post I've shown to advanced model of the solution, the ERD of the datawarehouse and I shown how to answer most of the business questions using that model. In this post I'll show a more complex issue which is smart records filtering and Super Transactions. I will introduce you to this subject and show you how I implemented it.

To better understand this post, I highly recommend you to read the previous post first. Let's summarize the last post: We reached to the conclusion that in order to answer most of the business question regarding HR, we need to build a very wide Fact table (meaning that it will have many fields) that contains all the relevant fields from the appendixes. I introduced you to the problem of irrelevance and I shown the way to solve it using the N/A value instead of null when the appendix hasn't started yet.

Actually, after doing what I described in the last post we finished 90% of the project. Now, only the other 90% remains... ;-) The thing is that another main requirement remains (although basically it was our idea). The reason for which I didn't mention it in the previous posts is that I didn't want to bring in this complex issue in the first steps. I wanted you, my reader, to stay focused on the main structure of the solution. The requirement is to clean and improve (from the data point-of-view) the personal diary records and identify personal processes. I will explain later what this means. Note that at this point the example of the university is not so relevant so I apologize for the weird examples.

Cleaning Records

For example, let's say that we have one lecturer (or more) that has swift changes in his salary: During the month his salary changes but in the end of the month it always goes back to the original salary, meaning that he always gets the same salary in his bank account. I know it sounds weird, but I've seen a lot more confusing cases in financial systems. If we'll bring the lecturer's data to the DWH as it is, we'll have a number of serious problems:

  1. In case that the user filter the report on the date where the lecturer has the temporary salary, the user will get a wrong result because he won't see the "real" salary ("real" is a problematic word in this case. We'll discuss it later).
  2. The rapid changes in the lecturer's salary appendix will cause great growth in its diary. This is problematic because of disk space, but the main problem is that the diary that will be shown to the end-user will be "junky" with a lot of unnecessary records which will prevent the end-user from seeing the true changes in the diary. In other words, we want a cleaner personal diary.
  3. On the other hand, there are other users (especially business analysts) that want to analyze exactly these cases so we can't just delete the records...

Super Transactions

Super Transaction, or ST, is a personal process or change in the man's life which reflects some appendixes and can happen on more than one day. For example, let's say that we have a tutor that got a promotion and becomes a lecturer. By doing that, he changes his Position (tutor -> lecturer), Courses (now he teaches other courses) and the most important of all - the Salary. These changes won't have to happen exactly on the same day. Its position can be changed when he got the announcement but the salary will change only in the end of the month. From the technical point of view, we can say that the man only had some changes in some of his appendixes. But from the business point of view, he has a serious career change aka he been over a process, or a ST.

The requirement is to identify and flood these STs and even let the users investigate and filter by them (You can read more about it in Kimball's book).

So, how we'll make it real? In particular, how we'll let the users see both the original and the cleaner data? First things first, so let's talk about the big picture - the model of the solution.

General Model

To fulfill these requirements we'll define three work modes. The user can switch between them at any time.

  1. Full Mode - In this mode, the user see exactly the data of the source systems. The records filtering and the ST detection won't take part here.
  2. Filtered Mode - In this mode the user will have cleaner records and he will see the ST.
  3. Integrated Mode - The user will see both the original data and the filtered records, so he'll can compare them and have a better understanding of how the system works.

Let's wear the hat of the DWH analyst. If we'll want to add a filter to the table according to the work mode that user has chosen, we'll have a major problem. Every record can be belonging to more than one mode. A record that won't be filtered belongs to every mode. A filtered record only belongs to the filtered and the integrated mode. ST record (we'll talk about it later) also belongs to the filtered and the integrated mode. The naive solution says that we'll add three indicators (boolean columns) that will say for which work mode this record belongs and we'll filter the table using these columns according to the work mode the user has chosen. The problem is it will cause great performance problems so we have to create indexes for these columns. But, creating indexes for boolean records is not very affective because the variance is extremely small.

The solution we found is to split the diary into three tables, one table for every work mode. When the user changes the work mode, the connection will change and the system will take the data from another table. The great benefit in this solution is the performance gain. The query is mapped on the appropriate table and we don't need to filter according to the work mode. In order to implement this, you need a reporting tool that can dynamically change the connection, but I believe that any reporting tool can do it after some workarounds. In Reporting Services, you can do it by defining a Stored Procedure as the Data Source. The SP will get the required schema/table as a parameter and use it in the query (you can see code here). In OBIEE, you can do it by using value-based fragmentation (instructions here). The disadvantage is that you take unnecessary disk space, but we prefer performance over space.

The ETL process now has three targets, one for each work mode. The filters inside the ETL process define in which table, or work mode every record will be loaded.

Cleaning Records

Let's start from the most important thing: It's very important to make the customers understand that our BI system is not a artificial intelligence system and it can't guess which records need to be filtered. We need to make clear that the system will work by predefined rules. This is why the preliminary step is the most difficult: We need to define the rules, test them, improve them, test again and so on. A good example is that: If a person changes its status (in almost any appendix) and then changes it back or change it into another status, this is a "dirty" record which needed to be cleaned. This simple rule can filter a lot of noise in the data. It happens a lot of times that a user of the source system changes something, see that he has mistaken and turning things back. The problem is that some of the system only add rows and don't update them is such cases. This is why it makes sense to filter these records.

After we make sure that we and the customers understand each other, we can start with the implementation. This can be very easy or very difficult, depends on the ETL tool you're using and especially on the complexity of the rules you defined. While the ETL tools usually look on one record at a time, we need to compare the record with its predecessor, the record after it and sometimes even more. In this step, I recommend you to stop, take a paper & pen and start to think how to implement it before you're opening your ETL tool. The most important thing is to check that the results satisfy your expectations. Maybe you'll have performance problems here, but you can come back and handle them in the final steps of the project (personally, I prefer dealing with data correctness in the first steps and performance in the last steps). As I said, after you decide (in the ETL process) whether it's a regular, filtered or ST record, you'll route it to the appropriate target(s).

Super Transactions (ST)

This is the most difficult part of the project. Take a deep breath.
I'll remind you what it is from the business view and this time also from the technical point of view. ST is a personal change, or process that happened to a man. This change will often be reflected in more than one appendix and can happen on more than one day. Remind yourself with the tutor that became lecturer. The salary change happened weeks after the position change. The requirement is to flood that a big change happened here and not only some changes in some appendixes.
Let's go on with the technical stuff. There are some basic questions regarding ST:

  1. What will happen with the records that are contained inside the ST? contained record is a record which is a part of the change (salary change or course change in our example). If all the changes happened on the same day, will we mark the record as a ST record or maybe we'll add a new ST record? If the changes happened on more than one day, will we add a new ST record or maybe we'll mark one of the records as the ST record? Which one? What will happen with the rest of the contained records?
  2. What will happen with records that occurred on the same time as the ST but don't belong to it? For example, let's say that when the tutor became lecturer he also became a father (Mazal Tov!). The diary will have a new record for this change, but it doesn't belong to the ST. What we'll do with this record?
  3. Will we run the ST detection before or after the record cleaning? It will change the results dramatically.

Enough with questions. let's start with the answers. We thought and got to the conclusion that there is a strong connection between the work modes and the ST.
We defined the full mode as the mode where the data is exactly the same as the source data. This is why we won't clean records here and we won't mark any ST records. We want to keep the balance with the source systems. Our greatest threat is that the users will say: "Hey, this is not what I see in my (source) system. This whole BI is crap!!!"

On the other hand, the filtered mode is where our system becomes smarter. Here we can add the record filtering and the ST detection. This is why here we'll mark ST records. Why mark and not add a new record? Because this will break our most basic rule - one man cannot have two records with the same date. This is not just a technical problem (we can add Is_ST column and add it to the table's primary key). This is a conceptual principal. We added a new column named ST_Name.(If it's null then the record is not a ST record).

Which record will we mark as the ST record? Very simple: As part of the design stage, we'll decide (along with the customers) for each ST which is the leading record and we'll mark it. Often, this is the record that will help us to detect the ST. Going back to our example, the ST record is the one where the tutor changed its position. In this record, we'll make ST_Name=Position_Change. Don't forget to mark the contained records as filtered and in the what_changed columns in the ST record, add the other contained appendixes. Back to the example, Write in the what_changed column that also the courses and salary changed as well. We'll also update these appendixes' column in the ST record with the new values.

What about not-contained records (such as the family status change)? Our solution completely covers that. The ST detection only handles with the appendixes that we defined as contained in the ST. Even if the change happened in the same date as the position change, it won't affect anything at all.
Regarding the order of the actions (ST detection or records filtering) - We thought about it and came to the conclusion that it is better to first clean the records because the ST detection process needs to see the best and most clear picture as it can for a better ST detection. We've been glad to see it's right in the tests we did.

Summary

The record cleaning and the ST detection that the BI system does is a quantum leap. The system does not only reflect the source data, but it also add an "artificial intelligence" layer which helps the user navigate the mixture of data. As a matter of fact, this makes the system turn data into knowledge. On the other hand, it doesn't forces itself on the end-user and it enables him to see the original source data. It's important to say that the design is a dynamic and "breathing" process that doesn't end on Day 1 of the system. As time goes by, things and business rules are getting clearer and better so we come back and sharp the rules.

In the next post (the last of the series in my opinion) I'll discuss even more interesting issues such as differential load, how to find ST in the design phase and some of the most interesting reports you can do with our HR DWH. Stay tuned.

Wednesday, January 05, 2011 6:29:09 PM (Jerusalem Standard Time, UTC+02:00)
 Sunday, November 14, 2010

This post is the third in a series about HR BI systems. In the previous post I've showed the basic design problems. In this post I'll show how we solved them. I recommend reading the previous post before reading this one.

I finished the previous post with a lot of questions and zero answers. I have shown that the classic star-schema won't work here because it can't answer the most basic questions in the HR business world. At this point we decided to stop and make the most reasonable move - use the wisdom of others which dealt with similar problems. We opened the classic book of Kimball, who's considered to be the father of DataWarehousing. I sat for a while and read the HR chapter. Between the lines, I recognized that Kimball had the same problems we did, which brought him to model the DWH as follows:

Instead of classic star-schema, we'll build here a very wide table which we'll call the Diary. This Diary will contain all the appendixes. For example, in our case (look at the previous post) the fields are: ID, start_effective_date, end_effective_date, first name, last name, more static fields (which don't change with time), department, more fields from the department appendix, position, more fields from the position appendix, course, more fields from the course appendix, payroll, more fields from the payroll appendix, IsCurrentRow. The basic logic of this idea is that each row was created because there's a record in one of this person's appendixes which was created in that date (or maybe in more than one appendix). We take the data from the record in the corresponding appendix and bring the relevant data for that date from the other appendixes. For a better understanding, I'll show a little example. Here are the four appendixes: (Note that for the sake of the example, a student/lecturer can have only one course at a time)

Department Appendix
012 01/01/2000 History
012 01/06/2005 Computers
Position Appendix
012 01/01/2000 Student
012 01/01/2006 Lecturer
Courses Appendix
012 01/01/2000 Middle Ages
012 01/06/2005 Computer Basics
Payroll Appendix
012 01/02/2006 Basic

After making the "merge" of the appendixes this is how the Diary will look like:

ID Start Date What Changed Name Payroll Department Position Courses End Date
012 01/01/2000 Department
Position
Courses
Miky Schreiber History Student Middle Ages 31/05/2005
012 01/06/2005 Department
Courses
Miky Schreiber Computers Student Computer Basics 31/12/2005
012 01/01/2006 Position Miky Schreiber Computers Lecturer Computer Basics 31/01/2006
012 01/02/2006 Payroll Miky Schreiber Basic Computers Lecturer Computer Basics 31/12/2999

So what have we done here?

We built a diary that combines all the other appendixes. For each record, or point in time, you can see what was the situation of the person from every appendix's perspective. That is, if on date X only the department appendix was changed, you can still see what was the situation in the position appendix. Hence, you can ask business questions that combines several appendixes even when these questions are aggregative. For example: how many students which have the lecturer position get paid basic salary? I invite the reader to read the business questions in the previous post and see how you can answer them using this model. You must understand that with this model you must always fetch only one row per person or otherwise you'll have wrong results. In order to do that you need to filter the table on a particular day (using expression like: where myDate between StartDate and EndDate) or use IsCurrentRow=1 (you can calculate this column in the ETL).

How did we do that?

I won't cover here the entire ETL design, but I'll explain the key principles: First, we build the avenue. The avenue is a small table with only two columns - the ID of the person and the StartDate. We make distinct query on all the appendixes and get, for every person, all the start date from all the appendixes and this is how we build the avenue. Once we have the avenue, we join it with every appendix, and this is how we bring the values from the appendixes themselves. Now, for each start date we have the data of the appendixes where this start date came from. In order to "complete" the data from the other appendixes, we "spread" the missing value from the previous row. For example, on 1/6/2005 the position status hasn't changed so after the join the Position field will be null, so we take the value from the previous row.

The problem of Irrelevance

After making the above, we had a problem that floated very quickly. As always, each column has a foreign key to its dimension table, so the values cannot be null. The problem is that sometimes the values must be null - it happens in the first rows of each person, when not all appendixes has their values yet. For example, look at the payroll column in the "diary" table above. It gets a value only in the last row. We can't have null in this field because we want to have foreign key but we don't have any value yet. What can we do? The solution we found is a "N/A" value which we added in such situations. Note that you need to add a new row with the "N/A" key in every dimension table.

Is that all? of course not! Using this "diary" table we can answer 90% of the business question we faced. This table is basically the whole project. But, there are more challenges we have to solve. I'll leave it for the future posts, where I'll show how we can make the HR BI system smarter.

Sunday, November 14, 2010 10:11:48 PM (Jerusalem Standard Time, UTC+02:00)
 Wednesday, October 13, 2010

This post is the second one in a series about HR BI systems. The previous post dealt with pricing and prioritization of the project. In this post I will describe the initial modeling process - familiarity with source systems, the main/major problems and their effect on the ERD model.

When starting with a new BI system design, the first question to be asked is: What are the business questions that interests the customer? It depends, of course, on who is the client, but that was covered in the previous post. Unfortunately I can't reveal the content world for whom I built the system and therefore to I'll use a University's content world as example. Almost all the questions that the customers wanted to get answered had totals that combine different subjects. For example:

  • What is the average salary in each department?
  • How many professors and students were in each department in each year in the last five years?
  • What is the ratio of students / lecturers in each department in the last years? (Graph)
  • How many transitions between departments (by students and lecturers) took place over the years? Which department is the most abandoned and which is the most popular?
  • Display events of lecturer / student from admission to the university until this day
This list is only a small part of what our customers wanted to know. Like always, there were no answers to these questions because it required integration between multiple sources. While getting familiar with the content world, I noticed several key points that will be treated later on in the project (and in future posts):
  • There's a very difficult problem of terminology in the organization ("one truth"). For example, the most basic terms - a student and lecturer are not consensus. For example, what happens when a student is also a lecturer? What is the lecturer's department if he lectures in more than one department? If a student moved to another department and returned after a semester is it considered to be one, two transitions or maybe none? Of course, different people had different answers to these questions.
  • All sources came in the same format - appendix. For those who not familiar, appendix is a table containing the ID, date and nature of change (kind of a diary on a given subject, "Nigreret" in Hebrew). That is, each record shows only the new / current situation (but the table is historical). When you want to ask a question relating to only one appendix (salary, for example) there is no problem. The problems arise when asking questions that combine several appendixes (salary and department, for example). There is a real need for carrying out the integration of the appendixes. Well, This is what Datawarehouse meant to do.
  • We did not find the Fact table. On one hand it seems that every appendix is a fact table because it describes changes / events that took place in the world, but ... Where are the measures? On the other hand, appendix looks like a Slowly Changing Dimension (SCD) but ... Where's the fact table? No doubt that this is a Factless Fact.

So we're got familiar with the main questions, the sources of information (same format - easier for us) and the unique attributes of this project. Now comes the hard question - how we'll design the data warehouse? Our initial solution looks like this:

In other words - classic Star Schema. The appendixes are SCD and the Fact table is in the center of the star. The fact will contain only one record per person and it will contain only static data of the person, such as ID, name and date of birth. This model takes us back to the questions about how the linkage between dimensions and fact will be made and what is the measure the fact. We tried to think how we answer the most basic questions required by the system (by the way, this is the way to see the quality of the ERD):

  1. For each lecturer, what is the salary, department and current position? - We'll take from every dimension/appendix the current record of the lecturer. The link between the dimensions will be carried out by the fact (lecturer's record) using the ID column
  2. What is the salary, department and position of the lecturer on date X? - From every dimension we'll take only the record which its starting date is from before X and nearest to X and again we'll link the dimensions through the fact table. Hence, we should calculate analytically (analytical functions) for each record its end date (the day before the start of the next record) and then it's easy to query with those kind of questions by using "between" statements.
  3. Average of the salaries in each department - we'll pull out the list of lecturers in each department (from the department appendix). We'll choose a date that interests us, and for each person we'll pull only the record which contains that date. Then we'll do group by. Ostensibly it's not trivial in every classical reporting tool, but let's think for a moment: what do we do here? Group by members of a particular dimension by another dimension counting, where dimensions are linked through the fact table. Looks fine. So - What's the problem? The problem is that we are trying to aggregate/sum by a dimension's attribute and not by a measure, which is not possible in every reporting tool. (a good solution will work with many tools and not just with dedicated tool for a particular ERD)
  4. Questions involving only one appendix - very easy and do not require datawarehouse at all
  5. Show events from the first day in the university until that day (of a certain person) - not possible in classic report tool. Let's say we want to see the "event log" of the lecturer Y, sorted by starting date in ascending order. This requires a union of all the appendixes / dimensions and sort of all records by starting date, where there may be several different appendix records that start in the same date (even very possible. change in the position usually leads to a raise)
At that point we realized that we needed to retreat. That looked like a good ERD at first but the last question is very basic and our model proved wrong. How do we show the event log which is a combination of the appendixes? How can we identify processes when they occurring, such as increase in the amount of courses and salary which leads to a better position? These are all questions that the system should give them an answer and a classic star schema can't help us here.

What have we done? Continued next post ...

Wednesday, October 13, 2010 11:39:33 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, October 12, 2010

After two months of writing only about OBIEE, it's time to write about what I did besides that. The main reason I have not written on anything else is a load of work on the HR BI system as a manager. In the future posts I will discuss the BI system that have taken my precious time lately and I will review the solution bottom up, i.e. from the pricing stage, going through the data warehouse structure to the implementation of the ETL and reports, of course. In this post I'll write about pricing and prioritization .

Part I - Pricing

I'll start with several leading questions in order to stimulate thinking.
When we start estimating prices of the BI project, how do we do it? The intuition of most of us (so I believe) is to get as much initial information about the project's content world as we can: What are the main problems, what is the result the customer wants to get, etc. Then we estimate the cost of the project, just like that. This rises the question of how accurate the assessment we've got? Furthermore, frankly - what is it based on? Past experience? Who's? Do we know how to explain it to our colleagues? Can update it when there are material changes to the requirements? If yes - how?
The purpose of this post is formalizing, even a little bit, the process of pricing the project. In the second part we will also prioritize.
Pessimist: How can it help? After all, in every case it is estimation. There are a lot of parameters in the project and we can't put every single one into the equation and even if it was possible, most are unknown parameters in the initial phase of pricing!
Miky: That's right, but the approximation can be done in a formal way and based on experience of others. It is much better than just a wild guess of numbers.
So we realized what we need this for. Let's start.

We want to price the project in terms of time (time can also be translated into money, of course). To start, we need to know and be familiar with the various processes in the organization that we can implement. To do this we will have to sit down with key people of the organization to its various parts (departments, sections, you name it) and just sit and learn. After we learned the various processes (converting these "conversations" to a list of processes is the result of experience and talent in information systems and BI systems in particular, So you may want to get help if you are new in this field) open Excel and write a list of all processes in one column. Then, we add the following columns:

  • Content world - column that contains the name of the process or content world
  • Short description - so those who work after us will understand what we meant
  • Times (that are required) - whether the analysis of process is required in the past, present or future (of course there could be a combination of them). This column is critical because any change in this could change dramatically the length of the project
  • Priority in HQ - how much they interested in this process. Number from 1 to 5.
  • Priority in field - how much they (intermediate level managers to the production floor) are interested in the implementation of the process. 1 to 5.
  • Source Systems - which systems are required to bring the information required to analyzing the process
  • Complexity of bringing this data - based on previous column, how complex is to "put your hands on data." The difficulties can be technological (legacy systems, systems with no interface exists, unknown technologies) or political (the DBA of the system does not cooperate, there is already a "mini-BI system" and they don't want to replace it) 1 to 5
  • Complexity of the process - with no relation to the previous column, how much the process is complex. To simplify the matter, what will be the length of the design document? 1 to 5
  • Additional notes
In this simple excel sheet we've gathered the raw data with which we will analyze the estimated duration of the project. It is quite possible that at a later stage of the project we'll go back and update this data sheet.

Now let's take a moment and get out from our project's scope and think big.
Objective: Build a costs matrix for each part / module of the project. Let's see an example:

Easy Medium Difficult
Preliminary Design 2 5 10
Detailed Design 2 5 12
ERD creation 1 1 1
ETL - Dimension 1 3 7
ETL - Fact 1 6 15
Reporting Model 1 6 10
OLAP Project 1 3 7
Reports 0.3 0.5 1
OLAP Reports 0.3 0.5 1.5
Web Site 2 10 20
Security 3 6 10

The numbers are in terms of days (not hours). For example, the construction of report in medium difficulty level will take about half a day. It is important to emphasize that this is actual time and not including any buffers, but it is including unit tests. The buffers will be added later (those who read the book "Critical Chain" know how catastrophic it is to add the buffers now). Let's go on: we'll take the first process that we consider implementing and write its quantities matrix:

Process #1
Easy Medium Difficult
Preliminary Design 1
Detailed Design 1 1
ERD creation 1
ETL - Dimension 1
ETL - Fact 1 1
Reporting Model 1
OLAP Project 1
Reports 1
OLAP Reports 1
Web Site 1
Security 1

Pessimist: Wait a minute. I learned about the WBS methodology in which I take any part / module in the project and break it into its atomic elements. What's the difference?
Miky: Nothing at all and very much. We also break the big problem into smaller parts, but here we utilize the fact that the components are identical. Every project is composed of those parts - design, ETL, dimensions, cubes, reports etc.

After we multiplied the matrixes and summarized the result we got the estimated cost for the first process. In our case we get 53 (after rounding) days. Let's call this number M. Now we add a few more fixed costs of our process:

  • Developer Tests - 6
  • User tests - 2
  • Installation - 2
In addition to these costs there are also constants to be added to the equation:
  • Gross - 1.3 (vacations, illnesses, etc.)
  • Management - 1.15 (time spent on the project manager's job)
  • Buffer - 1.1
These fixed values are the result of experience and of course they will be updated along time and the various projects. We'll multiply M with the constants, and then we add the fixed costs. Now we got 97.16 days. Divide by 20 (work days per month) and we get 4.8 months. It's the cost for the implementation of the first process. It should be emphasized that this number is in terms of work, which means theoretically that if you assign two people to work on the project it will last only 2.4 months. It is theoretically and not exact for the simple reason that there are constraints and dependencies in the development process (let's not go into that now).
Doing this for every process, we'll get the estimated time for every process to implement.
Pessimist: Sorry to bother, but... Is it not too much work just to get the pricing? A skilled manager will guess pretty much the same numbers, right?
Miky: First of all - not sure. Second and more important - this method gives the manager and the entire team a common language for schedule and pricing. So it's important to write the matrix with the developers or at least review it with them. It is important to make sure the team knows the matrix and its implications for the schedule. In addition, this method also provides a common language with the customer. If the client wants to understand why the project will take so much time / money you can sit down and explain. If the client wants to reduce costs you can talk with him in terms of Dimensions and Facts, make a simulation and see how much the reducting of a one difficult dimension will spare the entire project cost. Nice, isn't it?
 
Part Two - Prioritization
 
As before, we'll start with the basic questions: how we decide which process we'll implement first? Will it be the process that the CEO wants to see implemented first? Maybe the process that will provide the maximum benefit for the entire organization? Perhaps the easiest and the most simple?
Now that we have the data in frost of our eyes, we repeat the question from which we started: how we decide which one we'll implement first? Here's the answer:
Explanation: The numbers in parentheses is the work in terms of months. Axis X is ease of implementation (not difficulty, for convenience) and it is obtained by weighting the data reception complexity from the source systems and complexity of the process (you can weight these numbers with your own custom formula). The Y axis is the priority, where I gave a factor of 0.8 to the HQ and 0.2 to the field's priority, but even here you can weigh as appropriate for your organization. This graph/map is definitely a product you can show your client or decision-makers and sit down together to decide where to start. Conventional wisdom says to start from the upper-right quadrant (most important and easiest) but of course there are additional considerations that do not appear here. For example, there is a tiny and unimportant consideration such as ...How much time do we have???
 
Is it the end of our work?
Of course not! The costs matrix requires maintenance, but over time it will become more and more accurate and stable. The project manager should monitor the progress schedule to know how long or rather how much work spent in each stage of the development. Sometimes we'll find that the way we written the quantities-matrix is not good and in that case we will come and update it. In addition, over time we'll find that the values in the pricing matrix are decreasing due to specialization and skills acquisition of the development team.
 
Thanks for Arkady Gurevich, my team master who guided me in implementing this idea and helped the project in general.
 
Now that we decided with which processes we'll start the project, we can move forward. See you in future posts where we start to plan the HR BI system.
 |  | 
Tuesday, October 12, 2010 11:46:45 AM (Jerusalem Standard Time, UTC+02:00)