This post is the third in a series about HR BI systems. In the previous post I've showed the basic design problems. In this post I'll show how we solved them. I recommend reading the previous post before reading this one.
I finished the previous post with a lot of questions and zero answers. I have shown that the classic star-schema won't work here because it can't answer the most basic questions in the HR business world. At this point we decided to stop and make the most reasonable move - use the wisdom of others which dealt with similar problems. We opened the classic book of Kimball, who's considered to be the father of DataWarehousing. I sat for a while and read the HR chapter. Between the lines, I recognized that Kimball had the same problems we did, which brought him to model the DWH as follows:
Instead of classic star-schema, we'll build here a very wide table which we'll call the Diary. This Diary will contain all the appendixes. For example, in our case (look at the previous post) the fields are: ID, start_effective_date, end_effective_date, first name, last name, more static fields (which don't change with time), department, more fields from the department appendix, position, more fields from the position appendix, course, more fields from the course appendix, payroll, more fields from the payroll appendix, IsCurrentRow. The basic logic of this idea is that each row was created because there's a record in one of this person's appendixes which was created in that date (or maybe in more than one appendix). We take the data from the record in the corresponding appendix and bring the relevant data for that date from the other appendixes. For a better understanding, I'll show a little example. Here are the four appendixes: (Note that for the sake of the example, a student/lecturer can have only one course at a time)
| Department Appendix |
| 012 |
01/01/2000 |
History |
| 012 |
01/06/2005 |
Computers |
| Position Appendix |
| 012 |
01/01/2000 |
Student |
| 012 |
01/01/2006 |
Lecturer |
| Courses Appendix |
| 012 |
01/01/2000 |
Middle Ages |
| 012 |
01/06/2005 |
Computer Basics |
| Payroll Appendix |
| 012 |
01/02/2006 |
Basic |
After making the "merge" of the appendixes this is how the Diary will look like:
| ID |
Start Date |
What Changed |
Name |
Payroll |
Department |
Position |
Courses |
End Date |
| 012 |
01/01/2000 |
Department Position Courses |
Miky Schreiber |
|
History |
Student |
Middle Ages |
31/05/2005 |
| 012 |
01/06/2005 |
Department Courses |
Miky Schreiber |
|
Computers |
Student |
Computer Basics |
31/12/2005 |
| 012 |
01/01/2006 |
Position |
Miky Schreiber |
|
Computers |
Lecturer |
Computer Basics |
31/01/2006 |
| 012 |
01/02/2006 |
Payroll |
Miky Schreiber |
Basic |
Computers |
Lecturer |
Computer Basics |
31/12/2999 |
So what have we done here?
We built a diary that combines all the other appendixes. For each record, or point in time, you can see what was the situation of the person from every appendix's perspective. That is, if on date X only the department appendix was changed, you can still see what was the situation in the position appendix. Hence, you can ask business questions that combines several appendixes even when these questions are aggregative. For example: how many students which have the lecturer position get paid basic salary? I invite the reader to read the business questions in the previous post and see how you can answer them using this model. You must understand that with this model you must always fetch only one row per person or otherwise you'll have wrong results. In order to do that you need to filter the table on a particular day (using expression like: where myDate between StartDate and EndDate) or use IsCurrentRow=1 (you can calculate this column in the ETL).
How did we do that?
I won't cover here the entire ETL design, but I'll explain the key principles: First, we build the avenue. The avenue is a small table with only two columns - the ID of the person and the StartDate. We make distinct query on all the appendixes and get, for every person, all the start date from all the appendixes and this is how we build the avenue. Once we have the avenue, we join it with every appendix, and this is how we bring the values from the appendixes themselves. Now, for each start date we have the data of the appendixes where this start date came from. In order to "complete" the data from the other appendixes, we "spread" the missing value from the previous row. For example, on 1/6/2005 the position status hasn't changed so after the join the Position field will be null, so we take the value from the previous row.
The problem of Irrelevance
After making the above, we had a problem that floated very quickly. As always, each column has a foreign key to its dimension table, so the values cannot be null. The problem is that sometimes the values must be null - it happens in the first rows of each person, when not all appendixes has their values yet. For example, look at the payroll column in the "diary" table above. It gets a value only in the last row. We can't have null in this field because we want to have foreign key but we don't have any value yet. What can we do? The solution we found is a "N/A" value which we added in such situations. Note that you need to add a new row with the "N/A" key in every dimension table.
Is that all? of course not! Using this "diary" table we can answer 90% of the business question we faced. This table is basically the whole project. But, there are more challenges we have to solve. I'll leave it for the future posts, where I'll show how we can make the HR BI system smarter.