Tuesday, March 29, 2011

We're doing something very useful at our workplace and because I believe in collaboration I would like to share it with you, dear readers. I'm sure it will help a lot of system analysts and developers.

When we write down the design of the ETL process, especially regarding complex fact tables, we find it necessary to write it down as a flowchart. This creates a simple and common language between the designer/analyst and the developer and even with the customer. It's much easier for everyone to read a flowchart rather than reading a long and complex document. Therefore, we create diagrams such as these using Visio (this diagram was created using Google Docs, but it's preety much the same):

I only brought here the beginning of the diagram. The real one is 3 times bigger.
Note that every leaf in the tree has its own number. These numbers indicates the record's "case number". Now for the trick: In the target table (fact, dimension or anything else) we add a new column that contains the number of the case, according to the numbers shown in the diagram. This field is not exposed to the end-user. The pros of this trick are very significant:

  • Already in the early stages of the implementation we can check the statistics, meaning how many times each case happens and see if we got what we expected. This can help us find new things that we (and sometimes even the users) weren't aware of and find bugs in early stage of the development.
  • Each record can "tell its story" and tell us "why it was created". This saves us a lot of effort when we want to question the data and find the source of the problems.
  • During the tests we check two different things: Whether each record was "identified" correctly by the ETL (by checking the case number) and whether the values entered into the other fields are correct. While once we used to check only the values correctness and then find the problems, now we can detect the problems much faster.
After we came up with this idea, I thought of another one: Why won't we expose this new field to the users/analysts and let them "enjoy" this knowledge? Maybe these statistics can also help them with the data analysis? This requires us to build "cases" dimension and add a description for every case number. I don't think that this is a good idea in any circumstances, but you should know that it's an option.

Tuesday, March 29, 2011 11:20:39 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, March 07, 2011

This post is the fifth (and last, I think) in a series about HR BI systems. In the previous post I've shown how to make the BI system smarter by cleaning records and finding Super-Transactions. In this post I'll complete a few last points and introduce you to some interesting reports I built.

Finding Super Transactions

In the last post, I introduced you the concept of Super Transactions (ST). I explained how to make the ETL process find them in the raw data and how to model them in the DWH and let the end-users use them in their analysis. However, I didn't explain how to find them in the initial design phase.

Objective: Find the material processes that happens to many people.
The main problem: There are many people and we don't know where to start looking! We should think about a common denominator to all the processes in order to find them.
The key observation: People are boring. Meaning, in terms of their statuses or various trailers. Think about your personal diary (in terms of what we talked about in the previous posts), or the "log" of your life. Think how many times you changed your martial status, salary, workplace status, etc. Yes, you have no more than a few records. You change something only once in a couple of years (in the best case).
On the other hand, when an employee undergoes a material change (ST), there are many changes that happens (almost) at once in different appendixes (or subjects). When the student changes his position in the university, he changes in short time his position, salary, courses and more.
Let's show it in a simple graph. The X axis is the time line and the Y axis is the number of appendixes that changed:


This shows us that most of the time our life is static and suddenly you find there are "peaks" that indicates that something "big" happened. This is where we'll look for the STs.
How we'll find them? I'll take you back to the advanced modeling of the project. For every record in the diary we added end_effective_date column which has the date of the next record minus one day. That enable us to write between statements and make our life more simple. Back to our case: using analytic functions, I added a field that says how much time passed between the current and the previous record. As I said, the number are very large in most cases. The cases where there's some records with small numbers are the cases we are looking for. It says that something important happended. Using this method I found eight out of nine (!!) which implemented in the system.

Incremental Load

In an early stage of the project we realized that it will be very "unwise" to load all the people all the time. As I mentioned, most of the people don't change every week. The problem is that incremental load of the diary ("log") of people is not simple at all. The first reason is that we need to update the end_effective_date and ind_current of the last row, but that's not difficult. The second and main reason is that updating and adding one record is not enough: in case that the man was in a middle of a ST we need to re-calculate the ST. This is why we decided to completely delete the records of the people that changed (in one or more appendixes) and rebuild them from scratch. This is why we run the ETL only on for the people that changed. This solution is very simple and easy to implement. All we need to do is to delete these people's records from all three diaries and then run the ETL only on them. Thus, the ETL process is very short (half hour total) in every night and only the first run (on all people) is (very...) long.

Interesting Reports

I'll show here some interesting reports I built for the users in order to stimulate your thinking:

  • Organizational Pyramid - the ratio, over the years, between the number of different populations. For example, the ratio of senior lecturers, lecturers and students. The report is shown as stacked graph bar. The graph answers the basic question: Do we have enough students to make them enough teachers in the coming years? Do we have enough lecturers to make them enough senior lecturers in the coming future? In this picture I built a very similar graph in Excel that shows how many students of every year we have.

  • Students Rift - For every faculty, the gap between the number of students that the university wanted to have and the actual number of students. You can change the time parameter and see the screen as it was on another date. You can also drill-down on every faculty and see the raw numbers.
  • Number of Promotions & Retirements - For every year, how many promotions & retirements were made. You can drill-down the numbers and see the names and details of the people.
  • Grades Graph - Classic line graphs othat shows grades for a lecturer or faculty over the timeline. Excellent tool to discover interesting trends such as professors who write too difficult exams or decrease in the academic level of a faculty.

To conclude: I hope you enjoyed the posts about BI HR. I got some mails regarding these posts and I'm glad to hear that I'm helping people here in Israel and even over the world.

Monday, March 07, 2011 9:31:10 PM (Jerusalem Standard Time, UTC+02:00)