This post is the fourth in a series about HR BI systems. In the previous post I've shown to advanced model of the solution, the ERD of the datawarehouse and I shown how to answer most of the business questions using that model. In this post I'll show a more complex issue which is smart records filtering and Super Transactions. I will introduce you to this subject and show you how I implemented it.
To better understand this post, I highly recommend you to read the previous post first. Let's summarize the last post: We reached to the conclusion that in order to answer most of the business question regarding HR, we need to build a very wide Fact table (meaning that it will have many fields) that contains all the relevant fields from the appendixes. I introduced you to the problem of irrelevance and I shown the way to solve it using the N/A value instead of null when the appendix hasn't started yet.
Actually, after doing what I described in the last post we finished 90% of the project. Now, only the other 90% remains...
The thing is that another main requirement remains (although basically it was our idea). The reason for which I didn't mention it in the previous posts is that I didn't want to bring in this complex issue in the first steps. I wanted you, my reader, to stay focused on the main structure of the solution. The requirement is to clean and improve (from the data point-of-view) the personal diary records and identify personal processes. I will explain later what this means. Note that at this point the example of the university is not so relevant so I apologize for the weird examples.
Cleaning Records
For example, let's say that we have one lecturer (or more) that has swift changes in his salary: During the month his salary changes but in the end of the month it always goes back to the original salary, meaning that he always gets the same salary in his bank account. I know it sounds weird, but I've seen a lot more confusing cases in financial systems. If we'll bring the lecturer's data to the DWH as it is, we'll have a number of serious problems:
-
In case that the user filter the report on the date where the lecturer has the temporary salary, the user will get a wrong result because he won't see the "real" salary ("real" is a problematic word in this case. We'll discuss it later).
-
The rapid changes in the lecturer's salary appendix will cause great growth in its diary. This is problematic because of disk space, but the main problem is that the diary that will be shown to the end-user will be "junky" with a lot of unnecessary records which will prevent the end-user from seeing the true changes in the diary. In other words, we want a cleaner personal diary.
-
On the other hand, there are other users (especially business analysts) that want to analyze exactly these cases so we can't just delete the records...
Super Transactions
Super Transaction, or ST, is a personal process or change in the man's life which reflects some appendixes and can happen on more than one day. For example, let's say that we have a tutor that got a promotion and becomes a lecturer. By doing that, he changes his Position (tutor -> lecturer), Courses (now he teaches other courses) and the most important of all - the Salary. These changes won't have to happen exactly on the same day. Its position can be changed when he got the announcement but the salary will change only in the end of the month. From the technical point of view, we can say that the man only had some changes in some of his appendixes. But from the business point of view, he has a serious career change aka he been over a process, or a ST.
The requirement is to identify and flood these STs and even let the users investigate and filter by them (You can read more about it in Kimball's book).
So, how we'll make it real? In particular, how we'll let the users see both the original and the cleaner data? First things first, so let's talk about the big picture - the model of the solution.
General Model
To fulfill these requirements we'll define three work modes. The user can switch between them at any time.
- Full Mode - In this mode, the user see exactly the data of the source systems. The records filtering and the ST detection won't take part here.
- Filtered Mode - In this mode the user will have cleaner records and he will see the ST.
- Integrated Mode - The user will see both the original data and the filtered records, so he'll can compare them and have a better understanding of how the system works.
Let's wear the hat of the DWH analyst. If we'll want to add a filter to the table according to the work mode that user has chosen, we'll have a major problem. Every record can be belonging to more than one mode. A record that won't be filtered belongs to every mode. A filtered record only belongs to the filtered and the integrated mode. ST record (we'll talk about it later) also belongs to the filtered and the integrated mode. The naive solution says that we'll add three indicators (boolean columns) that will say for which work mode this record belongs and we'll filter the table using these columns according to the work mode the user has chosen. The problem is it will cause great performance problems so we have to create indexes for these columns. But, creating indexes for boolean records is not very affective because the variance is extremely small.
The solution we found is to split the diary into three tables, one table for every work mode. When the user changes the work mode, the connection will change and the system will take the data from another table. The great benefit in this solution is the performance gain. The query is mapped on the appropriate table and we don't need to filter according to the work mode. In order to implement this, you need a reporting tool that can dynamically change the connection, but I believe that any reporting tool can do it after some workarounds. In Reporting Services, you can do it by defining a Stored Procedure as the Data Source. The SP will get the required schema/table as a parameter and use it in the query (you can see code here). In OBIEE, you can do it by using value-based fragmentation (instructions here). The disadvantage is that you take unnecessary disk space, but we prefer performance over space.
The ETL process now has three targets, one for each work mode. The filters inside the ETL process define in which table, or work mode every record will be loaded.
Cleaning Records
Let's start from the most important thing: It's very important to make the customers understand that our BI system is not a artificial intelligence system and it can't guess which records need to be filtered. We need to make clear that the system will work by predefined rules. This is why the preliminary step is the most difficult: We need to define the rules, test them, improve them, test again and so on. A good example is that: If a person changes its status (in almost any appendix) and then changes it back or change it into another status, this is a "dirty" record which needed to be cleaned. This simple rule can filter a lot of noise in the data. It happens a lot of times that a user of the source system changes something, see that he has mistaken and turning things back. The problem is that some of the system only add rows and don't update them is such cases. This is why it makes sense to filter these records.
After we make sure that we and the customers understand each other, we can start with the implementation. This can be very easy or very difficult, depends on the ETL tool you're using and especially on the complexity of the rules you defined. While the ETL tools usually look on one record at a time, we need to compare the record with its predecessor, the record after it and sometimes even more. In this step, I recommend you to stop, take a paper & pen and start to think how to implement it before you're opening your ETL tool. The most important thing is to check that the results satisfy your expectations. Maybe you'll have performance problems here, but you can come back and handle them in the final steps of the project (personally, I prefer dealing with data correctness in the first steps and performance in the last steps). As I said, after you decide (in the ETL process) whether it's a regular, filtered or ST record, you'll route it to the appropriate target(s).
Super Transactions (ST)
This is the most difficult part of the project. Take a deep breath.
I'll remind you what it is from the business view and this time also from the technical point of view. ST is a personal change, or process that happened to a man. This change will often be reflected in more than one appendix and can happen on more than one day. Remind yourself with the tutor that became lecturer. The salary change happened weeks after the position change. The requirement is to flood that a big change happened here and not only some changes in some appendixes.
Let's go on with the technical stuff. There are some basic questions regarding ST:
- What will happen with the records that are contained inside the ST? contained record is a record which is a part of the change (salary change or course change in our example). If all the changes happened on the same day, will we mark the record as a ST record or maybe we'll add a new ST record? If the changes happened on more than one day, will we add a new ST record or maybe we'll mark one of the records as the ST record? Which one? What will happen with the rest of the contained records?
- What will happen with records that occurred on the same time as the ST but don't belong to it? For example, let's say that when the tutor became lecturer he also became a father (Mazal Tov!). The diary will have a new record for this change, but it doesn't belong to the ST. What we'll do with this record?
- Will we run the ST detection before or after the record cleaning? It will change the results dramatically.
Enough with questions. let's start with the answers. We thought and got to the conclusion that there is a strong connection between the work modes and the ST.
We defined the full mode as the mode where the data is exactly the same as the source data. This is why we won't clean records here and we won't mark any ST records. We want to keep the balance with the source systems. Our greatest threat is that the users will say: "Hey, this is not what I see in my (source) system. This whole BI is crap!!!"
On the other hand, the filtered mode is where our system becomes smarter. Here we can add the record filtering and the ST detection. This is why here we'll mark ST records. Why mark and not add a new record? Because this will break our most basic rule - one man cannot have two records with the same date. This is not just a technical problem (we can add Is_ST column and add it to the table's primary key). This is a conceptual principal. We added a new column named ST_Name.(If it's null then the record is not a ST record).
Which record will we mark as the ST record? Very simple: As part of the design stage, we'll decide (along with the customers) for each ST which is the leading record and we'll mark it. Often, this is the record that will help us to detect the ST. Going back to our example, the ST record is the one where the tutor changed its position. In this record, we'll make ST_Name=Position_Change. Don't forget to mark the contained records as filtered and in the what_changed columns in the ST record, add the other contained appendixes. Back to the example, Write in the what_changed column that also the courses and salary changed as well. We'll also update these appendixes' column in the ST record with the new values.
What about not-contained records (such as the family status change)? Our solution completely covers that. The ST detection only handles with the appendixes that we defined as contained in the ST. Even if the change happened in the same date as the position change, it won't affect anything at all.
Regarding the order of the actions (ST detection or records filtering) - We thought about it and came to the conclusion that it is better to first clean the records because the ST detection process needs to see the best and most clear picture as it can for a better ST detection. We've been glad to see it's right in the tests we did.
Summary
The record cleaning and the ST detection that the BI system does is a quantum leap. The system does not only reflect the source data, but it also add an "artificial intelligence" layer which helps the user navigate the mixture of data. As a matter of fact, this makes the system turn data into knowledge. On the other hand, it doesn't forces itself on the end-user and it enables him to see the original source data. It's important to say that the design is a dynamic and "breathing" process that doesn't end on Day 1 of the system. As time goes by, things and business rules are getting clearer and better so we come back and sharp the rules.
In the next post (the last of the series in my opinion) I'll discuss even more interesting issues such as differential load, how to find ST in the design phase and some of the most interesting reports you can do with our HR DWH. Stay tuned.