Wednesday, September 14, 2011

You must be wondering why this blog wasn't updated for a long time. The reason is that I started a new job. :-) Along with other stuff, I'm playing with QlikView and I'd like to share some interesting ideas.

In QlikView, The list boxes which filters the entire sheet (or document) is a very common thing. I thought that if we have a box list with a small amount of values, it might be more beautiful to replace each value with an image. For exmaple, let's say that we have three divisions in the company. Instead of putting a list-box with their names, we can put three images with their logos and make that a click on the image will filter on these values (very simple with the Actions tab). The problem is that the user won't know on which division the sheet is filtered on. That's why we need to make the other images more transparent. Using macro, that's very easy and the result is awsome:

sub SetTransparentImage
  set image = ActiveDocument.GetSheetObject("ImageIDHere")
  set imageProps = image.GetProperties
  imageProps.Layout.BkgAlpha = 50
  image.setProperties imageProps
end sub

Now, put that in the macro script inside your document. In order to add images to your sheet, add new Text boxes and set their background to the image you want. In the Actions tab, set the desired value on click and also run this subroutine.

Enjoy.
 | 
Thursday, September 15, 2011 12:23:04 AM (Jerusalem Daylight Time, UTC+03:00)
 Thursday, June 23, 2011

The problems: A. Storage costs money, much money. B. Today's storage is very slow and we want fast retrieve of data.

The Solution: Take the commonly-used data and put it in very fast and expensive storage. Take the less used data and put it on slower and cheaper storage. The overall budget should be much lower.

Link (pdf)

Friday, June 24, 2011 3:42:33 AM (Jerusalem Daylight Time, UTC+03:00)
 Wednesday, June 15, 2011

For the google search: Informatica service start and stops

Yeah, that's exactly what happened to me. Everything looked just fine until the last step of the installation, where the installer pings the new integration service, but there is no answer.
Looking at the services manager, I see that the InformaticaService is stopped. When I restart it, it stops after 5-10 seconds. I thought it's because of the service's account settings, but it's not. After some time of search and re-attempts to make it work, this is what I can say if you're stuck as I was:

  • While you're in the installation wizard, on the node configuration step, press the "Advances Settings" button. There, you'll see the min. and max. port numbers. Go back to the node configuration and change the port number so it will be in that range (why it wasn't in the first place? ...)
  • The Windows PATH environment variable must contain the informatica server binary folder: <INFORMATICA_INSTALL_DIR>\Server\bin
  • If you're still stuck, look after the service.log file in the informatica directory. Look at the error and google it.
Hope I saved somebody a few hours.

Thursday, June 16, 2011 6:57:43 AM (Jerusalem Daylight Time, UTC+03:00)
 Thursday, June 02, 2011

Yes, and it's great!

Finally, the semantic web is here. This means that from now on, the search engines won't have to guess what is the content of the (our?) web pages. The answer will be there, behind the scenes.
Technically, we (the web developers) will add new properties to our HTML tags that will describe the content of the text. For example, you can add a property to your DIV or SPAN that will tell the search engine that this is an article, creative idea or a critique.

What is the meaning of this for us, the web developers or independent bloggers? That depends. If you want to make your blog (or site) to appear in more relevant search results, you should add these properties. Yes, you'll have to manually edit your HTML or to make a template. Of course you don't have to do this.
For those who want to have a deeper understanding of this subject, go to the common standard site - Schema.org

Friday, June 03, 2011 6:08:33 AM (Jerusalem Daylight Time, UTC+03:00)
 Monday, May 23, 2011
After six years of work with SSAS, I decided to start re-reading the best book on the subject - Microsoft SQL Server 2005 Analysis Services. Why read about 2005 when there are more recent versions? That's because this is the most comprehensive book on the subject and the only one that was written by the SSAS's development team. The book gives you a better understanding of the technlogy and shows you the motivation for every architectual decision of the creators. While reading, I come across subjects that I wanted to post about and I believe that along with reading the book I'll post more.

The book cover

At the university, I was taught that half crazy is more dangerous than crazy. The reason is simple: You know that you should always be careful from the crazy guy, but regarding the half crazy person you don't know when to be careful and when to learn from him. Information systems are the same and especially DWH: There is no half truth or not precise truth. If you know that there's a data problem - bubble it up and don't show you user half a truth (because it's also half a lie).

Every time we bind a DB column to a dimension attibute, a DataItem object is created. The DataItem defines the connection between the conceptual and the physical model. For example, it defines from which table and column the data is taken. Along with other properties, there is one important property called NullProcessing which defines how the server behaves with null data. The default is ZeroOrBlank, meaning that the null data will be shown to the end-user as zero (if the DataType is not string) or empty string (if the DataType is string). The problem is in case where the attribute is a parent in an hierarchy. This is what happened:
We defined "Sales Amount" measure under the hierarchy Category -> Sub-Category -> Product. By mistake, one of the members in the categoty level got null value and it was shown as zero in the cube. The user was used to see the Shoes sub-category under the Clothes category. One day, he seen that the shoes sales amount was decreased dramatically - He seen products under the Shoes sub-category which is under the Clothes category but the sub-total was decreased. In the bottom of the table other products were "hidden" with the Shoes sub-category but with the zero category. The reason is that some of the shoes products were given null categories and in the report the shoes products were split between the Clothes category and the zero category. The user got half truth and he interpreted it as wrong data and that should never happen in a BI system.

What can we do? In my opinion, in most cases we need to set NullProcessing as Error, which will throw an error and prevent us from showing wrong data. This is how we (the IT people) will know about the problem before the end-users.
There are more options to NullProcessing:

  • Preserve - keeps the null value with no replacement. As I shown, this is not recommended. Furthermore, preserve uses more system resources.
  • Unknown Member - replaces the null with the unkown member you defined in the dimension. I recommend using this option only if the users know that the unknown member exists and what it means.
  • Automatic - The server will select the most appropriate behaviour according to the context. I don't really understand what it means and the writers don't explain it, but generally it is recommended to define deterministic behavior to your system.

Does it mean that now we'll go over all our data bindings and change their NullProcessing proprety? Of course not ! We need to think where this behavior has a bad effect and where it doesn't matter. For example, in a key attribute of a dimension which is also a leaf level of an hierarchy, the ZeroOrBlank option is quite good.

Monday, May 23, 2011 8:56:43 PM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, March 29, 2011

We're doing something very useful at our workplace and because I believe in collaboration I would like to share it with you, dear readers. I'm sure it will help a lot of system analysts and developers.

When we write down the design of the ETL process, especially regarding complex fact tables, we find it necessary to write it down as a flowchart. This creates a simple and common language between the designer/analyst and the developer and even with the customer. It's much easier for everyone to read a flowchart rather than reading a long and complex document. Therefore, we create diagrams such as these using Visio (this diagram was created using Google Docs, but it's preety much the same):

I only brought here the beginning of the diagram. The real one is 3 times bigger.
Note that every leaf in the tree has its own number. These numbers indicates the record's "case number". Now for the trick: In the target table (fact, dimension or anything else) we add a new column that contains the number of the case, according to the numbers shown in the diagram. This field is not exposed to the end-user. The pros of this trick are very significant:

  • Already in the early stages of the implementation we can check the statistics, meaning how many times each case happens and see if we got what we expected. This can help us find new things that we (and sometimes even the users) weren't aware of and find bugs in early stage of the development.
  • Each record can "tell its story" and tell us "why it was created". This saves us a lot of effort when we want to question the data and find the source of the problems.
  • During the tests we check two different things: Whether each record was "identified" correctly by the ETL (by checking the case number) and whether the values entered into the other fields are correct. While once we used to check only the values correctness and then find the problems, now we can detect the problems much faster.
After we came up with this idea, I thought of another one: Why won't we expose this new field to the users/analysts and let them "enjoy" this knowledge? Maybe these statistics can also help them with the data analysis? This requires us to build "cases" dimension and add a description for every case number. I don't think that this is a good idea in any circumstances, but you should know that it's an option.

Tuesday, March 29, 2011 11:20:39 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, March 07, 2011

This post is the fifth (and last, I think) in a series about HR BI systems. In the previous post I've shown how to make the BI system smarter by cleaning records and finding Super-Transactions. In this post I'll complete a few last points and introduce you to some interesting reports I built.

Finding Super Transactions

In the last post, I introduced you the concept of Super Transactions (ST). I explained how to make the ETL process find them in the raw data and how to model them in the DWH and let the end-users use them in their analysis. However, I didn't explain how to find them in the initial design phase.

Objective: Find the material processes that happens to many people.
The main problem: There are many people and we don't know where to start looking! We should think about a common denominator to all the processes in order to find them.
The key observation: People are boring. Meaning, in terms of their statuses or various trailers. Think about your personal diary (in terms of what we talked about in the previous posts), or the "log" of your life. Think how many times you changed your martial status, salary, workplace status, etc. Yes, you have no more than a few records. You change something only once in a couple of years (in the best case).
On the other hand, when an employee undergoes a material change (ST), there are many changes that happens (almost) at once in different appendixes (or subjects). When the student changes his position in the university, he changes in short time his position, salary, courses and more.
Let's show it in a simple graph. The X axis is the time line and the Y axis is the number of appendixes that changed:


This shows us that most of the time our life is static and suddenly you find there are "peaks" that indicates that something "big" happened. This is where we'll look for the STs.
How we'll find them? I'll take you back to the advanced modeling of the project. For every record in the diary we added end_effective_date column which has the date of the next record minus one day. That enable us to write between statements and make our life more simple. Back to our case: using analytic functions, I added a field that says how much time passed between the current and the previous record. As I said, the number are very large in most cases. The cases where there's some records with small numbers are the cases we are looking for. It says that something important happended. Using this method I found eight out of nine (!!) which implemented in the system.

Incremental Load

In an early stage of the project we realized that it will be very "unwise" to load all the people all the time. As I mentioned, most of the people don't change every week. The problem is that incremental load of the diary ("log") of people is not simple at all. The first reason is that we need to update the end_effective_date and ind_current of the last row, but that's not difficult. The second and main reason is that updating and adding one record is not enough: in case that the man was in a middle of a ST we need to re-calculate the ST. This is why we decided to completely delete the records of the people that changed (in one or more appendixes) and rebuild them from scratch. This is why we run the ETL only on for the people that changed. This solution is very simple and easy to implement. All we need to do is to delete these people's records from all three diaries and then run the ETL only on them. Thus, the ETL process is very short (half hour total) in every night and only the first run (on all people) is (very...) long.

Interesting Reports

I'll show here some interesting reports I built for the users in order to stimulate your thinking:

  • Organizational Pyramid - the ratio, over the years, between the number of different populations. For example, the ratio of senior lecturers, lecturers and students. The report is shown as stacked graph bar. The graph answers the basic question: Do we have enough students to make them enough teachers in the coming years? Do we have enough lecturers to make them enough senior lecturers in the coming future? In this picture I built a very similar graph in Excel that shows how many students of every year we have.

  • Students Rift - For every faculty, the gap between the number of students that the university wanted to have and the actual number of students. You can change the time parameter and see the screen as it was on another date. You can also drill-down on every faculty and see the raw numbers.
  • Number of Promotions & Retirements - For every year, how many promotions & retirements were made. You can drill-down the numbers and see the names and details of the people.
  • Grades Graph - Classic line graphs othat shows grades for a lecturer or faculty over the timeline. Excellent tool to discover interesting trends such as professors who write too difficult exams or decrease in the academic level of a faculty.

To conclude: I hope you enjoyed the posts about BI HR. I got some mails regarding these posts and I'm glad to hear that I'm helping people here in Israel and even over the world.

Monday, March 07, 2011 9:31:10 PM (Jerusalem Standard Time, UTC+02:00)
 Wednesday, January 05, 2011

This post is the fourth in a series about HR BI systems. In the previous post I've shown to advanced model of the solution, the ERD of the datawarehouse and I shown how to answer most of the business questions using that model. In this post I'll show a more complex issue which is smart records filtering and Super Transactions. I will introduce you to this subject and show you how I implemented it.

To better understand this post, I highly recommend you to read the previous post first. Let's summarize the last post: We reached to the conclusion that in order to answer most of the business question regarding HR, we need to build a very wide Fact table (meaning that it will have many fields) that contains all the relevant fields from the appendixes. I introduced you to the problem of irrelevance and I shown the way to solve it using the N/A value instead of null when the appendix hasn't started yet.

Actually, after doing what I described in the last post we finished 90% of the project. Now, only the other 90% remains... ;-) The thing is that another main requirement remains (although basically it was our idea). The reason for which I didn't mention it in the previous posts is that I didn't want to bring in this complex issue in the first steps. I wanted you, my reader, to stay focused on the main structure of the solution. The requirement is to clean and improve (from the data point-of-view) the personal diary records and identify personal processes. I will explain later what this means. Note that at this point the example of the university is not so relevant so I apologize for the weird examples.

Cleaning Records

For example, let's say that we have one lecturer (or more) that has swift changes in his salary: During the month his salary changes but in the end of the month it always goes back to the original salary, meaning that he always gets the same salary in his bank account. I know it sounds weird, but I've seen a lot more confusing cases in financial systems. If we'll bring the lecturer's data to the DWH as it is, we'll have a number of serious problems:

  1. In case that the user filter the report on the date where the lecturer has the temporary salary, the user will get a wrong result because he won't see the "real" salary ("real" is a problematic word in this case. We'll discuss it later).
  2. The rapid changes in the lecturer's salary appendix will cause great growth in its diary. This is problematic because of disk space, but the main problem is that the diary that will be shown to the end-user will be "junky" with a lot of unnecessary records which will prevent the end-user from seeing the true changes in the diary. In other words, we want a cleaner personal diary.
  3. On the other hand, there are other users (especially business analysts) that want to analyze exactly these cases so we can't just delete the records...

Super Transactions

Super Transaction, or ST, is a personal process or change in the man's life which reflects some appendixes and can happen on more than one day. For example, let's say that we have a tutor that got a promotion and becomes a lecturer. By doing that, he changes his Position (tutor -> lecturer), Courses (now he teaches other courses) and the most important of all - the Salary. These changes won't have to happen exactly on the same day. Its position can be changed when he got the announcement but the salary will change only in the end of the month. From the technical point of view, we can say that the man only had some changes in some of his appendixes. But from the business point of view, he has a serious career change aka he been over a process, or a ST.

The requirement is to identify and flood these STs and even let the users investigate and filter by them (You can read more about it in Kimball's book).

So, how we'll make it real? In particular, how we'll let the users see both the original and the cleaner data? First things first, so let's talk about the big picture - the model of the solution.

General Model

To fulfill these requirements we'll define three work modes. The user can switch between them at any time.

  1. Full Mode - In this mode, the user see exactly the data of the source systems. The records filtering and the ST detection won't take part here.
  2. Filtered Mode - In this mode the user will have cleaner records and he will see the ST.
  3. Integrated Mode - The user will see both the original data and the filtered records, so he'll can compare them and have a better understanding of how the system works.

Let's wear the hat of the DWH analyst. If we'll want to add a filter to the table according to the work mode that user has chosen, we'll have a major problem. Every record can be belonging to more than one mode. A record that won't be filtered belongs to every mode. A filtered record only belongs to the filtered and the integrated mode. ST record (we'll talk about it later) also belongs to the filtered and the integrated mode. The naive solution says that we'll add three indicators (boolean columns) that will say for which work mode this record belongs and we'll filter the table using these columns according to the work mode the user has chosen. The problem is it will cause great performance problems so we have to create indexes for these columns. But, creating indexes for boolean records is not very affective because the variance is extremely small.

The solution we found is to split the diary into three tables, one table for every work mode. When the user changes the work mode, the connection will change and the system will take the data from another table. The great benefit in this solution is the performance gain. The query is mapped on the appropriate table and we don't need to filter according to the work mode. In order to implement this, you need a reporting tool that can dynamically change the connection, but I believe that any reporting tool can do it after some workarounds. In Reporting Services, you can do it by defining a Stored Procedure as the Data Source. The SP will get the required schema/table as a parameter and use it in the query (you can see code here). In OBIEE, you can do it by using value-based fragmentation (instructions here). The disadvantage is that you take unnecessary disk space, but we prefer performance over space.

The ETL process now has three targets, one for each work mode. The filters inside the ETL process define in which table, or work mode every record will be loaded.

Cleaning Records

Let's start from the most important thing: It's very important to make the customers understand that our BI system is not a artificial intelligence system and it can't guess which records need to be filtered. We need to make clear that the system will work by predefined rules. This is why the preliminary step is the most difficult: We need to define the rules, test them, improve them, test again and so on. A good example is that: If a person changes its status (in almost any appendix) and then changes it back or change it into another status, this is a "dirty" record which needed to be cleaned. This simple rule can filter a lot of noise in the data. It happens a lot of times that a user of the source system changes something, see that he has mistaken and turning things back. The problem is that some of the system only add rows and don't update them is such cases. This is why it makes sense to filter these records.

After we make sure that we and the customers understand each other, we can start with the implementation. This can be very easy or very difficult, depends on the ETL tool you're using and especially on the complexity of the rules you defined. While the ETL tools usually look on one record at a time, we need to compare the record with its predecessor, the record after it and sometimes even more. In this step, I recommend you to stop, take a paper & pen and start to think how to implement it before you're opening your ETL tool. The most important thing is to check that the results satisfy your expectations. Maybe you'll have performance problems here, but you can come back and handle them in the final steps of the project (personally, I prefer dealing with data correctness in the first steps and performance in the last steps). As I said, after you decide (in the ETL process) whether it's a regular, filtered or ST record, you'll route it to the appropriate target(s).

Super Transactions (ST)

This is the most difficult part of the project. Take a deep breath.
I'll remind you what it is from the business view and this time also from the technical point of view. ST is a personal change, or process that happened to a man. This change will often be reflected in more than one appendix and can happen on more than one day. Remind yourself with the tutor that became lecturer. The salary change happened weeks after the position change. The requirement is to flood that a big change happened here and not only some changes in some appendixes.
Let's go on with the technical stuff. There are some basic questions regarding ST:

  1. What will happen with the records that are contained inside the ST? contained record is a record which is a part of the change (salary change or course change in our example). If all the changes happened on the same day, will we mark the record as a ST record or maybe we'll add a new ST record? If the changes happened on more than one day, will we add a new ST record or maybe we'll mark one of the records as the ST record? Which one? What will happen with the rest of the contained records?
  2. What will happen with records that occurred on the same time as the ST but don't belong to it? For example, let's say that when the tutor became lecturer he also became a father (Mazal Tov!). The diary will have a new record for this change, but it doesn't belong to the ST. What we'll do with this record?
  3. Will we run the ST detection before or after the record cleaning? It will change the results dramatically.

Enough with questions. let's start with the answers. We thought and got to the conclusion that there is a strong connection between the work modes and the ST.
We defined the full mode as the mode where the data is exactly the same as the source data. This is why we won't clean records here and we won't mark any ST records. We want to keep the balance with the source systems. Our greatest threat is that the users will say: "Hey, this is not what I see in my (source) system. This whole BI is crap!!!"

On the other hand, the filtered mode is where our system becomes smarter. Here we can add the record filtering and the ST detection. This is why here we'll mark ST records. Why mark and not add a new record? Because this will break our most basic rule - one man cannot have two records with the same date. This is not just a technical problem (we can add Is_ST column and add it to the table's primary key). This is a conceptual principal. We added a new column named ST_Name.(If it's null then the record is not a ST record).

Which record will we mark as the ST record? Very simple: As part of the design stage, we'll decide (along with the customers) for each ST which is the leading record and we'll mark it. Often, this is the record that will help us to detect the ST. Going back to our example, the ST record is the one where the tutor changed its position. In this record, we'll make ST_Name=Position_Change. Don't forget to mark the contained records as filtered and in the what_changed columns in the ST record, add the other contained appendixes. Back to the example, Write in the what_changed column that also the courses and salary changed as well. We'll also update these appendixes' column in the ST record with the new values.

What about not-contained records (such as the family status change)? Our solution completely covers that. The ST detection only handles with the appendixes that we defined as contained in the ST. Even if the change happened in the same date as the position change, it won't affect anything at all.
Regarding the order of the actions (ST detection or records filtering) - We thought about it and came to the conclusion that it is better to first clean the records because the ST detection process needs to see the best and most clear picture as it can for a better ST detection. We've been glad to see it's right in the tests we did.

Summary

The record cleaning and the ST detection that the BI system does is a quantum leap. The system does not only reflect the source data, but it also add an "artificial intelligence" layer which helps the user navigate the mixture of data. As a matter of fact, this makes the system turn data into knowledge. On the other hand, it doesn't forces itself on the end-user and it enables him to see the original source data. It's important to say that the design is a dynamic and "breathing" process that doesn't end on Day 1 of the system. As time goes by, things and business rules are getting clearer and better so we come back and sharp the rules.

In the next post (the last of the series in my opinion) I'll discuss even more interesting issues such as differential load, how to find ST in the design phase and some of the most interesting reports you can do with our HR DWH. Stay tuned.

Wednesday, January 05, 2011 6:29:09 PM (Jerusalem Standard Time, UTC+02:00)