
One of the basic requirements of a modern BI system or Data Warehouse is to be non-volatile, meaning that the DWH will save the history. In this post I'll explain how to implement it in the DWH while assuming that the history can be changed and (the more interesting part) how to implement it in Analysis Services.
Before starting with history saving in the DWH, I need to say that this has nothing to do with Slowly Changing Dimensions. You can work with Type 1, 2 or 3. All you need to take care of is that the fact table will be consistent with the dimension tables. Assuming that the history can change (this is the scenario in my organization), we need to extract all the fact data every time. If you work with staging level/environment/layer you can use it to calculate which records has been changed or inserted and only upsert them to the DWH itself (it's more secure). In our case, we'll work without such staging level so we need to load all the fact data in every load to the DWH.
Loading the fact in the ETL is very simple - we'll save the datetime of the execution start, round it and put it in every row in the fact table. We can call this field CreateTime. You probably ask yourself why we need to round it. The answer is that we don't want the time dimension to have every possible value of datetime. It's huge and the users don't need that kind of accuracy. That's why I round it in my projects to half-hour, meaning that 14:27 becomes 14:00 and 14:50 becomes 14:30. Let's look at an example of such fact table:
|
RowNum |
Dimension1 |
Dimension2 |
Measure1 P> |
Measure2 P> |
CreateTime |
|
1 |
X |
A |
4 |
20 |
03/03/2010 15:00 |
|
2 |
X |
B |
56 |
19 |
03/03/2010 15:00 |
|
3 |
Y |
B |
2 |
3 |
03/03/2010 15:00 |
|
4 |
Y |
B |
1 |
10 |
03/03/2010 15:00 |
|
5 |
Y |
C |
7 |
5 |
03/03/2010 15:00 |
|
6 |
X |
A |
4 |
20 |
03/03/2010 15:30 |
|
7 |
X |
B |
56 |
40 |
03/03/2010 15:30 |
|
8 |
Y |
B |
2 |
3 |
03/03/2010 15:30 |
|
9 |
Y |
B |
10 |
10 |
03/03/2010 15:30 |
|
10 |
Y |
C |
7 |
5 |
03/03/2010 15:30 |
Here we can see two executions of the ETL: The first between 15:00 and 15:00, the second between 15:31 and 15:59. Note the differences in rows 2 vs. 7 and 4 vs. 9. This means that the source table(s) has changed between the executions.
In the DWH, we create foreign keys from Dimension1 and Dimension2 to their dimension tables. We'll also create FK between CreateTime and the time dimension. I won't get into the time dimension design here, you can read about it a past post. I assume that the time dimension has the hierarchy Year -> Month -> Day -> Hour -> Half-Hour. It seems that now we have a good fact table for use in Reporting Services (or any other reporting tool). Create a parameter called Extraction Time/ETL Time and assign it values from distinct on CreateTime. The user can pick a date and the report will filter on this date. Perfect.
The problematic part starts in Analysis Services. Only one extraction time can be the current member. There is no sense in summing or aggregating more than one extraction time. There are two ways to deal with it:
The wrong way - I'm writing this down so you'll learn from this
design mistake. We created a calculated member which takes the last extraction time:
Create Member CurrentCube.[Extraction Time]. [Calendar].[All].[Last Data Time]
AS
Tail ([Extraction Time].[Calendar].[HALF HOUR].Members,1).Item(0),
Visible = 1;
In every Panorama view, we sliced the Extraction time on the [Last Data Time] member, which will show the user the data of the last ETL run. Our basic assumption was that the user wants to see the current data and when he'll want to see the historical data he will pick another member from the Extraction time dimension. This is the point we missed and it soon enough bumped into our face.
We forgot that every system, even BI system which is built for analysts should be intuitive and idiot-proof. The users picked members from the day level in the Extraction time dimension. When we asked them why they did it they said that "we wanted to see the data from 03/03/2010 and suddenly the numbers got bigger!". That happened, of course, because the member 03/03/2010 is the sum of 03/03/2010 15:00 and 03/03/2010 15:30. The solution is that no matter which member the user picks in the Extraction time dimension, he should see only one leaf (half-hour level) member at a time. This takes us to
The right way - We set the [Last Data Time] member to be visible = 0 (I kept it for debug purposes) and I written MDX script which I call NonLastEmptyLeaf. This script makes sure that no matter which member in the Extraction time hierarchy will be picked, the user will be sliced only on the last non empty leaf. For example, if the user picks 03/03/2010 he will see the data of 03/03/2010 15:30. This is what will happen also when he'll slice on 03/2010, 2010 and even the [All] member. This is the MDX script:
Freeze([Extraction time].[Calendar].[All]);
Scope(
Descendants (
[Extraction time].[Calendar]. [All],
0,
SELF_AND_AFTER
)
);
this = Tail (
Filter(
Descendants(
[Extraction time]. [Calendar].CurrentMember,
[Extraction time].[Calendar].[HALF HOUR],
LEAVES
),
[Measures].CurrentMember > 0
)
).Item(0);
End Scope;
Now, there is no meaning of [Last Data Time] member because slicing on the [All] member will show the current data, meaning that no slicing at all will show the current data because [All] is the default member.
To conclude, history browsing can be dangerous if we let the users the option to slice on more than one data time. The open question that I have is how can I prevent the user from picking more than one member in this hierarchy in Panorama views? Should I even try to do that or should I count on his minimal common sense?