Thursday, July 01, 2010

This is the fifth and last post regarding Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the last part of the seminar which discussed SSRS. As you can see, most of the tips are relevent for every reporting tool.

  • Scale-Out instead of Scale-In. Meaning that it's better to have multiple SSRS servers on one DB then bigger and better single SSRS server.
  • One report should not return all the data. Define and build your report to answer specific answer.
  • Use subscriptions. Minimile impact on performance.
  • Use cache executions or at least snapshots.
  • Visualization (see also my post about dashboard design)
    • Use same colors for for the same business units
    • Use images for better data recognition (e.g countries' flags)
  • Monitor the SSRS server. Look at the trace logs, the execution log and the performance counters. You can use Scrubs(free SSRS performance reports).
Thursday, July 01, 2010 7:31:33 PM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, June 22, 2010

This is the fourth post about Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the third part of the seminar which discussed SSIS.

  • Log everything that happens inside the package. Use the SSIS log providers and the event handlers. The effect on performance is not significant.
  • Analyize the log data. You can analize total execution time, SSAS partition processing time and much more.
  • Log detailed error information about rejected rows. You can even log the actual error row.
  • Monitor also the execution of the ETL jobs.
  • Data Profiling is a new feature in SSIS 2008. It's used for analysis of the data's quality in the DWH. Go and learn.
  • Configure everything you can in the package configuration, such as: tasks, containers, variables, connection managers and data flow components. You can use XML config file, config from parent package, registry configuration or even configuration from a table.
  • The Cache Transformation saves data in the cache. It's very useful in lookups.
  • Data types: Make them as "small" as possible (better performance). Be aware of precision issues when using money, float and decimal types.
  • Don't sort in SSIS unless it's absolutely necessary.
  • Sometimes using T-SQL will be faster then running it in SSIS.

Some of the tips are also relevant in Informatica. For example, I always let the DB make the sort whenever possible. Sometimes it's also works with joins. Running packages or procedures can be much more faster than running it in the ETL, but if you look at it from maintenance point of view you'll do it in the ETL. It depends on much factors, such as the time you can have to run the ETL, the knowledge in your team and more.

Tuesday, June 22, 2010 10:40:51 PM (Jerusalem Daylight Time, UTC+03:00)

This is the third post about Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the second part of the seminar which discussed SSAS design, but here I'll focus on near real-time cubes. I decided to write an independent post about near real-time cubes because this subject deserves its own place. By the way, I personally think that this subject is very interesting.

There are several solutions to implement near real-time cubes:

  1. Processing only the last partition. Pros: Trivial and simple. Cons: Might impact users in terms of resources and cache.
  2. Small real-time cube. Pros: Simple and elegant solution. Cons: It's small, so it can't contain much data even if you need to. This is the main reason why people usually don't do it.
  3. Backend processing server. Now it's getting interesting. The backend server is only used for processing. The front-end server(s) is used for querying. Pros: Good performance on query time. Cons: You need to take care of syncing the cubes, using XMLA sync, physical file sync (SAN snapshots) or any other technology. The cache is deleted with every sync.
  4. Flip-Flop. It's like the previous solution, only here we switch between the servers. Pros: It enables you to check your data/cube before making the switch. Good performance on query time. Cons: You need to take care of changing the connection string in the client side or implement NLB solution. It might disconnect users or you need to take care of waiting for the end of the session and only then change the connection string.
  5. ROLAP partition. Pros: No processing at all! Cons: Bad, bad performance.

If anyone have implemented one of these I'll be happy to hear.

Tuesday, June 22, 2010 10:17:34 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, June 14, 2010

This is the second post about Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the second part of the seminar which discussed SSAS design.

  • There's a very common misunderstanding about what is an entity. You can see it when there are a lot of dimensions. If there are "car color", "car manufacturer" and "car creation year" dimensions it seems you've got a problem. If there are dimensions with only key and description attributes (or worse - only one description attribute) then you've got a serious problem.
  • Design to Scale: Often the performance problem are caused by the dimensions and not by the fact/cubes.
  • Don't develop SSAS with hebrew (or any other right-to-left language). Use hebrew only in translations .I've personally got hurt with this. Maintaining MDX with hebrew is hell.
  • Avoid using linked measures. If you have some, consider joining the cubes to one cube.
  • There need to be a balance between the amount of the dimensions and the amount of the cubes. If they are equal, you've got a problem. BIDS will warn you (in the 2008 version) if you have two measure groups with the same dimensionality.
  • Keep the dimension keys as narrow as possible. Dimension keys based on more than one column is not optimal. Use member properties, such as AttributeHierarchyEnabled, AttributeHierarchyOptimized and GroupingBehaviour.
  • Data types is an issue in SSAS. Make sure the data types in the DSV are the same as in the dimension attributes.
  • The SSAS engine works best when the foreign key between the fact and the dimension is based on integer columns. Read more about dimension design here.
  • Attribute relationships and flexible vs. rigid are critical to performance. As I read more and more about SSAS I understand how much it's crucial to performance.
  • Automatic creation of partitions can be a good solution for a wide variety of problems. Note that when deploying your SSAS solution to the server you might override the partitions you created with your code. To solve this, you can use the deployment wizard or to import your SSAS database into solution.
  • Each partition should have up to 50M records or up to 250MB. Optimize the partition query. It's only plain SQL query. Get help from your DBA and get a better performance.
  • 99% Aggregations will not cover 99% of the aggregation possibilities/combinations. The engine chooses statistically-randomly which aggregation to save. Use the usage based wizard.
  • Most of the processing time is the time required for the relational DB to return the results of the SQL query. As before, get help from the DBA, improve this query (note that you can't change it) and get a shorter processing time.
  • To solve MDX performance issues, you need to learn what is the difference between the formula engine and the storage engine. I was suprised to see that many SSAS developers never heard of it.
  • In MDX Scripts, use scoping instead of conditional calculations when possible.
  • Avoid StrToMember or StrToSet functions.
Monday, June 14, 2010 9:34:43 PM (Jerusalem Daylight Time, UTC+03:00)
 Wednesday, March 03, 2010
One of the basic requirements of a modern BI system or Data Warehouse is to be non-volatile, meaning that the DWH will save the history. In this post I'll explain how to implement it in the DWH while assuming that the history can be changed and (the more interesting part) how to implement it in Analysis Services.

Before starting with history saving in the DWH, I need to say that this has nothing to do with Slowly Changing Dimensions. You can work with Type 1, 2 or 3. All you need to take care of is that the fact table will be consistent with the dimension tables. Assuming that the history can change (this is the scenario in my organization), we need to extract all the fact data every time. If you work with staging level/environment/layer you can use it to calculate which records has been changed or inserted and only upsert them to the DWH itself (it's more secure). In our case, we'll work without such staging level so we need to load all the fact data in every load to the DWH.

Loading the fact in the ETL is very simple - we'll save the datetime of the execution start, round it and put it in every row in the fact table. We can call this field CreateTime. You probably ask yourself why we need to round it. The answer is that we don't want the time dimension to have every possible value of datetime. It's huge and the users don't need that kind of accuracy. That's why I round it in my projects to half-hour, meaning that 14:27 becomes 14:00 and 14:50 becomes 14:30. Let's look at an example of such fact table:

RowNum

Dimension1

Dimension2

Measure1

Measure2

CreateTime

1

X

A

4

20

03/03/2010 15:00

2

X

B

56

19

03/03/2010 15:00

3

Y

B

2

3

03/03/2010 15:00

4

Y

B

1

10

03/03/2010 15:00

5

Y

C

7

5

03/03/2010 15:00

6

X

A

4

20

03/03/2010 15:30

7

X

B

56

40

03/03/2010 15:30

8

Y

B

2

3

03/03/2010 15:30

9

Y

B

10

10

03/03/2010 15:30

10

Y

C

7

5

03/03/2010 15:30

Here we can see two executions of the ETL: The first between 15:00 and 15:00, the second between 15:31 and 15:59. Note the differences in rows 2 vs. 7 and 4 vs. 9. This means that the source table(s) has changed between the executions.

In the DWH, we create foreign keys from Dimension1 and Dimension2 to their dimension tables. We'll also create FK between CreateTime and the time dimension. I won't get into the time dimension design here, you can read about it a past post. I assume that the time dimension has the hierarchy Year -> Month -> Day -> Hour -> Half-Hour. It seems that now we have a good fact table for use in Reporting Services (or any other reporting tool). Create a parameter called Extraction Time/ETL Time and assign it values from distinct on CreateTime. The user can pick a date and the report will filter on this date. Perfect.

The problematic part starts in Analysis Services. Only one extraction time can be the current member. There is no sense in summing or aggregating more than one extraction time. There are two ways to deal with it:

The wrong way - I'm writing this down so you'll learn from this design mistake. We created a calculated member which takes the last extraction time:

Create Member CurrentCube.[Extraction Time]. [Calendar].[All].[Last Data Time]
AS
Tail ([Extraction Time].[Calendar].[HALF HOUR].Members,1).Item(0),
Visible = 1;

In every Panorama view, we sliced the Extraction time on the [Last Data Time] member, which will show the user the data of the last ETL run. Our basic assumption was that the user wants to see the current data and when he'll want to see the historical data he will pick another member from the Extraction time dimension. This is the point we missed and it soon  enough bumped into our face.

We forgot that every system, even BI system which is built for analysts should be intuitive and idiot-proof. The users picked members from the day level in the Extraction time dimension. When we asked them why they did it they said that "we wanted to see the data from 03/03/2010 and suddenly the numbers got bigger!". That happened, of course, because the member 03/03/2010 is the sum of 03/03/2010 15:00 and 03/03/2010 15:30. The solution is that no matter which member the user picks in the Extraction time dimension, he should see only one leaf (half-hour level) member at a time. This takes us to

The right way - We set the [Last Data Time] member to be visible = 0 (I kept it for debug purposes) and I written MDX script which I call NonLastEmptyLeaf. This script makes sure that no matter which member in the Extraction time hierarchy will be picked, the user will be sliced only on the last non empty leaf. For example, if the user picks 03/03/2010 he will see the data of 03/03/2010 15:30. This is what will happen also when he'll slice on 03/2010, 2010 and even the [All] member. This is the MDX script:

Freeze([Extraction time].[Calendar].[All]);
Scope(
  Descendants (
   [Extraction time].[Calendar]. [All],
   0,
   SELF_AND_AFTER
  )
);
  this = Tail (
   Filter(
    Descendants(
     [Extraction time]. [Calendar].CurrentMember,
     [Extraction time].[Calendar].[HALF HOUR],
     LEAVES
    ),
    [Measures].CurrentMember > 0
   )
  ).Item(0);
End Scope;

Now, there is no meaning of [Last Data Time] member because slicing on the [All] member will show the current data, meaning that no slicing at all will show the current data because [All] is the default member.

To conclude, history browsing can be dangerous if we let the users the option to slice on more than one data time. The open question that I have is how can I prevent the user from picking more than one member in this hierarchy in Panorama views? Should I even try to do that or should I count on his minimal common sense?

Wednesday, March 03, 2010 9:34:48 PM (Jerusalem Standard Time, UTC+02:00)
 Sunday, February 14, 2010

I knew for a long time that measures can have text and not only numbers, but last week I had a scenario where it was the perfect solution for my problem.

Let's say that I work in number of jobs. In every job I got some tasks that I need to do. The fact table looks like this:

Job

Task

Tasks To Do

Tasks Completed

Developer

Develop BI System

1

1

Developer

Help Friends

1

0

Manager

Manage

1

1

Therapist

 

0

0

Summarizing the "Tasks To Do" column will get the total tasks that I need to do. Summarizing the "Tasks Completed" column will get the tasks that I did. Note that in the Therapist job I didn't get any tasks.
In the OLAP cube, I created a calculated measure which is the percent of the completed tasks (for every job or for any other dimension). The problem is: What is the answer in case that there are no any tasks to do? Some will say 0% and some will say 100%. That's why a text measure such as "There are no tasks" can be a perfect solution for this kind of problem. This is the calculation of the calculated measure:


Create Member CurrentCube.[Measures].[Completed Percentage] as
IIF(IsEmpty([Measures].[Tasks To Do]),
  null,
  IIF([Measures].[Tasks To Do] = 0),
    'There are no tasks',
    [Measures].[Completed Tasks] / [Measures].[Tasks To Do]
  )
)
This scenario can happen in many cubes. Tell me if you have any other solution for this kind of problem.

Update: Read the important comments below.

Sunday, February 14, 2010 9:49:00 PM (Jerusalem Standard Time, UTC+02:00)
 Monday, January 04, 2010

For documentation purposes: When you get "Unexpected error occured: Error in Application" error in SSAS while trying to view the cube's calculations tab, you'll need to close BIDS and do some file copy tasks.

Monday, January 04, 2010 6:22:08 PM (Jerusalem Standard Time, UTC+02:00)
 Monday, November 09, 2009

In both SSIS and Informatica, the union transformation doesn't delete duplicate rows. In SSIS it's clear because the transformation is called "Union all", which reminds us the union all statement in SQL which doesn't remove duplicate rows. The confusing part is in Informatica, where the transformation is called Union although it doesn't remove duplicates. One of my teammates got confused this week so I thought it's worth blogging about.

One more tip about ETLs with union transformations: When you'll investigate, maintain or fix ETL it will be much help if you'll know where each row came from. That's why I recommend adding a column named "src" or something like that and store there the source of the row before the union happened.

Monday, November 09, 2009 11:41:31 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, October 27, 2009

Lucky for me, I have the opportunity to work with more than one ETL tool in my daily work. In the far past, I worked much with DTS and PL/SQL packages (which is an ETL, but not exactly a tool). Nowadays I work both with SSIS and Informatica so I can compare them in several aspects. I don't think that one is better than the other. It's just that they have different approaches to the ETL mechanism. Anyway, I always expect that the data transformations will behave the same in every tool and I found one transformation that behaves differently between the tools - the Router.

The Informatica's Router, if you're not familiar with, receives a data row and routes it a specific data flow. Its parallel transformation in SSIS is the Conditional Split. As I said, I always thought that they behave the same. In both transformations you provide boolean expressions that determine where the row will be routed. The difference is that in SSIS, "each input row can be sent to only one output, that being the output for the first condition that evaluates to true" (from SSIS's documentation). In programmer's words, it behaves like switch-case statement with break in each condition. In Informatica, "If a row meets more than one group filter condition, the Integration Service passes this row multiple times" (from Informatica's documentation). Back to the programmer's language, it's like switch-case statement with no break inside the conditions.

If you know other transformation that behaves different between ETL tools, I'll be happy to know.

Tuesday, October 27, 2009 11:10:34 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, April 07, 2009
Very easy. In Sql Server:
Select top n * from my_table
In Oracle:
Select * from my_table where rownum <= n

It can be very useful in many many cases. For example, you're designing DWH over a system and you're looking at a certain field in one of its tables. You want to know which values this field contains, but fetching "select distinct my_field from my_table" takes too much time. Instead, if you believe know that the data is well distributed, you can use "select distinct my_field from my_table where rownum <= n". Use 1000 for n in the first trial and add one zero in the end of the number n every time until you got a query that takes too much time than you want to wait. after you got the n you can live with, can use the values you have in your query result.

Wednesday, April 08, 2009 5:54:28 AM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, March 22, 2009

There are some things you can only learn in the hard way. It didn't happenned to me personally but to my team friends, so I consider it as it is my bad.

We upgraded our ETL tool - Informatica, from version 8.5 to 8.6. We had to run some tests to see that the results are the same. So, what we did is to save the result table from 8.5 in Excel, save the result table from 8.6 in Excel and then compare them using Excel-built-in functions. The only problem is we didn't pay attention to the places where zero and null interchanged. This happened because the two versions act differently where null values take places in aggregation functions. For example, when there is a sum function and it aggregates only on null values. In one version the output is zero and in the other the output is null...

This can also happen in other tools and technology. For example, in OLAP cubes the difference between zero and null is the difference between seeing the member of the dimension on the screen and not knowing of its existence.

For conclusion, always be aware to this point and don't forget to check it.

Monday, March 23, 2009 6:09:59 AM (Jerusalem Standard Time, UTC+02:00)
 Saturday, January 03, 2009

You should know that when you run a program by starting a job with a CmdExec step, the directory in which the program is running in will be c:\<windows dir>\system32. How can this affect you? For example, I created a .Net console application that has a settings file with it. When I ran it using the SQL Server Agent, it couldn't find the settings file (worse - it used the default settings and that caused many trouble finding the problem). After some research, I found that it's looking for it in the directory I mentioned.

Sunday, January 04, 2009 6:24:58 AM (Jerusalem Standard Time, UTC+02:00)
 Wednesday, December 03, 2008

Today I had a very disturbing coincidence.
My friend Ariel worked on a SSAS solution with no version control (we're using VSS). Instead of using that, he developed by opening the database on the server. I told him that he must fix it and we must have a recent version-controlled solution. In the past we asked Microsoft support how to do that (we lost all our vss files and had only the databases). They simply said that it's not possible. Ariel has found today that it can be made very easily using File -> New Project -> Import Analysis Services Database, as you can see in the picture:

Thursday, December 04, 2008 5:09:58 AM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, November 18, 2008

This is a little bit tricky. Unlike the AdomdClient assembly, the AdomdServer assembly  doesn't have a descriptive name. It's called msmgdsrv.dll and it is located in Program Files\Microsoft SQL Server\MSSQL.2\OLAP\bin. Why it's not documented anywhere?

Tuesday, November 18, 2008 11:17:16 PM (Jerusalem Standard Time, UTC+02:00)
 Monday, November 17, 2008

After announcing the MdxInjection program I got several requests for additional details and for the ability to run it without using Visual Studio. So, here are some important points:

  • When I published it I had developers in mind because I'm sure than anyone will want to do his little modification before using it for his own needs. That's why I published it as a solution and not as executable.
  • I written it down using VS2008 but only with the .Net 2 framework. Those of you who uses VS2005 won't be able to open the solution.
  • The program has only one public method - InjectMdx, who takes two arguments: The location of the CommonMdx file and the location of the xml configuration file.
  • The CommonMdx.mdx file contains the common MDX script. The relevant part has to start with /* Common MDX */ and then the common mdx script. Anything written before it won't be treated. That gives you the ability to save some data or comments for yourself in this file.
  • Example of the configuration xml file can be found in the Test libary inside the solution. Basically, it enable you to define in which servers, databases and cubes you want to inject the common script. Pay attention that you have to write the connection strings in this file.
  • Note that the program will detect cube dimensions with their name changed and will know how to replace them. That means that if you mention the Time dimension in the common script and inject it to AdventureWorks cube, the script will replace the string "Time" with the "ShipmentDate" string, for example.

For those of you who want simple execution file, I added a windows console project in the solution.

Link to only executable program
Link to the solution with the added windows application project
Link to the solution without the windows application project
Tuesday, November 18, 2008 6:54:17 AM (Jerusalem Standard Time, UTC+02:00)

In the previous post I talked about the DRY principle in the BI Development. I mentioned that one of the major problems in the principle's implementation is in the common MDX code. Chris Commented:

"I'd like to be able to have a global MDX Script and be able to do something like a #include to bring calculations into specific cubes. One to add to my wishlist for the next version..."

And as I said there that I have a temporary good solution until we'll have it in the next SQL Server release (if someone from Microsoft is reading...).

The MdxInjection program takes your common MDX Script and a very simple xml file that defines where to inject this script. It injects the script into your desired cubes and even replaces the dimensions' names where necessary (it is relevant where you put dimension in a cube with a different name to thr dimension or when you use Role Playing Dimensions). I couldn't hold myself from writing some test code so it's also included in the project. The project is written in C# 2 using much AMO code. All the technical little details are inside.

Enjoy.

Download Link

Monday, November 17, 2008 8:29:02 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, September 22, 2008
My friends were stuck with a totally weird bug this week. After a day of frustration they called me for the rescue. It took me some time to figure it out and I think that every SSIS developer (and maybe every developer) can learn a thing or two from others' mistakes.

The mission: The data flow takes one table with duplicate rows and copies it to another table and makes sure that every row will appear only once. In the way, the data flow also adds some irrelevant fields. Among them is the Create_User and Create_Date fields which tells by who and when the package last ran.
How my friends did it: Again, it's a very simple flow. They only added Derived Column transformation to add the new fields and then they added an Aggregate transformation to make every row appear only once.

Note that this is not the real package. It's a sample I did on my machine to show it here.

The Bug: When I first seen this it seemed to me very simple flow and I asked myself how can it be that this is happening:

As you can see, it seems that the Aggregate transformation is not deterministic. Sometimes it outputs 99 rows, sometimes 198 and in some other times I get other results as well.
Investigating: I wanted to see what's the difference between the table that I got in the first time (99 rows) and the table I got in the second time (198 rows) so I changed the destination table and compared the two tables. I ran "select * from A where Column1+Column2+... not in (select Column1+Column2+... from B)"-style query but it was no use - it showed me that there were no rows that appeared only in one of the tables. In this step I really started to think (as my friends did) that maybe the Aggregate transformation has something wrong inside... Instead of blaming Microsoft, I decided to think. I needed to see what can make the flow non-deterministic. Then, it hit me.


The only non-deterministic component in the flow is the Derived Column because it has the getdate() function (it may be simple to see here, but in the original package the derived column transformation had many fields). The results of this function may differ in the milliseconds, especially for large tables. Then I looked in the Aggregate transformation and seen that the Create_Date column also was in the Group by operation, meaning that if two rows has different millisecond they will be placed twice in the destination table, although they are the same in every column. That's it, the bug was found. But still, one question remained: Why the query did not show me this? The answer is also simple but tricky to find: In the comparison query I concatenated all the columns in the tables in order to compare the results. When I did this, I casted the Create_Date to nvarchar which truncated the milliseconds.

Conclusions:
  • Pay attention to non-deterministic elements in what you do, whether it's code or ETL process.
  • When you do dummy stuff like checking all the checkboxes in a list - think what are the outcomes.
  • Call Miky when you're desperate.
Monday, September 22, 2008 8:10:48 AM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, August 12, 2008
Found a great site for BI beginners. Learn Microsoft BI has some videos about BI and SSAS which can place you in a good position as a beginner.

Tuesday, August 12, 2008 9:36:52 PM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, August 03, 2008
update: Chris Webb and Mosha commented and made it clear that the reason for this error wrap is the NonEmptyCrossJoin function and nothing else. I also checked and I did not find any other function that wraps underlying errors.

This is something you need to be aware of when you're writing MDX. I don't know whether it's a bug or by-design. I'll be happy
to know (please comment if you know something that I don't).
Consider the following MDX:

SELECT
  NonEmptyCrossJoin
  (
    [Customer].[Customer Geography].[State-Province].&[NSW]&[AU].Children
   ,[Employee].[Employee Department].[Department].&[Sales]
  ) ON 0
FROM [Adventure Works];

The query will return with this error: The Set_Count argument of the NonEmptyCrossJoin function is either negative or larger than the number of sets provided. This is quite reasonable because I written the second argument as a member, where (NonEmpty)CrossJoin expects only sets. So, let's upgrade this member to a set:

SELECT
  NonEmptyCrossJoin
  (
    [Customer].[Customer Geography].[State-Province].&[NSW]&[AU].Children
   ,{[Employee].[Employee Department].[Department].&[Sales]}
  ) ON 0
FROM [Adventure Works];


All I did is wrapping the second argument with {} and we have a set. The query will return 19 columns.
Now, for the interesting part. Let's count the members of this CrossJoin before the we fix it:

WITH
  MEMBER [a] AS
    NonEmptyCrossJoin
    (
      [Customer].[Customer Geography].[State-Province].&[NSW]&[AU].Children
     ,[Employee].[Employee Department].[Department].&[Sales]
    ).Count
SELECT
  [a] ON 0
FROM [Adventure Works];


This return... 0.               
Where is the error?
My guess is that the Count function wraps the error. The NonEmptyCrossJoin returns null and the count of members in null is zero. The meaning of this is that if you'll ever forget to wrap the member with {} you'll always get zero and not an error. This can be very dangerous. Just for the check, running this query after the fix:

WITH
  MEMBER [a] AS
    NonEmptyCrossJoin
    (
      [Customer].[Customer Geography].[State-Province].&[NSW]&[AU].Children
     ,{[Employee].[Employee Department].[Department].&[Sales]}
    ).Count
SELECT
  [a] ON 0
FROM [Adventure Works];


will return 19. This been tested with both SSAS 2005 and 2008 (RC0). The examples here are from RC0.

Be careful with your MDX.

Monday, August 04, 2008 5:28:26 AM (Jerusalem Daylight Time, UTC+03:00)
 Wednesday, July 23, 2008
When you practice on SQL Server on your local machine you don't want that its services will start up with the computer. As I mentioned before, you should declare the startup method of these services as manual (see here). After that, you can build two simple batch files that will start and stop the services. Believe me - it's very comfotable to start and stop the services with only one mouse click. The first batch file (I called it sql.bat) contains only two lines:

net start MSSQLSERVER
net start MSSqlServerOLAPService

The second one (sqlend.bat) looks like that:

net stop MSSQLSERVER
net stop MSSqlServerOLAPService

Note that I only start/stop the SQL Server and analysis services, but you can do whatever you like.

Have fun.

Thursday, July 24, 2008 5:19:08 AM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, July 20, 2008
Sometimes the uninstall process does not succeed or even worse - the "Add or Remove programs" interface does not allow you to uninstall the product because it already uninstalled / doesn't exist / you name it. The problem is that the uninstalled program can't be removed from the list, it can't be uninstalled and it prevents another installation or re-installation. This happens a lot with Microsoft's heavy products such as SQL Server and Visual Studio but it can also happen with other products too.
What can you do?

Here's a small tip: Open the registry editor (Start -> Run -> regedit) and go to the path: My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall.
Under this folder you'll see many folders with GUI names such as {1268CDD4-0FED-3CE6-8A9D-C3B012ABCD8F}. To know what installation it is, look for the key named DisplayName under this GUI folder. In most of these folders you'll see a key named UninstallString. To uninstall this program, copy the value of UninstallString and paste it in the run dialog. This will start the uninstall process.

This trick will not always work, but it can help you a lot. Especially with broken installation of SQL Server.

Enjoy.

Monday, July 21, 2008 6:24:27 AM (Jerusalem Daylight Time, UTC+03:00)
 Monday, May 26, 2008
One more tip about installing the database samples: I believe that installing them is not enough. In order to improve your skills you need to have a deep knowledge of them. Therefore, don't deploy the SSAS project to the server and that's it. Build it yourself. Yes - create a new project called MyAdventureWorks or something like that and build all the objects by yourself. Indeed, this will take time and strength but this is worth this. After you'll do all the tricky things yourself then you really got it in hands. Learn the AW project and be a master.

Monday, May 26, 2008 7:44:47 AM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, May 13, 2008
I started a long conversation about this subject in the MDSN SSAS forum. I think that it's a question and a principal that every advanced MDX programmer should be familiar with.

It all started with a customer that needed a standard deviation aggregation. I thought that it would be simple because there's a StdDev function in MDX, but it turned out that my customer had major plans for me: He wanted this aggregation to act for every dimension he puts on his axis. This means that the aggregation is not defined over a specific dimension (such as date), but the std-dev is defined over the current dimension in the axis.

The solution for this problem consists of a principle and an answer.

The Principle
Aggregation or a measure that is based on the current user's query is bad. This can and will result two users to see different results using the same measure. This will cause confusion and disinformation. The sacred principle of One Truth will be desecrated. Taken from the thread, in Chris Webb's words:

"I quite often see people wanting to write calculations that behave differently depending on the query that's being run, and I always tell them not to do it. You can hack something but it's almost impossible to get it work properly for every single possible query - MDX just doesn't work like that"

In the end I explained that to the user and he agreed. One more reason for his approval is that std-dev often doesn't really says something about the data. In other words, it isn't informative. "The standard deviation is 0.432. That means that... ???"


The Answer
If you (or the customer) still insists on that crazy measure, the following MDX will work.

With
Member [Measures].[RowSTDOrders] as
iif(Count(NonEmpty(StrToSet("Axis(1)").Item(0).Hierarchy.Children,
{[Measures].[Order Quantity]}) as ChildSet) < 2,
Null,
StDev(ChildSet, [Measures].[Order Quantity]))
 
select
[Date].[Calendar Year].[Calendar Year] on 0,
Non Empty [Product].[Product Categories].Members on 1
from [Adventure Works]
where [Measures].[RowSTDOrders]

Thanks for Deepak Puri for this code. Notice that the StrToSet function will cause performance degrade, but this is the only way that the code will also work in MDX script and not only in queries.

P.S
It doesn't matter if you write StDev or StdDev.
Wednesday, May 14, 2008 6:28:38 AM (Jerusalem Daylight Time, UTC+03:00)
 Thursday, May 08, 2008
This tiny thing cost me a minute today, but it may take longer time to some of you, so I'm writing this.

As some of you know, in order to sort a dimension's attribute you need to change to OrderBy property of the attribute. You can make the attribute to be sorted according to other attribute (it's a very common thing in SSAS). In order to do so, you set the OrderBy property to AttributeKey and in the OrderByAttribute property you pick up the desired attribute (the one you want to define the order).

Note that if the first attribute (the one you want to sort) doesn't have attribute-relationship to the second attribute, you won't be able to pick up the second attribute in the OrderByAttribute property. These properties must have an attribute-relationship.
One more thing: You don't have to show the end-user the attribute which defines the order. If you want to hide it just set the property AttributeHierarchyVisible to false. It is a common pattern to make an attribute which will sort another attribute and hide it from the user.

Thursday, May 08, 2008 7:32:59 AM (Jerusalem Daylight Time, UTC+03:00)
 Tuesday, April 22, 2008

One more thing about getting a file from the web/SharePoint and using it as a source in SSIS: If you need to authenticate just change the xml.open command to:

xml.open "GET", URL, false, "user", "password"

where user and password are the user & password that has permissions to the desired file. Note that it is VERY recommended to have an application user, so the password won't be changed in the future. If you don't have such user and you must change your password in the future, do not forget to change it in the script. My tip: add a reminder in your calendar to change the password in the script.

In this point I don't know if you can authenticate using SSL or stronger protocols using VB script.

Tuesday, April 22, 2008 10:20:05 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, March 31, 2008
We got many client requests for the ability to show in their web sites the "last updated" date of the data.
It doesn't matter how you show the data of the SSAS - the customers will always want to know for which date the data is true.
My solution includes a ASP.NET 2.0 web site that uses the AMO class libary. It takes the date from the server and shows it to the user.

What you need to do is:
1. Open a new ASP.NET web site using Visual Studio 2005/8.
2. Add the AMO dll (Microsoft.Analysis Services). You'll find it in the SSAS server.
3. In the already-made default.aspx page, just add one Label.
4. Add a configuration file which will hold the name of the SSAS server. That way, when you install the site from the development environment to the production environment, you'll only have to change this file. Call this file config.xml and write in it the following:
<?xml version="1.0" encoding="utf-8" ?>
<ServerName>YourServerFullNameHere</ServerName>

5. In the code-behind file (default.aspx.cs) write the following code instead of what you already have there:

using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using AMO = Microsoft.AnalysisServices;
using System.Xml;

public partial class _Default : System.Web.UI.Page
{
  protected void Load_Page(Object sender, EventArgs e)
  {
    Label1.Text = GetCubeUpdateDate(Request.QueryString["DBName"],Request.QueryString["CubeId"]);
  }

  private string GetCubeUpdateDate (string dbName, string cubeId)
  {
    using (AMO.Server asServer = new AMO.Server())
    {
      asServer.Connect("Data Source=" + GetAnalysisServerName());
      AMO.Database db = asServer.DataBases.FindByName(dbName);
      if (db == null)
      {
        return "DB Name not found";
      }

      AMO.Cube cube = GetCubeById(cubeId, db);
      if (cube == null)
      {
        return "Cube Name not found";
      }

      DateTime lastProcessed = cube.LastProcessed;
      return lastProcessed.Day.ToString() + "/" + lastProcessed.Month.ToString() + "/" + lastProcessed.Year.ToString();
    }
  }

  private string GetAnalysisServerName ()
  {
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(Request.PhysicalApplicationPath + "config.xml");
    return xmlDoc.GetElementsByTagName("ServerName").Item(0).InnetText;
  }

  private AMO.Cube GetCubeById (string cubeId, AMO.Database db)
  {
    foreach (AMO.Cube cube in db.Cubes)
    {
      if (cube.ID.Equals(cubeId))
      {
        return cube;
      }
    }
    return null;
  }
}

Eventhough the code is self-explained, here are some points referring it:
  • I chose not to include the server name in the web.config file because I like to seperate application-related configuration and web configuration.
  • If you want you can get the cube name from the user (in the query string) and then the code is even shorter - just get the cube like I got the database.
  • I wanted to show the date in the format DD/MM/YYYY, so that's why I did the long return statement in the GetCubeUpdateDate method. If you want to return the date in the MM/DD/YYYY format you can use the lastProcessed.GetShortDateFormat() method.
  • Note that when you publish the web site you need to create a dedicated virtual folder in the IIS.
  • The user uses this site in the following way: All he need to do is to create a frame with this site's address as its source and add it the DBName & CubeId in the query string. In SharePoint it's even easier - the uses only need to create a page shower web part.
enjoy.

Monday, March 31, 2008 11:05:20 PM (Jerusalem Daylight Time, UTC+03:00)
 Wednesday, March 19, 2008
Yesterday, my friend Ilya asked me how to perform average for dates. I explained him that actually, the dates in SQL Server are represented as numbers, where zero is 01/01/1900. All you need to do is to cast the dates to numbers, make average on them and getting the result back as date. Assuming that the date column called MyDate, Here is the code:

Cast(Avg(cast(MyDate as float)) as datetime)

Thursday, March 20, 2008 12:04:23 AM (Jerusalem Standard Time, UTC+02:00)
 Saturday, March 15, 2008
It figures that using simple excel file as a source in SSIS is not so trivial, especially if your source is in the web or in your SharePoint portal. At first you'll think it's easy - just declare the excel source as a url (the url of the excel file, for example) and it will succeed. The problem is that Microsoft let you think it's working. Click on the excel source and you'll see in its properties that the source path is the local temporary internet files, meaning that the source is a local copy which is not up-to-date, so it's worth nothing.
Here's what I tried to do and the final (and successful solution):

1. Use the File System task. It won't work because you can't declare an URI there.
2. In the MSDN forum (I can't find the link right now) they say to write a script, so I also tried this. Using the Script task, I written a code in VB.NET which using the System.IO libary of the .NET framework and copies the excel file (using its URI) to the desired location in the local computer. Running it, I got an error saying that the script can't use URIs...
After trying this I understood that every code or action running in the SSIS context won't work with URIs. I'm not sure I know why Microsoft developers built it that way (or maybe it's just another bug). Anyway, the next step is the solution.
3. Build an executable file that performs the desired copy task. You can't use regular batch (.bat) file because DOS/CmdExec does not know how to work with URIs. So, there are two ways to perform this:
a. Download this and use it as a copier from the web.
b. Use the following code and save it as a Visual Basic Script file (*.vbs):

'GetRemoteBinaryFile.vbs
TheFile = "myExcelFile.xls"
DestFolder = "C:\SSIS_Sources"
URL = "http://mySite/myFolder/myExcelFile.xls"
Set xml = CreateObject("Microsoft.XMLHTTP")
xml.Open "GET", URL, False
xml.Send
set oStream = CreateObject("Adodb.Stream")
Const adTypeBinary = 1
Const adSaveCreateOverWrite = 2
Const adSaveCreateNotExist = 1
oStream.type = adTypeBinary
oStream.open
oStream.write xml.responseBody
' Overwrite an existing file
oStream.savetofile DestFolder & TheFile, adSaveCreateOverWrite
oStream.close
set oStream = nothing
Set xml = Nothing

After you have your file (vbs or exe) you can use the Execute Process Task in order to make the copy. In the task, declare that you want it to run your exe or vbs.  After that, just use a normal Data Flow Task, where the source excel file is in the local computer (the file that was copied in the previous task) and the destination is your desired DB.

Note that:
1. Before executing you must have the excel file already placed in your local computer, meaning that you must make the first copy before the first time you run the package. This is because SSIS performs integrity check before running the package and it checks that the file exists.
2. Even if the copy process is long (because it's coming from the web), don't worry. SSIS work synchroniously, meaning that the Data Flow task will not start until the Execute Process task which copies the file will end.

Enjoy.

Update: I added a post about authentication.

Sunday, March 16, 2008 6:47:26 AM (Jerusalem Standard Time, UTC+02:00)
 Friday, February 01, 2008
This is a good one: When you build a flat file connection to a csv file, you can preview the data. There, there's an option to skip some rows (Data rows to skip). If you'll leave it with a number greater than zero - the process itself will skip these rows!! I still wonder if this bug is By Design or not. If you wish, you can track this bug in Microsoft Connect.

Friday, February 01, 2008 2:00:16 PM (Jerusalem Standard Time, UTC+02:00)
 Monday, January 21, 2008
Today I was in the Microsoft's Data Mining Conference which took place in the Sheraton City Tower, Ramat-Gan (Israel, of course). First of all - the food was good. :-) Now, seriously: All the lectures were great, although they were performed by one man - Rafal Lukaweichi, which is a very talented speaker. I think that I haven't seen such enthusiasm for many years in a lecture in the IT world. Anyway, what have I learned today?
  • The Data Mining world is very interesting indeed. Microsoft has a lot to offer in DM and it is all ready-to-use in BIDS.
  • Microsoft's approach is DM to the masses, which I don't believe it myself. Even though the tools are very simple and even the code (DMX) is easy (in contradiction to MDX), I don't think that an inexperienced developer can bring good results. The SAS approach says that you need to have deep knowledge in statistics (which is bad), but I don't think that DM can be made by the masses.
  • There are many different DM algorithms which you need to be introduced with before you start mining. As I mentioned, they tell you that all you need to know is what each algorithm does in general, but in fact there are many parameters which you need to adjust and play with, so in fact you need a good knowledge of these algorithms.
  • Visualization is very important in DM. Even after you have good results in your hands, you need good UI tools to show you the results in an efficient way or else you'll be lost in a jungle of data.
  • If you already have a datawarehouse, you're half way from mining models. The preparation of the data is a huge amount of the job in DM.
  • After you have good results and even after you got good visualization of the results, you need an expert from the company you work for (or in) that will look at the results and tell you whether they bring new knowledge or they trivial.
I don't believe there's a change I'll be mining in the near future, but maybe I'll play with it a little in my free time (which of course I don't have). Taking the data of our datawarehouse and mining it can bring some interesting stuff. Who knows.

Tuesday, January 22, 2008 3:06:14 AM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, January 15, 2008

Although SSAS willl let you use them, some other application such as Panorama won't function properly. I'll give one example: When you perform Drillthrough in Panorama, on the fly the engine gererates a web page that will take the user to the next view. The next view will be sliced as the current view, so this web page needs to pass the dimensions parameter (meaning - the current slices). That's why this web page contains this line:

AttachParameters("Slicers","%Slicers%")

A big problem will occur if one of the sliced dimensions will be sliced on a member with a name that contains inverted commas ("). The JavaScript will result an error because there are three inverted commas in the second parameter of the line. This is only one example of what can happen in a BI consumer program if you'll use special characters in member names. So - be careful not to pass these characters from the DW (build the ETL so it will drop these characters) or giving those names in SSAS, such as the All member name.

Tuesday, January 15, 2008 11:08:32 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, January 08, 2008

I feel like I don't have the right to write about it after so many bits of information were moving in the web about this subject but I would like to add my point of view (or in fact - my point of code). First of all, I must mention some of those who written about this subject before me. Mosha Pasumansky wrote a long post about it in last may. This post contains some ideas of how to come along with this problem, but none of them is perfect. In fact (as always) - there is no perfect solution for this problem. Another important source of knowledge can be found here in the MSDN forums, where Chris Webb, Thomas Pagel and others discussed it. Now, I would like to add my solution. Take it or leave it - your choice.

First of all, create a column in the time dimension that will be the current day indicator. Thanks to the Data Source View or the UDM approach of SSAS 2005 you don't have to change the relational table itself. Just add a named calculation in the DSV with expression that will be 1 for the row of the current day and null or zero for the others. The expression syntax itself depends on the underlying DB so I won't write it here, but it's very simple. Add this column as an attribute (let's call it CurrentDayInd) in the dimension structure and select its AttributeVisible propertiey to false. That's because we don't need such Attribute Hierarchy in our time dimension. After that, create a new User Hierarchy (you can call it Current Day), where the first level in it will be CurrentDayInd and after it place the day (key) attribute.

Now, what we have here? We have a hierarchy (Current Day) with two members - 1 and zero. The 1 member has only one child which is the current day. Link that member to your regular hierarchy (it's called YSQMD at my place) and here you have it. For example, you can use it that way in the MDX script:

Create Set [Last 30 Days] as
LinkMember([Time].[Current Day].[CurrentDayInd].&[1].Children.Item(0),[Time].[YSQMD])
:
LinkMember([Time].[Current Day].[CurrentDayInd].&[1].Children.Item(0),[Time].[YSQMD]).Lag(30);

I think that this solution is good and elegant. First of all, you don't have to use external functions such as Now(). The second pro is that we use the native OLAP mechanism which saves time and makes the queries run faster. Believe me, you'll feel the difference with big cubes. The last thing is that this solution is easy to understand (at least I think so) and it is easy for maintenance. The big con is that you have to process this dimension (and related cubes) every day. I don't think that it's so bad because most of the organizations do make process every day.


 

Wednesday, January 09, 2008 5:39:58 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, January 07, 2008

In the previous parts (1, 2) I showed how to connect Informatica with MS-OLAP, meaning that a mapplet can process cube or dimension. The thing is that I focused on the side of MS-OLAP. In the second part I even wrote the T-SQL code itself. Now I want to close the loop by describing what's going on in the Informatica side. This part was made by my friend, Alex, who permitted me to post here about what he did.

First of all, there's a table which contains the parameters to call with to the MS-OLAP procedure (object id, type, user name, etc.). This table is the source (& source qualifier, of course) of the mapplet. Each row in this table calls the stored procedure in the MS-OLAP side (in fact, the procedure is part of the relational DB, but never mind now). The call to the SP is made with Informatica's Procedure block. The connection is a regular ODBC connection, as mentioned in the previous part. Now for the interesting part: In the mapplet, the result of the procedure (zero for success, one for failure) goes into a Java Transformation block. This java block will fail the mapplet if one or more procedure calls returned failure.

How to do this java block? Double click on it to enter its properties. Go to the "Java Code" tab. There you'll see tab for every event in this block's life cycle. Here is the code for every tab (only the relevant ones):

Helper Code:

static int errorCounter = 0;
static Object lock = new Object();

On Input Row:

if (returnValue != 0)
{
 synchronized(lock)
  {
   errorCounter++;
  }
}

On End of Data:

synchronized(lock)
{
 if (errorCounter > 0)
  {
   failSession("OLAP Objects failed");
  }
}

Note that:

  • I'm not sure that the lock mechanism is required here. sync, lock, semaphore, etc. mechanisms are often used when atomic write is needed in order to solve problems like deadlocks, mutual write, etc. Here I simply don't care. Even if two parallel threads will read the errorCounter as zero and they both will increase it to one (when in fact it needs to have the value of two) it won't be a bug because the session will fail anyway. Alex & I need to talk about this point...
  • failSession is a function which is part of Informatica's API. As you might guess, it will fail the whole mapplet.
  • Very important: Calling all the MS-OLAP objects at once will cause an error in the Analysis Services server and all the objects will be in the state of Unprocessed. The Informatica side has to call the dimensions first and only then the cubes. The cubes must not be called all at once if they have relationships between them. This will cause deadlock too.
Tuesday, January 08, 2008 5:35:17 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, December 31, 2007

As many of you already know, installation on Microsoft Excel on the SSAS 2005 server is needed in order to use Excel functions in MDX. That's very helpful because MDX is lack of many important functions such as Round (!). Many organizations don't like it at all, but here's something that may help. In the SSAS 2005 server you don't need to install the whole program, only the .Net programmability support. In the installation, choose to manually pick up which components you wish to install and then choose the .Net programmability Support as seen in the picture:

Notice that this issue will not be fixed in SSAS 2008, so this tip will be relevant for a long time.

Monday, December 31, 2007 5:06:00 PM (Jerusalem Standard Time, UTC+02:00)
 Sunday, December 09, 2007

Last week I participated in Microsoft's BI conference in Ra'anana, Israel. After the conference I asked myself: What have I really learned today? Well, here is what I remember:

  • Microsoft figured out that the eternal BI tool is and will be Excel. People just love their Excel sheets and they will stay there. This is why the mission is to bring the BI into their Excel sheets. Their new product - Excel Services, will manage our excel sheets in one central place which is connected to our Analysis Services cubes.
  • In my point of view, SQL Server 2008 is just a bunch of many performance issues and it is not really a new product. There are a lot of new "performance features". For example, most of our MDX queries will run faster, especially those who has null cells. The new Cell-By-Cell calculations performance improvements will cause these queries to run faster. I think that SS2008 could be one big Service Pack. If I'm wrong, please do comment me.
  • SQL Server 2005 has many products that we don't know good enough. Some products that I need to learn about are: Replication, SQL Server Agent, SQL CLR and more. I do know what they do and even played with them a little bit, but I want to know how they can help me and improve my work.
  • Many new features in SS2008 come from two old sources: BIDS Helper (SSAS open source addin) and of-course, Oracle...
  • My big wish - IntelliSense for Analysis Services will not be in SSAS2008 and maybe won't be at all. This is because the guessing is MDX is very hard. There are too many options in every statement you write.
  • We won't need to upgrade to Office 2007 in order to use Excel Services. Only the developers will need it.

This is what I remember for now. I'll update this post if something new will come around in my mind.

Sunday, December 09, 2007 9:26:17 PM (Jerusalem Standard Time, UTC+02:00)
 Sunday, December 02, 2007

In the last post, I explained the architecture of our BI project. The final part of the process is processing MS-OLAP object (cube/dimension) from Informatica mapplet. As explained earlier, the trick is to call Stored Procedure from the Informatica server. But first there is one more thing to do: How do you connect the Informatica server (Linux) with MS-OLAP (windows server)?

Informatica ships with number of drivers that can connect it with other servers. The drivers are called DataDirect and I'll discuss 4.20. You need to define this driver on the Informatica server (look in Informatica knowledge base for more information). This is an easy thing to do. Notice that you have to enter a full server name (including domain) and the password. Remember that if you'll change the password in the future the process will fail. You have to enable the protocol named "Named Pipes" in the MS-OLAP server. How to do this? Enter the Configuration Manager in the MS-OLAP server and in the section of MSSQLSERVER protocols enable the Named Pipes protocol. This will enable the connection from the Informatica server. On the Informatica server, make a regular ODBC connection.

Here is the code of the SP on the MS-OLAP side. This SP must be on the msdb Database on the Database engine.

ALTER PROCEDURE [dbo].[ProcessObject]
@databaseId varchar(100),
@objectType varchar(100),
@objectId varchar(100),
@login_name varchar(100),
@returnValue int output,
@errorMessage nvarchar(1024) output
AS
BEGIN
declare @jobName varchar(200)
declare @xmla varchar(1000)
declare @jobId binary(16)
declare @ReturnCode int
declare @stop int

--Set job name
set @jobName = 'Process' + @objectType + '_' + @objectId

--Delete the job if already exists
if exists (select * from msdb.dbo.sysjobs where name = @jobName)
exec msdb.dbo.sp_delete_job @job_name = @jobName

--Create the job
Exec msdb.dbo.sp_add_job @job_name=@jobName
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'process OLAP object',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=@login_name, @job_id=@jobId output

exec msdb.dbo.sp_add_jobserver @job_name=@jobName, @server_name=@@SERVERNAME

--Declare XMLA for OLAP object
if (@objectType = 'Cube')
set @xmla = '

' + @dataBaseId + '
' + @objectId + '

ProcessFull
'
else if (@objectType = 'Dim')
set @xmla =



' + @dataBaseId + '
' + @objectId + '

ProcessFull


'
else
Begin
set @returnValue = 0
return @returnValue
End

--Add the job step
Exec msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Process Object',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'ANALYSISCOMMAND',
@command=@xmla,
@server=@@SERVERNAME,
@database_name=N'master',
@flags=0

--Run the job
Execute sp_start_job @jobName

Waitfor delay '00:00:05'
set @returnValue = (select run_status from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)

-- Loop until the job ends and return its result
set @stop = 0
if @returnValue is null
while @stop <> 1
Begin
set @returnValue = (select run_status from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)

if @returnValue is not null
set @stop = 1

waitfor delay '00:00:10'
End

--Return error message (if exists)
If @returnValue = 0 --failed
set @errorMessage = (select [message] from dbo.sysjobhistory
where job_id = @jobId
and step_id = 1)
End

update: I see that the xmla code went bad in the post because it is not recognised html code. It doesn't matter, I believe you got the point...

Sunday, December 02, 2007 5:59:42 PM (Jerusalem Standard Time, UTC+02:00)
 Saturday, December 01, 2007

In the past I mentioned some fragments of the architecture of our end-to-end BI solution. Now I'll discuss how it is done. I will only write about the things that I done (I mean, developed) but I'll describe the whole picture.

Our architecture goes like this: Control M -> Informatica -> MS-OLAP (Analysis Services 2005).

In words: ControlM is the most common scheduler in big companies. We use it to schedule our ETL processes in Informatica. Our system team made it possible to start Informatica processes from ControlM. I don't know exactly how it is done. All I know is that ControlM raises a flag in a table, and Informatica scans the table every X seconds and start the process if it finds the flag that was raised by ControlM. Don't ask me about the technical details - it wasn't my job.

The more interesting thing (because I did it...) is how Informatica calls MS-OLAP and tells it to process a cube. In this part I'll describe the big picture and in the next one I'll give some of the code and discuss some technical views of the process. First, the Informatica mapping moves the data from the source to the target, which is the dimension or fact table just like it always does. After that, Informatica calls a Stored Procedure on the MS-OLAP server which process the cube. Informatica calls this SP with some parameters, including the object type to process (cube/dimension), its ID and some more parameters. In return, the MS-OLAP returns return code (in order to point whether the process succeeded) and message describing the error if it occurred.

How the SP process the cube/dimension? Unfortunately, there is no SP that can process OLAP object so I needed to use the following steps in my SP:

  1. Delete any existing job that does the same action (read on, you'll understand)
  2. Create an empty job
  3. Add a step to that job that will process the object. This step contains XMLA code that contains the parameters that were given from Informatica
  4. Run the job
  5. Loop until the job (or process) ends and send back the return code and the error message, if exists.

In the next part I'll write some of the code and discuss some technical issues.

 

Sunday, December 02, 2007 5:19:47 AM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, November 06, 2007
After our first MS-OLAP first installation, I started asking myself some questions: The way we moved our cubes to the production server is by the Visual Studio. We just deployed the cubes into the production server. The problem was that we forgot that we had some changes done in the XMLA code that lies behind one of our cubes. That caused some troubles in the installation and the result was that we did some changes in the XMLA script in the production server after the installation...
So, how we should make the installation? Should we export the whole database and import it in the production server, or should we generate XMLA scripts and run it in the production server? Is there really a difference between these choices?
If someone has an answer I would be very happy to read it. Thanks...

Wednesday, November 07, 2007 4:17:55 AM (Jerusalem Standard Time, UTC+02:00)
It took me a while (more than 15 minutes!) to find something so elementary. The thing is how to call a sleep function in T-SQL in SS2005. The statement is: Waitfor Delay <DelayLength>, where <DelayLength> can be in the format: '00:00:10' for ten seconds. the <DelayLength> can be also a parameter of the type char(8).

Wednesday, November 07, 2007 3:40:13 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, October 29, 2007

As you can see from the title, we use VSS with AS 2005. The reason for this is that we want to have backup and source control in our AS projects. There are some interesting points when working that way which I would like to share:

  • When you process a cube in a VSS-contolled AS project, the compiler writes the exact time and hour of the last process in the .database file in the solution. This file is a small and simple XML file. This piece of data is written in the <LastProcessed> element. That is the reason why VSS will ask to check-out this file every time you will process a cube. You can ignore it and cancel it twice and the cube will be processed anyway. Of course, the last processing time will not be correct.
  • The .dwproj file is the binder of the solution (as in regular code project). That means that Visual Studio will update this file every time you add, delete or rename an object to the solutions. When two (or more) developers doing these king of actions together it will cause a disaster. Only the objects of the developer who checked-in last will be saved in the project. What you need to do is manually edit this file and add the object's elements yourself. That's not very hard, but you'll need to concentrate. As always, I recommend using Notepad++.
  • When editing the .cube file yourself (when merging manually after two developers worked on the same cube), make sure that the dimensions ID's in the .cube file match the ID in the .dim files.
  • AS projects can be easily destroyed because of an error in a VSS-related decision. Think twice when clicking in its annoying dialog boxes. Make a label every day after building, deploying and processing the project.
Tuesday, October 30, 2007 4:03:45 AM (Jerusalem Standard Time, UTC+02:00)

We use VSS with Analysis Services 2005 in order to have source control and backup (I think this deserves a post of its own). We've been shocked to find out that our VSS didn't save our labes and our versions. The defect we experienced is that after clicking on the history button and then on the OK button in the "choose date range" dialog box, nothing happened. We were sure that no history was saved and that is the reason that nothing happened. After a long search in the web, I found this thread in the MSDN. The problem is a bug in VSS that has something to do with dates representation in Microsoft's code. The workaround is easy, but it seems to work only in Windows 2000: Open the VSS admin program, then Options -> Tools -> TimeZone, and set the time zone as none. That's it.

By the way, notice that my second post in my blog was about date formats and how tricky and dangerous they can be.

Tuesday, October 30, 2007 3:43:05 AM (Jerusalem Standard Time, UTC+02:00)

This is the second bug I found when working with AS and Oracle as my database (you can read about the first one here). Some background about our Datawarehouse architecture before I begin to complain about Microsoft:

We cannot afford risking that our users will experience faults or crashes while we refresh our DWH, process our cubes, etc. What we do to solve this problem is called a switch: We have two schemas in our Oracle DB. While our users watch the first schema, we update the other one. Only after we finish all our load process, we switch our user's tools to see the data from the second schema. In order to implement this architecture we use Oracle's synonyms. Let's say that the users watch the fact table "sales". We have a synonym which is called sales_syn. While it's pointing to the first schema (schema_a.sales_fact), we're loading into the second schema (schema_b.sales_fact). After that, we switch the synonym so it will point to the second schema (schema_b). The users always look on views that rely on synonyms. The views never change, only the synonyms do.

The problem starts when we make Named Queries in our Data Source Views in AS. Apparently, AS looks inside the view that we enterd into the DSV, finds the target of the synonym and saves it. Even after we make the switch (the synonym points on the other schema), the named query will be pointing on the first one. All our efforts to edit the named query have failed. When we open it again, we will always find that the first schema is "burned" there and cannot be changed.

The solution we chose is simply avoid using named queries. If we need a simple calculation we can add a Named Calculation in the Data Source View and if we need a complex view over our fact table, we write it ourself in the DB. This solutions breaks a little bit the main point of the data source view (designated place for all the logic of the DWH), but it is the best solution we could think of right now.

Dear Microsoft developers - it seems that you tried to be smart and look into our Oracle objects in order to enhance the multidimensional database's performance. Next time, please think twice before you do.

Tuesday, October 30, 2007 3:24:36 AM (Jerusalem Standard Time, UTC+02:00)
 Sunday, September 30, 2007
This post is about Panorama because it is the UI tool I'm working with, but this can be made with every BI UI tool.

My customer wanted to get the effect shown by Analysis Services 2005 when browsing a dimension (see the picture below). He wanted to see some properties of the members shown in the rows, along with the usual measures. Unfortunately, Panorama (and I'm sure that also other tools) does not have this option in the GIU. The solution is this code:

Create Member CurrentCube.[Measures].[MyProperty] as
  iif(IsLeaf([MyDimension].[MyHierarchy].CurrentMember),
     [MyDimension].[MyHierarchy].CurrentMember.Properties("MyProperty"),
     Null)

Note that declaring only the third row will cause that every member that is not a leaf will cause an error, which is something we don't want the viewer to see. If the dimension has properties for members in other levels too, you can adjust this decleration. This member can be declared either in the DataBase's Script (after the CALCULATE expression) or inside the session/query (not recommended in Panorama). Now, all you have to do is to show the dimension's members in the rows and this new measure in the columns (after or before the regular measures), and you'll get what you want.

Monday, October 01, 2007 4:47:12 AM (Jerusalem Daylight Time, UTC+03:00)
 Monday, September 24, 2007
My friend, Ilya, had a problem in SSIS. He had a .csv file with too many commas. The meaning is that strings that started and ended with inverted commas (") and had commas inside it were recognized by SSIS as new column. For example, the row:
"My name, is Miky", 200, 10 was recognised by SSIS as four columns instead of three. Ilya wrote down a code for SSIS (in VB) that run before the package begin its work. Here it is, hope it will help who ever seen this.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic.FileIO
Public Class ScriptMain
Public Sub Main()
  Dim csvFileFullPath As String
  Dim tabFileFullPath As String
  csvFileFullPath = Dts.Connections("Your CSV Connnection").ConnectionString
  tabFileFullPath = Dts.Connections("Your Table Connection").ConnectionString
  Using tabStreamWriter As New StreamWriter(tabFileFullPath, False, System.Text.Encoding.GetEncoding(1255))
  Using csvFileReader As New StreamReader(csvFileFullPath, System.Text.Encoding.GetEncoding(1255),True)
  Dim currentRow As String

  currentRow = csvFileReader.ReadLine()
  tabStreamWriter.WriteLine(currentRow)
  While Not csvFileReader.EndOfStream
    Dim outputRow As New Text.StringBuilder()
    Dim tmp, tmp1 as String
    Dim offset as Int32 = 1
    Dim beginS, endS As Int32

    beginS = 1
    currentRow = csvFileReader.ReadLine()
    beginS = InStr(offset, currentRow, """")
    While Not beginS = 0 Or offset > Len(currentRow)
      endS = InStr(beginS+1, currentRow, """")
      tmp = Mid(currentRow, beginS, endS - beginS)
      tmp1 = Replace(tmp, ",", " ")
      currentRow = Replace(currentRow, tmp, tmp1)
      offset = endS + 1
      beginS = InStr(offset, currentRow, """"")
    End While
    outputRow.Append(currentRow)
    tabStreamWriter.WriteLine(outputRow.ToString())
  End While
End Using
End Using
Dts.TaskResult = Dts.Result.Success
End Sub
End Class

The solution here is to search for any comma (,) that is between two inverted commas (") and replace it by space.
Although it is a good solution, I would take another solution: Replace any comma by special string, such as &Miky&, convert the csv file into table, and after that go over that column(s) and replace any &Miky& by comma.

Monday, September 24, 2007 7:31:56 PM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, September 09, 2007
I was asked how to get to SSIS log to see how much time took for the package to run.
Well, that depends.
On Development:
When developing new package, after running the process (click on the green arrow or press F5) there's a new tab called Progress. Clicking it will show you everything about the package's execution, including the time it started and the time it finished.

On Production:
When developing the package, open the SSIS menu (Yes, there is a menu called as the product's name. Microsoft...) and click on Logging... There, you can define logs for your package. You can log in many ways: Writing to SQL sever, output file, XML file and more. I recommend logging into SQL server and logging only the big and "hard" parts in your data flow. In the Details tab, pick up only the exceptional events, such as onError, onTaskFailed and onWarning. If you wish to know how much time took for you package to run, also pick up onProgress.

Follow this link to read about every event in SSIS.

Monday, September 10, 2007 6:09:23 AM (Jerusalem Daylight Time, UTC+03:00)
 Sunday, July 15, 2007
When designing a dimension in Analysis Services, there's a funny button called Add Business Intelligence. Clicking it opens a beautiful wizard which letting you define some basic properties of the dimension such as ordering and enabling writeback. I'll take the writeback feature as an example: When doing this using the wizard, it's taking you through many screen where all you have to do except for clicking next, next is to check a checkbox in one of the screens. That's it. After that, all this long wizard doing it setting a property called WriteEnabled to True. I think that it's a strange software design of Microsoft. Maybe it's for making the product seem more professional. You know - Add Business Intelligence sounds like a heavy operation. Anyway, I don't know what they had in mind.

     
Sunday, July 15, 2007 7:48:56 AM (Jerusalem Daylight Time, UTC+03:00)
I never thought that I'll do a commercial to Microsoft, but Project Real is a great thing that they did and they should get the credit for it. This project is a full end-to-end BI solution, including ETLs (using SSIS), Analysis Services cubes and mining modules, Reporting Services reports, end-user Panorama views and more.

We work with Panorama as our main GUI tool to show our users the cube's data as tables, charts, dashboards, etc, so this project is really helping us to learn how to implement our project from the first ETL step all the way to the last Panorama step.

Recommended.

Sunday, July 15, 2007 7:05:54 AM (Jerusalem Daylight Time, UTC+03:00)
 Thursday, May 31, 2007

In the data source view, when you edit the SQL by yourself (right click on the table in the data source view, Replace table -> With new named query) be careful when using Oracle DB tables. When writing in the SQL the Oracle's table name with small casing, the SQL parser will add commas to the table name, making the SQL not work because the Oracle does not recognize this table name (with the commas).

Solution: Enter the table name with big casing, which will make the parser leave the table name as is. Also, remember: when creating or editing a named query always check and syntax, but also run the query and check that you get the desirable result before hitting the OK and saving the new named query.

by the way, I wonder: what were we doing if the Oracle was case sensitive?

Friday, June 01, 2007 6:36:37 AM (Jerusalem Daylight Time, UTC+03:00)

Few months ago we were given an assignment to copy/move all our DTSs that were running with SQL Server 2000 to the new SQL Server 2005 Integration Services (SSIS). My friend Michael did it and wrote some important notes that he discovered when building ETLs with Integration Services. I decided to list them here because they are important and useful, especially for those who haven't got the time to develop with SSIS so far.

  • One of the greatest improvements in SSIS is that between the source and the destination of the ETL process you can do many things, such as making new fields, sorting, converting data types, union all between different sources, implement your logic on a field, and much more. This is much easier than ever because all you need to do is to add a block to the data flow task and define it for your purposes.
  • SSIS ships with a tool for migrating SQL Server 2000 DTSs. Do not use this tool. Sometimes the result of the conversion is not good enough and in all cases you can't edit the new migrated data task.
  • When making a connection to a non-Microsoft DB, such as Oracle, use OLE DB client instead of the out-of-fashion ODBC.
  • When the destination field is shorter (string type) than the source, add a data conversion block and cut the string. Otherwise, there will be an annoying warning even if the truncation is wanted.
  • In many times (when working with non-Microsoft providers) the automatic recognition of the length & types of the source fields is not correct. Enter the source block and edit these properties by yourself.
  • When moving a Unicode field (data type DT_WSTR) to a non-Unicode destination field (DT_STR), a data conversion block is required.
  • SQL Server 2000 Stored Procedures will work in SSIS, but the Linked Servers definitions are problematic. Consider another options rather than using linked servers.
  • When the source/destination is a CSV file, use Flat File Connection. But if it is an Excel file (.xls), use a Microsoft Jet OLE DB connection and define the source as OLE DB Source (yes, it will work with Excel files).

Again, thanks for Michael for making and sharing these notes.

Friday, June 01, 2007 6:17:10 AM (Jerusalem Daylight Time, UTC+03:00)

For some weeks we were fighting with (or against?) the Panorama software in order to make it work right and show a nice pilot of BI dashboard screen to our managers. After three weeks we managed to show a good opening position by building a nice dashboard screen including a map, graphs, gauges and crosstabs.

I found that the installation of the SQL Server and the Panorama server was not good enough in my company, so I decided to try it for myself. The installation of the SQL Server is quite easy (Next, Next, Next ...), but installing the Panorama server is a complicated process. Paying attention to so many small details, knowing and remembering what to do inside the Windows server, IIS, Windows services, Windows registry and more is not so easy. Finally, after two days I managed to do this. Now I know that some things in the installation in my work place were not so good and I can point them out.

You can see the exciting (for me, at least) result in this picture, as it is a nice dashboard taken from my screen. Many posts about Panorama and SQL Server 2005 will come ahead. Now I can relax - Panorama is on the way...

Friday, June 01, 2007 5:09:10 AM (Jerusalem Daylight Time, UTC+03:00)