Monday, January 21, 2008
Today I was in the Microsoft's Data Mining Conference which took place in the Sheraton City Tower, Ramat-Gan (Israel, of course). First of all - the food was good. :-) Now, seriously: All the lectures were great, although they were performed by one man - Rafal Lukaweichi, which is a very talented speaker. I think that I haven't seen such enthusiasm for many years in a lecture in the IT world. Anyway, what have I learned today?
  • The Data Mining world is very interesting indeed. Microsoft has a lot to offer in DM and it is all ready-to-use in BIDS.
  • Microsoft's approach is DM to the masses, which I don't believe it myself. Even though the tools are very simple and even the code (DMX) is easy (in contradiction to MDX), I don't think that an inexperienced developer can bring good results. The SAS approach says that you need to have deep knowledge in statistics (which is bad), but I don't think that DM can be made by the masses.
  • There are many different DM algorithms which you need to be introduced with before you start mining. As I mentioned, they tell you that all you need to know is what each algorithm does in general, but in fact there are many parameters which you need to adjust and play with, so in fact you need a good knowledge of these algorithms.
  • Visualization is very important in DM. Even after you have good results in your hands, you need good UI tools to show you the results in an efficient way or else you'll be lost in a jungle of data.
  • If you already have a datawarehouse, you're half way from mining models. The preparation of the data is a huge amount of the job in DM.
  • After you have good results and even after you got good visualization of the results, you need an expert from the company you work for (or in) that will look at the results and tell you whether they bring new knowledge or they trivial.
I don't believe there's a change I'll be mining in the near future, but maybe I'll play with it a little in my free time (which of course I don't have). Taking the data of our datawarehouse and mining it can bring some interesting stuff. Who knows.

Tuesday, January 22, 2008 3:06:14 AM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, January 15, 2008

Although SSAS willl let you use them, some other application such as Panorama won't function properly. I'll give one example: When you perform Drillthrough in Panorama, on the fly the engine gererates a web page that will take the user to the next view. The next view will be sliced as the current view, so this web page needs to pass the dimensions parameter (meaning - the current slices). That's why this web page contains this line:

AttachParameters("Slicers","%Slicers%")

A big problem will occur if one of the sliced dimensions will be sliced on a member with a name that contains inverted commas ("). The JavaScript will result an error because there are three inverted commas in the second parameter of the line. This is only one example of what can happen in a BI consumer program if you'll use special characters in member names. So - be careful not to pass these characters from the DW (build the ETL so it will drop these characters) or giving those names in SSAS, such as the All member name.

Tuesday, January 15, 2008 11:08:32 PM (Jerusalem Standard Time, UTC+02:00)
 Tuesday, January 08, 2008

I feel like I don't have the right to write about it after so many bits of information were moving in the web about this subject but I would like to add my point of view (or in fact - my point of code). First of all, I must mention some of those who written about this subject before me. Mosha Pasumansky wrote a long post about it in last may. This post contains some ideas of how to come along with this problem, but none of them is perfect. In fact (as always) - there is no perfect solution for this problem. Another important source of knowledge can be found here in the MSDN forums, where Chris Webb, Thomas Pagel and others discussed it. Now, I would like to add my solution. Take it or leave it - your choice.

First of all, create a column in the time dimension that will be the current day indicator. Thanks to the Data Source View or the UDM approach of SSAS 2005 you don't have to change the relational table itself. Just add a named calculation in the DSV with expression that will be 1 for the row of the current day and null or zero for the others. The expression syntax itself depends on the underlying DB so I won't write it here, but it's very simple. Add this column as an attribute (let's call it CurrentDayInd) in the dimension structure and select its AttributeVisible propertiey to false. That's because we don't need such Attribute Hierarchy in our time dimension. After that, create a new User Hierarchy (you can call it Current Day), where the first level in it will be CurrentDayInd and after it place the day (key) attribute.

Now, what we have here? We have a hierarchy (Current Day) with two members - 1 and zero. The 1 member has only one child which is the current day. Link that member to your regular hierarchy (it's called YSQMD at my place) and here you have it. For example, you can use it that way in the MDX script:

Create Set [Last 30 Days] as
LinkMember([Time].[Current Day].[CurrentDayInd].&[1].Children.Item(0),[Time].[YSQMD])
:
LinkMember([Time].[Current Day].[CurrentDayInd].&[1].Children.Item(0),[Time].[YSQMD]).Lag(30);

I think that this solution is good and elegant. First of all, you don't have to use external functions such as Now(). The second pro is that we use the native OLAP mechanism which saves time and makes the queries run faster. Believe me, you'll feel the difference with big cubes. The last thing is that this solution is easy to understand (at least I think so) and it is easy for maintenance. The big con is that you have to process this dimension (and related cubes) every day. I don't think that it's so bad because most of the organizations do make process every day.


 

Wednesday, January 09, 2008 5:39:58 AM (Jerusalem Standard Time, UTC+02:00)
 Monday, January 07, 2008

For the last months I rejected over and over again invitations to Facebook. The reason is that I think that it's just a big waste of time. I finally surrender when I figured out that I'm missing many things that happens on the virtual world. For example, my friend published pictures of my daughter and everyone seen it except... me. Everyone talked about how cute she is in those pics, people who I don't know say hello to her with a smile and I don't know who they are and how they know Renana...

After joining the social network I've seen for myself the huge waste of time. Instead of making this network a platform for distant people to connect to each other, I see that everybody is busy with adding many stupid applications to their profile pages, annoying each other with nonsense (I'm saving an alien, raising a pet, you name it) and playing endless quizzes.

I decided that after filling my profile with minimal amount of apps I will calm down and use Facebook to do what it's meant to do - connect with my friends. Friends - Stay in touch !

Tuesday, January 08, 2008 5:50:50 AM (Jerusalem Standard Time, UTC+02:00)

In the previous parts (1, 2) I showed how to connect Informatica with MS-OLAP, meaning that a mapplet can process cube or dimension. The thing is that I focused on the side of MS-OLAP. In the second part I even wrote the T-SQL code itself. Now I want to close the loop by describing what's going on in the Informatica side. This part was made by my friend, Alex, who permitted me to post here about what he did.

First of all, there's a table which contains the parameters to call with to the MS-OLAP procedure (object id, type, user name, etc.). This table is the source (& source qualifier, of course) of the mapplet. Each row in this table calls the stored procedure in the MS-OLAP side (in fact, the procedure is part of the relational DB, but never mind now). The call to the SP is made with Informatica's Procedure block. The connection is a regular ODBC connection, as mentioned in the previous part. Now for the interesting part: In the mapplet, the result of the procedure (zero for success, one for failure) goes into a Java Transformation block. This java block will fail the mapplet if one or more procedure calls returned failure.

How to do this java block? Double click on it to enter its properties. Go to the "Java Code" tab. There you'll see tab for every event in this block's life cycle. Here is the code for every tab (only the relevant ones):

Helper Code:

static int errorCounter = 0;
static Object lock = new Object();

On Input Row:

if (returnValue != 0)
{
 synchronized(lock)
  {
   errorCounter++;
  }
}

On End of Data:

synchronized(lock)
{
 if (errorCounter > 0)
  {
   failSession("OLAP Objects failed");
  }
}

Note that:

  • I'm not sure that the lock mechanism is required here. sync, lock, semaphore, etc. mechanisms are often used when atomic write is needed in order to solve problems like deadlocks, mutual write, etc. Here I simply don't care. Even if two parallel threads will read the errorCounter as zero and they both will increase it to one (when in fact it needs to have the value of two) it won't be a bug because the session will fail anyway. Alex & I need to talk about this point...
  • failSession is a function which is part of Informatica's API. As you might guess, it will fail the whole mapplet.
  • Very important: Calling all the MS-OLAP objects at once will cause an error in the Analysis Services server and all the objects will be in the state of Unprocessed. The Informatica side has to call the dimensions first and only then the cubes. The cubes must not be called all at once if they have relationships between them. This will cause deadlock too.
Tuesday, January 08, 2008 5:35:17 AM (Jerusalem Standard Time, UTC+02:00)
 Saturday, January 05, 2008

The Panorama NovaView Desktop program can't always deal with huge crossjoins. The reason is that this program is written in VB6 - very old platform for client programs. One thing you can try is to go to the crosstab properties and in the Advanced tab, click on the "Optimize huge crossjoins". The problem is that this won't always help. The best solutions I've found so far is to go to the Web Access site (or click on the IE button in the desktop program) and there you can choose the size of the chunk of data you'll receive on every click. Starting with 100 rows in the first chunk, this may help you with huge crossjoins.

 | 
Sunday, January 06, 2008 5:48:11 AM (Jerusalem Standard Time, UTC+02:00)