Jul 08 2013
 

This post is for those who moved from Informatica to DataStage, as I did.

When you want to extract data from a single source into multiple flows or targets, forget what you did in Informatica. There, you could simply connect the source to multiple flows and be done. In DataStage, each output of a source stage gets its own tab in the stage’s properties, and defining the same output (table name, WHERE clause, etc.) twice will make the SQL run twice against the source database (!). As I said, the DataStage way is different: create a single output on the source stage and then add a Transformer stage with multiple outputs.

When I first encountered this method, it seemed pretty ugly compared to Informatica. After working with it and gaining some experience with DS, I found it has a great advantage: you can use the Transformer stage to define a different constraint for every output. In Informatica, you had to add a separate Filter for every link. In DS, this “partition logic” is centralized in one place, as the sketch below tries to show.
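Outside DataStage, the same idea can be written down in a few lines of plain Python. This is only an illustration of the pattern (read the source once, then route each record according to per-output constraints defined in one place), not DataStage code, and all the names below are made up.

def route(record: dict) -> list[str]:
    """Return the outputs this record should go to (one 'constraint' per output)."""
    outputs = []
    if record.get("region") == "EU":
        outputs.append("eu_target")
    if record.get("amount", 0) > 1000:
        outputs.append("big_deals_target")
    return outputs

source = [{"region": "EU", "amount": 500}, {"region": "US", "amount": 2000}]
targets: dict[str, list[dict]] = {"eu_target": [], "big_deals_target": []}

for record in source:                  # the source query runs only once
    for name in route(record):
        targets[name].append(record)

print({name: len(rows) for name, rows in targets.items()})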

 Posted by at 9:55 pm
Jun 13 2012
 

DataStage contains a very impressive repository search feature. Say you want to find the jobs that use a certain table: you can do it with a few clicks. The problem is that the repository search only looks at table names typed into the “Table Name” field (in the source or target stages). Usually we use “Custom SQL” and type in our own SQL queries, and DataStage doesn’t scan those when you search the repository for tables. So, what can you do?

The trick I found works in many technologies, not just DataStage. You can export the whole project’s jobs to an XML file (right-click the root of the project -> Export). This XML will probably be big (90 MB in my case), so Notepad will get stuck trying to find an expression in it. Use Notepad++ or TextPad instead and you’ll find your table in seconds, even if it’s inside Custom SQL.
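If the export is too big even for a good text editor, a few lines of script will do the same search. This is just a sketch: the file name and table name below are placeholders, and it simply scans the exported XML line by line.

import re
from pathlib import Path

EXPORT_FILE = Path("project_export.xml")   # the XML you exported from the project root
TABLE_NAME = "MY_FACT_TABLE"               # the table you are looking for

pattern = re.compile(re.escape(TABLE_NAME), re.IGNORECASE)

with EXPORT_FILE.open(encoding="utf-8", errors="replace") as f:
    for line_no, line in enumerate(f, start=1):
        if pattern.search(line):
            print(f"line {line_no}: {line.strip()[:120]}")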

Enjoy.

 Posted by at 8:50 pm
Jun 11 2012
 

The best thing we can get out of our mistakes is learning, but it’s much more fun to learn from other people’s mistakes, so you’d better listen. These two mistakes were made by different organizations, but they could happen to any of us.

First Story

The organization installed Oracle’s BI suite on a cluster consisting of two servers. Oracle Enterprise Manager (EM) is their basic monitoring tool, so they decided to install it “on the way”. The only problem was that they installed it on one of the cluster’s machines. Now you can guess what happened. Unfortunately, the machine running EM went down in the middle of the day, and of course the IT staff wanted to see the error logs. The only problem was that EM was down as well. All they could do was bring the machine back up and look at the logs between failures. The lesson: you must separate the monitoring system from the systems it monitors. They can’t reside on the same physical machine.

Second Story

The other organization built its own monitoring system. When an ETL process starts, it notifies the monitor that it has started, and it does the same when it ends. The monitoring system wakes up every so often and checks for processes that started and haven’t finished for a long time (how long depends on the process, of course). The IT people got used to opening the monitor every morning and checking for processes that didn’t finish. The only problem was with a process that didn’t start at all, so the monitor didn’t know there was a problem in the first place (!!). It took the users a couple of weeks to discover that the data was not up to date, and it wasn’t pleasant. The lesson: you must separate the monitoring system from the systems it monitors. The process shouldn’t notify the monitor; the monitor needs to check the system independently.
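To make the second lesson concrete, here is a minimal sketch of a monitor that decides on its own which processes should have run, instead of waiting for them to report in. It is not the organization’s actual system; the table, columns and process names are all hypothetical.

import sqlite3
from datetime import datetime

# process name -> hour of day by which a run must have started
EXPECTED = {
    "load_sales_fact": 3,
    "load_customers_dim": 4,
}

def check_missing_runs(conn: sqlite3.Connection) -> list[str]:
    """Return the processes with no run recorded today after their deadline."""
    now = datetime.now()
    today = now.date().isoformat()
    missing = []
    for process, deadline_hour in EXPECTED.items():
        if now.hour < deadline_hour:
            continue                     # too early to judge this process
        row = conn.execute(
            "SELECT 1 FROM etl_runs WHERE process = ? AND date(started_at) = ?",
            (process, today),
        ).fetchone()
        if row is None:
            missing.append(process)      # it should have started, and didn't
    return missing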

I’d be happy to read more stories like these. Feel free to comment.

 Posted by at 9:38 pm
Jun 11 2012
 

As you can see, the last post was written almost half a year ago. I asked myself why, and concluded that the technical hassle involved in writing made me postpone it again and again. I decided to try WordPress’ famous five-minute installation and it actually worked, especially with GoDaddy’s hosting service, which does the installation for you.

These instructions helped me a lot, and I moved all the posts, comments and images from dasBlog to WordPress in minutes. The only problem was moving the dasBlog permalinks. I tried WordPress’ redirection plugins, but for some reason they didn’t work. The solution was a simple web page that replaces the old blog and returns 301 (Moved Permanently) responses, so the old permalinks send the user to the new WordPress blog.
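The original page was not written in Python, but the idea is small enough to sketch here: anything that hits the old blog’s address gets a 301 pointing at the new WordPress URL. The base URL and the path mapping below are placeholders, since my real permalink translation was specific to dasBlog.

from flask import Flask, redirect

app = Flask(__name__)
NEW_BLOG = "https://example.com/blog"      # the new WordPress base URL (placeholder)

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def old_permalink(path: str):
    # Translate the old dasBlog permalink to the new WordPress one here;
    # the simplest possible version just keeps the same path.
    return redirect(f"{NEW_BLOG}/{path}", code=301)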

One more tip: before moving posts from one blog platform to another, rewrite the URLs of the images (and any other resources).
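If the posts sit in a single export file, the rewrite can be as simple as a search and replace. A minimal sketch, assuming you know the old and new resource locations (the file name and both URLs below are placeholders):

from pathlib import Path

export = Path("wordpress_import.xml")      # the export/import file
text = export.read_text(encoding="utf-8")
text = text.replace(
    "http://oldblog.example.com/content/",                 # old dasBlog resource path
    "https://newblog.example.com/wp-content/uploads/",     # new WordPress uploads path
)
export.write_text(text, encoding="utf-8")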

I hope you’ll enjoy the new look and, most importantly, the new content.

 

 Posted by at 9:11 pm
Sep 14 2011
 

You must be wondering why this blog hasn’t been updated for a long time. The reason is that I started a new job. :-)  Among other things, I’m playing with QlikView, and I’d like to share some interesting ideas.


In QlikView, list boxes that filter the entire sheet (or document) are very common. I thought that if we have a list box with a small number of values, it might look nicer to replace each value with an image. For example, say we have three divisions in the company. Instead of putting up a list box with their names, we can put up three images with their logos and make a click on an image filter on that value (very simple with the Actions tab). The problem is that the user won’t know which division the sheet is filtered on. That’s why we need to make the other images more transparent. Using a macro, that’s very easy and the result is awesome:


' Dim one of the division images (a text object) by raising its background transparency.
' Replace "ImageIDHere" with the ID of your text object (for example "TX01").
sub SetTransparentImage
  set image = ActiveDocument.GetSheetObject("ImageIDHere")
  set imageProps = image.GetProperties
  imageProps.Layout.BkgAlpha = 50   ' 50% background transparency
  image.SetProperties imageProps
end sub


Now, put that in the macro module of your document. To add the images to your sheet, add new text objects and set their backgrounds to the images you want. In the Actions tab, set the desired field selection on click and also run this subroutine.

Enjoy.

 Posted by at 2:23 pm
Jun 15 2011
 

For the Google search: Informatica service starts and stops


Yeah, that’s exactly what happened to me. Everything looked just fine until the last step of the installation, where the installer pings the new Integration Service, but there is no answer.
Looking at the Services manager, I saw that the Informatica service was stopped. When I restarted it, it stopped again after 5-10 seconds. I thought it was because of the service’s account settings, but it wasn’t. After some searching and repeated attempts to make it work, this is what I can tell you if you’re stuck like I was:


  • While you’re in the installation wizard, on the node configuration step, press the “Advanced Settings” button. There you’ll see the minimum and maximum port numbers. Go back to the node configuration and change the port number so it falls within that range (why wasn’t it in the first place? …)

  • The Windows PATH environment variable must contain the Informatica server binary folder: <INFORMATICA_INSTALL_DIR>\Server\bin

  • If you’re still stuck, look for the service.log file under the Informatica directory, find the error, and Google it. A small sketch of these last checks follows.
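This is only a rough sketch of the PATH and service.log checks, not an Informatica tool; the install path is a placeholder you would adjust for your environment.

import os
from pathlib import Path

INSTALL_DIR = Path(r"C:\Informatica\9.0")     # hypothetical install location
BIN_DIR = INSTALL_DIR / "Server" / "bin"

# 1. Is the server binary folder on the PATH?
path_entries = [Path(p) for p in os.environ.get("PATH", "").split(os.pathsep) if p]
print(f"{BIN_DIR} on PATH: {BIN_DIR in path_entries}")

# 2. Print the tail of every service.log so the real error is easy to Google.
for log in INSTALL_DIR.rglob("service.log"):
    print(f"--- {log} ---")
    print("\n".join(log.read_text(errors="replace").splitlines()[-20:]))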
Hope I saved somebody a few hours.

 Posted by at 8:57 pm
Jun 02 2011
 

Yes, and it’s great!


Finally, the semantic web is here. This means that from now on, the search engines won’t have to guess what the content of our web pages is. The answer will be there, behind the scenes.
Technically, we (the web developers) will add new attributes to our HTML tags that describe the content of the text. For example, you can add an attribute to your DIV or SPAN that tells the search engine that this is an article, a creative work or a review.
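For illustration only, here is roughly what such markup looks like using schema.org microdata. The type and property names (Article, headline, author, articleBody) come from Schema.org; the text and author are made up.

<!-- A DIV marked up as a schema.org Article -->
<div itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="headline">The semantic web is here</h1>
  <span itemprop="author">Some Blogger</span>
  <div itemprop="articleBody">From now on, search engines won't have to guess...</div>
</div>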


What does this mean for us, the web developers and independent bloggers? That depends. If you want your blog (or site) to appear in more relevant search results, you should add these attributes. Yes, you’ll have to edit your HTML manually or build it into a template. Of course, you don’t have to do it at all.
For those who want a deeper understanding of the subject, go to the common standard’s site: Schema.org

 Posted by at 8:08 pm
May 23 2011
 

After six years of working with SSAS, I decided to start re-reading the best book on the subject: Microsoft SQL Server 2005 Analysis Services. Why read about 2005 when there are more recent versions? Because this is the most comprehensive book on the subject and the only one written by the SSAS development team. The book gives you a better understanding of the technology and shows the motivation behind every architectural decision its creators made. While reading, I keep coming across subjects I want to post about, and I believe that as I read on I’ll post more.


The book cover

At university, I was taught that half crazy is more dangerous than crazy. The reason is simple: you know you should always be careful around the crazy guy, but with the half-crazy person you don’t know when to be careful and when to learn from him. Information systems are the same, and especially a DWH: there is no half truth or imprecise truth. If you know there’s a data problem, bubble it up and don’t show your user half a truth (because it’s also half a lie).

Every time we bind a DB column to a dimension attribute, a DataItem object is created. The DataItem defines the connection between the conceptual and the physical model; for example, it defines which table and column the data is taken from. Among its other properties there is an important one called NullProcessing, which defines how the server handles null data. The default is ZeroOrBlank, meaning the null data is shown to the end user as zero (if the DataType is not string) or as an empty string (if it is). The problem arises when the attribute is a parent in a hierarchy. This is what happened:
We defined a “Sales Amount” measure under the hierarchy Category -> Sub-Category -> Product. By mistake, one of the members at the category level got a null value, and it was shown as zero in the cube. The user was used to seeing the Shoes sub-category under the Clothes category. One day he saw that the Shoes sales amount had dropped dramatically: there were still products under the Shoes sub-category, under the Clothes category, but the subtotal had shrunk. At the bottom of the table, other products were “hidden” under a Shoes sub-category that sat under the zero category. The reason was that some of the shoe products had been given null categories, so in the report the shoe products were split between the Clothes category and the zero category. The user got half a truth, he interpreted it as wrong data, and that should never happen in a BI system.
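The effect is easy to reproduce outside SSAS. This is only an illustration in pandas, not the cube itself: when the null category is silently replaced (SSAS’s ZeroOrBlank uses zero or an empty string depending on the data type; an empty string is used here), the rows split into two groups and the subtotal under Clothes silently shrinks, while treating the null as an error stops the load instead.

import pandas as pd

sales = pd.DataFrame({
    "category":     ["Clothes", "Clothes", None],   # the None is the bad member
    "sub_category": ["Shoes",   "Shoes",   "Shoes"],
    "amount":       [100.0,     80.0,      70.0],
})

# ZeroOrBlank-like behavior: the null silently becomes its own blank member,
# so "Shoes" shows up twice and the Clothes/Shoes subtotal drops to 180.
blank = sales.fillna({"category": ""})
print(blank.groupby(["category", "sub_category"])["amount"].sum())

# Error-like behavior: refuse to show half a truth.
if sales["category"].isna().any():
    raise ValueError("Null category key found - fix the source before processing")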

What can we do? In my opinion, in most cases we should set NullProcessing to Error, which throws an error and prevents us from showing wrong data. That way, we (the IT people) learn about the problem before the end users do.
NullProcessing has a few more options:


  • Preserve – keeps the null value with no replacement. As I showed, this is not recommended. Furthermore, Preserve uses more system resources.

  • Unknown Member – replaces the null with the unknown member you defined in the dimension. I recommend this option only if the users know that the unknown member exists and what it means.

  • Automatic – the server selects the most appropriate behavior according to the context. I don’t really understand what that means and the authors don’t explain it, but in general it’s better to define deterministic behavior for your system.

Does this mean we should now go over all our data bindings and change their NullProcessing property? Of course not! We need to think about where this behavior can do damage and where it doesn’t matter. For example, for the key attribute of a dimension that is also the leaf level of a hierarchy, the ZeroOrBlank option is perfectly fine.

 Posted by at 10:56 am
Mar 29 2011
 

We’re doing something very useful at our workplace and because I believe in collaboration I would like to share it with you, dear readers. I’m sure it will help a lot of system analysts and developers.

 

When we write down the design of an ETL process, especially for complex fact tables, we find it necessary to express it as a flowchart. This creates a simple, common language between the designer/analyst and the developer, and even with the customer. It’s much easier for everyone to read a flowchart than a long and complex document. So we create diagrams like the one below using Visio (this particular diagram was created with Google Docs, but it’s pretty much the same):

 

I’ve only included the beginning of the diagram here; the real one is three times bigger.
Note that every leaf in the tree has its own number. These numbers indicate the record’s “case number”. Now for the trick: in the target table (fact, dimension or anything else) we add a new column that contains the case number, according to the numbers shown in the diagram. This field is not exposed to the end user. The benefits of this trick are significant (a small sketch follows the list):

 

  • Already in the early stages of the implementation we can check the statistics, meaning how many times each case occurs, and see whether we got what we expected. This can help us find things that we (and sometimes even the users) weren’t aware of, and catch bugs at an early stage of development.
  • Each record can “tell its story” and explain why it was created. This saves us a lot of effort when we want to question the data and track down the source of a problem.
  • During the tests we check two different things: whether each record was “classified” correctly by the ETL (by checking the case number) and whether the values loaded into the other fields are correct. Where we once checked only the correctness of the values and then hunted for the problems, we can now detect them much faster.
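A minimal sketch of the idea in Python (nothing here is our real ETL; the rules, column names and case numbers are invented): every record gets exactly one leaf number from the decision tree, the number is stored in an extra column, and counting it gives the statistics from the first bullet.

from collections import Counter

def case_number(row: dict) -> int:
    """Walk the (hypothetical) decision tree and return the leaf number."""
    if row.get("customer_id") is None:
        return 1                      # leaf 1: unknown customer
    if row.get("amount", 0) < 0:
        return 2                      # leaf 2: refund / negative amount
    if row.get("currency") != "USD":
        return 3                      # leaf 3: needs currency conversion
    return 4                          # leaf 4: the plain, expected case

rows = [
    {"customer_id": 17, "amount": 120.0, "currency": "USD"},
    {"customer_id": None, "amount": 35.0, "currency": "USD"},
    {"customer_id": 42, "amount": -10.0, "currency": "EUR"},
]

for row in rows:
    row["case_number"] = case_number(row)    # the extra, non-exposed column

# How many times does each case occur? (the statistics check)
print(Counter(row["case_number"] for row in rows))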

After we came up with this idea, I had another one: why not expose this new field to the users/analysts and let them “enjoy” this knowledge too? Maybe these statistics can help them with their data analysis as well? This requires building a “cases” dimension and adding a description for every case number. I don’t think it’s a good idea in every situation, but you should know the option exists.

 

 Posted by at 1:20 pm