Nov 02, 2014
 

One of the best ideas I’ve heard over the years, and implemented right away, is quality gates: after almost every step of the ETL, we stop for a moment and verify that everything is OK so far. For example, after we pull the daily data from the source system, we check that the number of records fetched is what we expected. Another example is a full record count: we check that the total number of records in the DWH table equals the total number of records in the parallel table in the source system, after filtering, cleansing, etc.

After a few years of experience with the concept, I can say it has a major flaw: it takes too long. Every ETL process, especially when the data has to be up to date, must finish as soon as possible. In every minute it runs, it can fail for many reasons. For example, the integration server can hit some kind of limit (CPU, memory, number of parallel processes), the DB connection can drop, or the destination table can get locked. This is why we want the ETL to run as quickly as possible, which contradicts the quality gates concept.

In order to solve the problem, I’m about to take the quality gates and rebuild them as a post-process that runs after the ETL finishes and the data has been deployed to the clients. If this process finds a problem, it will notify us immediately (or even send a mail to the clients in serious scenarios). Note that for this to work we have to keep the intermediate tables and the staging tables, because we’ll want to count them and compare them to the source system. This solution is not perfect: when a problem occurs (and it will, from time to time), there will be a window in which we show our clients data that is incomplete or has too many records, but I think it’s the best we can do for now.
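
To make the idea concrete, here is a minimal sketch of such a check in SQL; the table names (dwh_sales, src_sales) and the DB link (src_link) are hypothetical:

-- Post-process quality gate: compare the record count in the DWH table
-- with the parallel table in the source system, reached via a DB link.
SELECT d.cnt AS dwh_count,
       s.cnt AS source_count,
       CASE WHEN d.cnt = s.cnt THEN 'OK' ELSE 'MISMATCH' END AS gate_status
FROM   (SELECT COUNT(*) AS cnt FROM dwh_sales) d,
       (SELECT COUNT(*) AS cnt FROM src_sales@src_link) s;

A scheduler can run a query like this right after the data is deployed and raise an alert (or send that mail) whenever gate_status comes back as MISMATCH, so the ETL itself never waits for the check.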

We won’t sacrifice the quality of the ordinary daily process for the one-in-a-hundred times when we have problems.

Jun 11, 2014
 

It’s called a synonym in both Oracle and SQL Server: a shortcut that points to another table or view.

The only certain thing in information systems is constant change. You should always prepare for any change that may occur and develop your ETL with future changes in mind. One way of doing that is to set your source to a synonym rather than a real view or table. This allows you to repoint the synonym to another table, synonym or view whenever you need to.
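
For example, a minimal sketch in Oracle syntax (all the names here are hypothetical):

-- The ETL reads from the synonym, never from the table directly.
CREATE SYNONYM etl_customers FOR prod.customers;

-- When the source moves, repoint the synonym; the ETL job doesn't change.
CREATE OR REPLACE SYNONYM etl_customers FOR prod.customers_v2;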

You may think that a view can do the same thing, but the problem is (and I can say this after many years of ETL development) that after some time the developers will always want to change the view, and sooner or later it will become invalid for some reason (an error in a function, a table that has been dropped, a DB link that went stale, etc.). The big pro of a synonym is that you can’t do any of that inside it. Meaning, the con of synonyms is a pro for us.

Jul 08, 2013
 

This post is for those who moved from Informatica to DataStage, as I did.

When you want to extract your data from a single source into multiple flows or targets, you should forget what you did in Informatica. There, you could easily connect the source to multiple flows and that was it. In DataStage, every output of a source stage has its own tab in the stage’s properties, and defining the same output (table name, WHERE clause, etc.) twice will make the SQL run twice against the source DB (!). As I said, the way to do this is different in DS: you should create a single output on the source stage and then create a Transformer stage with multiple outputs.

When I first encountered this method, it seemed pretty ugly compared to Informatica. After I moved on with it and got some experience with DS, I found it has a great advantage: you can use the Transformer stage to define a different constraint for every output. In Informatica, you had to create a separate Filter for every link. In DS, this “partition logic” is centralized in one place.

Jun 13, 2012
 

DataStage contains a very impressive repository search feature. Say you want to find the jobs that use a certain table: you can do it with a few clicks. The problem is that the repository search only looks at table names typed into the “Table Name” field (in the source or target stages). Usually, we use “Custom SQL” and type in our own SQL queries, and DataStage doesn’t scan those when you look for tables in the repository search. So, what can you do?

The trick I found works in many technologies, not just DataStage. You can export the whole project’s jobs to an XML file (right-click on the root of the project -> Export). This XML will probably be a big one (90MB in my case), so Notepad will get stuck while trying to find an expression in the file. Use Notepad++ or TextPad instead, and you’ll find your table in seconds even if it’s in Custom SQL.

Enjoy.

Jun 11, 2012
 

The best thing we can make from our mistakes is learning, but it’s much more fun to learn from others’ mistakes, so you’d better listen. These two mistakes were made by different organizations, but they can happen to all of us.

First Story

The organization installed Oracle’s BI suite on a cluster consisting of two servers. Oracle Enterprise Manager (EM) is their basic monitoring tool, so they decided to install it “on the way”. The only problem was that they installed it on one of the cluster’s machines. Now you can guess what happened. Unfortunately, the machine with the EM went down in the middle of the day, and the IT staff wanted to see the error logs, of course. The only problem was that the EM was down as well. All they could do was bring the machine up and look at the logs between the failures. The lesson: you must separate the monitoring system from the systems it monitors. They can’t reside on the same physical machine.

Second Story

The other organization created its own monitoring system. When an ETL process starts, it notifies the monitor that it started, and it does the same when it ends. The monitoring system wakes up once in a while and checks for processes that started and haven’t finished for a long time (how long depends on the process, of course). The IT people got used to opening the monitor every morning and checking for processes that didn’t finish. The only problem came with a process that didn’t start at all, so the monitor didn’t know there was a problem at all (!!). It took a couple of weeks for the users to find out the data was not up to date, and it wasn’t nice. The lesson: you must separate the monitoring system from the systems it monitors. The process shouldn’t notify the monitor; the monitor needs to check the system independently.
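
A minimal sketch of what such an independent check could look like in SQL, assuming a hypothetical schedule table and run-log table that the monitor queries on its own:

-- Find every process that was scheduled for today but has no run-log
-- entry at all: exactly the case the notification-based monitor missed.
SELECT s.process_name
FROM   etl_schedule s
LEFT JOIN etl_run_log r
       ON  r.process_name = s.process_name
       AND r.run_date     = TRUNC(SYSDATE)
WHERE  r.process_name IS NULL;

A process that never started simply has no log row, so it shows up here immediately instead of weeks later.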

I’ll be happy to read more stories of this kind. Feel free to comment.

Jun 11, 2012
 

As you can see, the last post was written almost half a year ago. I asked myself why and came to the conclusion that the technical hassle involved in writing made me postpone it more and more. I decided to try the famous 5-minute installation of WordPress, and it actually worked, especially with GoDaddy’s hosting services, which perform the installation for you.

These instructions helped me a lot, and I moved all the posts, comments and images from dasBlog to WordPress in minutes. The only problem was moving the permanent links of dasBlog. I tried WordPress’ redirection plugins, but for some reason they didn’t work. The solution was a simple web page, replacing the old blog, which returns 301 (Moved Permanently) responses. This way, the old permalinks send the user to the new WordPress blog.
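
dasBlog runs on IIS, so one simple way to produce such a response is a classic ASP page like this sketch (the target URL is a placeholder, and mapping each old permalink to its new address is left out):

<%
' Answer with 301 (Moved Permanently) and send the browser
' to the new WordPress blog.
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", "http://www.example.com/"
Response.End
%>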

One more tip: before moving the posts from one blog framework to another, rewrite the URLs of the images (and any other resources).

I hope you’ll enjoy the new look and, most important, the new content.

 

Sep 14, 2011
 

You must be wondering why this blog wasn’t updated for such a long time. The reason is that I started a new job. :-)  Along with other stuff, I’m playing with QlikView, and I’d like to share some interesting ideas.


In QlikView, list boxes that filter the entire sheet (or document) are a very common thing. I thought that if we have a list box with a small number of values, it might be more beautiful to replace each value with an image. For example, let’s say we have three divisions in the company. Instead of putting up a list box with their names, we can put three images with their logos and make a click on an image filter on the corresponding value (very simple with the Actions tab). The problem is that the user won’t know which division the sheet is filtered on. That’s why we need to make the other images more transparent. Using a macro, that’s very easy, and the result is awesome:


sub SetTransparentImage
  ' Grab the text object that holds the image; replace ImageIDHere
  ' with the object ID from the object's General tab.
  set image = ActiveDocument.GetSheetObject("ImageIDHere")
  set imageProps = image.GetProperties
  ' Lower the background alpha to make the image semi-transparent.
  imageProps.Layout.BkgAlpha = 50
  image.SetProperties imageProps
end sub


Now, put that in the macro script inside your document. In order to add images to your sheet, add new Text objects and set their background to the image you want. In the Actions tab, set the desired selection on click and also run this subroutine.
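
To handle several divisions at once, a variation like the following sketch could restore the clicked image and dim the others; the object IDs are hypothetical, and I assume the alpha scale is 0-100 as in the example above:

sub SelectDivision(selectedId)
  ' Hypothetical object IDs of the three division images.
  ids = Array("TX01", "TX02", "TX03")
  for each id in ids
    set img = ActiveDocument.GetSheetObject(id)
    set props = img.GetProperties
    if id = selectedId then
      props.Layout.BkgAlpha = 100  ' fully opaque = selected division
    else
      props.Layout.BkgAlpha = 50   ' semi-transparent = not selected
    end if
    img.SetProperties props
  next
end sub

Since the Actions tab invokes a macro by name, each image would call a one-line wrapper sub that passes its own ID to SelectDivision.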

Enjoy.

Jun 15, 2011
 

For the Google search: Informatica service starts and stops


Yeah, that’s exactly what happened to me. Everything looked just fine until the last step of the installation, where the installer pings the new integration service, but there is no answer.
Looking at the services manager, I saw that the Informatica service was stopped. When I restarted it, it stopped again after 5-10 seconds. I thought it was because of the service’s account settings, but it wasn’t. After some time searching and retrying to make it work, this is what I can say if you’re stuck as I was:


  • While you’re in the installation wizard, on the node configuration step, press the “Advanced Settings” button. There, you’ll see the minimum and maximum port numbers. Go back to the node configuration and change the port number so it will be in that range (why it wasn’t in the first place? …)

  • The Windows PATH environment variable must contain the Informatica server binary folder: <INFORMATICA_INSTALL_DIR>\Server\bin

  • If you’re still stuck, look for the service.log file in the Informatica directory. Find the error there and google it.

Hope I saved somebody a few hours.

Jun 02, 2011
 

Yes, and it’s great!


Finally, the semantic web is here. This means that from now on, the search engines won’t have to guess what the content of our web pages is. The answer will be there, behind the scenes.
Technically, we (the web developers) will add new properties to our HTML tags that describe the content of the text. For example, you can add a property to your DIV or SPAN that tells the search engine that this is an article, a creative work or a review.
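
For instance, a minimal sketch using the microdata vocabulary from Schema.org (the text and values here are placeholders):

<div itemscope itemtype="http://schema.org/Article">
  <span itemprop="headline">My Latest Post</span>
  by <span itemprop="author">The Author</span>
</div>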


What is the meaning of this for us, the web developers or independent bloggers? That depends. If you want your blog (or site) to appear in more relevant search results, you should add these properties. Yes, you’ll have to edit your HTML manually or build it into a template. Of course, you don’t have to do this at all.
For those who want a deeper understanding of this subject, go to the site of the common standard – Schema.org
