Tuesday, June 22, 2010

This is the fourth post about Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the third part of the seminar which discussed SSIS.

  • Log everything that happens inside the package. Use the SSIS log providers and the event handlers. The effect on performance is not significant.
  • Analyize the log data. You can analize total execution time, SSAS partition processing time and much more.
  • Log detailed error information about rejected rows. You can even log the actual error row.
  • Monitor also the execution of the ETL jobs.
  • Data Profiling is a new feature in SSIS 2008. It's used for analysis of the data's quality in the DWH. Go and learn.
  • Configure everything you can in the package configuration, such as: tasks, containers, variables, connection managers and data flow components. You can use XML config file, config from parent package, registry configuration or even configuration from a table.
  • The Cache Transformation saves data in the cache. It's very useful in lookups.
  • Data types: Make them as "small" as possible (better performance). Be aware of precision issues when using money, float and decimal types.
  • Don't sort in SSIS unless it's absolutely necessary.
  • Sometimes using T-SQL will be faster then running it in SSIS.

Some of the tips are also relevant in Informatica. For example, I always let the DB make the sort whenever possible. Sometimes it's also works with joins. Running packages or procedures can be much more faster than running it in the ETL, but if you look at it from maintenance point of view you'll do it in the ETL. It depends on much factors, such as the time you can have to run the ETL, the knowledge in your team and more.

Tuesday, June 22, 2010 10:40:51 PM (Jerusalem Daylight Time, UTC+03:00)

This is the third post about Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the second part of the seminar which discussed SSAS design, but here I'll focus on near real-time cubes. I decided to write an independent post about near real-time cubes because this subject deserves its own place. By the way, I personally think that this subject is very interesting.

There are several solutions to implement near real-time cubes:

  1. Processing only the last partition. Pros: Trivial and simple. Cons: Might impact users in terms of resources and cache.
  2. Small real-time cube. Pros: Simple and elegant solution. Cons: It's small, so it can't contain much data even if you need to. This is the main reason why people usually don't do it.
  3. Backend processing server. Now it's getting interesting. The backend server is only used for processing. The front-end server(s) is used for querying. Pros: Good performance on query time. Cons: You need to take care of syncing the cubes, using XMLA sync, physical file sync (SAN snapshots) or any other technology. The cache is deleted with every sync.
  4. Flip-Flop. It's like the previous solution, only here we switch between the servers. Pros: It enables you to check your data/cube before making the switch. Good performance on query time. Cons: You need to take care of changing the connection string in the client side or implement NLB solution. It might disconnect users or you need to take care of waiting for the end of the session and only then change the connection string.
  5. ROLAP partition. Pros: No processing at all! Cons: Bad, bad performance.

If anyone have implemented one of these I'll be happy to hear.

Tuesday, June 22, 2010 10:17:34 PM (Jerusalem Daylight Time, UTC+03:00)
 Monday, June 14, 2010

This is the second post about Itay Braun's seminar in the SQL & BI conference. In this post, I'll write some notes I collected from the second part of the seminar which discussed SSAS design.

  • There's a very common misunderstanding about what is an entity. You can see it when there are a lot of dimensions. If there are "car color", "car manufacturer" and "car creation year" dimensions it seems you've got a problem. If there are dimensions with only key and description attributes (or worse - only one description attribute) then you've got a serious problem.
  • Design to Scale: Often the performance problem are caused by the dimensions and not by the fact/cubes.
  • Don't develop SSAS with hebrew (or any other right-to-left language). Use hebrew only in translations .I've personally got hurt with this. Maintaining MDX with hebrew is hell.
  • Avoid using linked measures. If you have some, consider joining the cubes to one cube.
  • There need to be a balance between the amount of the dimensions and the amount of the cubes. If they are equal, you've got a problem. BIDS will warn you (in the 2008 version) if you have two measure groups with the same dimensionality.
  • Keep the dimension keys as narrow as possible. Dimension keys based on more than one column is not optimal. Use member properties, such as AttributeHierarchyEnabled, AttributeHierarchyOptimized and GroupingBehaviour.
  • Data types is an issue in SSAS. Make sure the data types in the DSV are the same as in the dimension attributes.
  • The SSAS engine works best when the foreign key between the fact and the dimension is based on integer columns. Read more about dimension design here.
  • Attribute relationships and flexible vs. rigid are critical to performance. As I read more and more about SSAS I understand how much it's crucial to performance.
  • Automatic creation of partitions can be a good solution for a wide variety of problems. Note that when deploying your SSAS solution to the server you might override the partitions you created with your code. To solve this, you can use the deployment wizard or to import your SSAS database into solution.
  • Each partition should have up to 50M records or up to 250MB. Optimize the partition query. It's only plain SQL query. Get help from your DBA and get a better performance.
  • 99% Aggregations will not cover 99% of the aggregation possibilities/combinations. The engine chooses statistically-randomly which aggregation to save. Use the usage based wizard.
  • Most of the processing time is the time required for the relational DB to return the results of the SQL query. As before, get help from the DBA, improve this query (note that you can't change it) and get a shorter processing time.
  • To solve MDX performance issues, you need to learn what is the difference between the formula engine and the storage engine. I was suprised to see that many SSAS developers never heard of it.
  • In MDX Scripts, use scoping instead of conditional calculations when possible.
  • Avoid StrToMember or StrToSet functions.
Monday, June 14, 2010 9:34:43 PM (Jerusalem Daylight Time, UTC+03:00)

These are some important notes I written in the great seminar by Itay Braun. The seminar was held in the SQL & BI conference in Tel-Aviv two weeks ago. Sorry for the delay, believe me I was busy. The title of the seminar was "Solution - Oriented BI for developers". Actually, it was a collection of lectures and tips on general BI and on SSAS, SSIS and SSRS, mainly on the 2008 R2 version. In this post, I'll bring some notes I collected from the first part of the seminar, which was on general BI development and desing. I'll also bring my personal point of view.

  • The solution objective: Taking some reports and showing them is not BI. BI is on combining several sources and bring knowledge based on this integration. Only then it's getting interesting.
  • We need to aspire to answer 5 question through our BI solution: What have happened? Why did it happen? What is happening now? What will happen? What do I want to happen? Just for thinking, try to tell yourself which answers does your BI solution give. I think you'll find out (as I did) that you're far away from these five answers. We all are.
  • BI is a process, not a one-off project. I always say that BI design & development requires maturity. You need to know that you can't do everything in one shot. There are even things that you can do but it won't be right at the moment. The organization also needs to grow up with the solution.
  • You need to decide on an early stage of the project which client tools you going to have. You also need, of course, to decide if you'll have reporting, OLAP or both.
  • The ideal BI solution gives an infrastructure and let the customers do their own reporting. The big disadvantage here is performance. You can't know what kind of a killer-queries the user will run.
  • A good dashboard gives the manager all the knowledge he needs to have in front of him. It tells him what is the current status of his organization without one click. Do you have this kind of dashboard? I don't...
  • Microsoft has a lack of good visual client tools in the BI field. This is where third-side companies, such as Panorama comes.
Monday, June 14, 2010 7:21:47 PM (Jerusalem Daylight Time, UTC+03:00)