One of the best ideas I’ve heard over the years, and implemented right away, is quality gates: after almost every step of the ETL, we pause for a moment and verify that everything is OK so far. For example, after we fetch the daily data from the source system, we check that the number of records fetched matches what we expected. Another example is the full record count: we check that the total number of records in the DWH table equals the total number of records in the parallel table in the source system, after filtering, cleansing, etc.
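To make the idea concrete, here is a minimal sketch of such a gate in Python. The table names, connections, and in-memory databases are my own illustration (using SQLite as a stand-in for the real source system and DWH), not the actual implementation:

```python
# Hypothetical quality-gate sketch: compare the record count in the source
# system against the count that landed in the DWH staging table, and stop
# the ETL if they diverge. All names here are illustrative.
import sqlite3

def count_rows(conn, table):
    """Return the number of rows in a table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def quality_gate(source_conn, dwh_conn, source_table, dwh_table):
    """Raise if the counts diverge; return the count otherwise."""
    source_count = count_rows(source_conn, source_table)
    dwh_count = count_rows(dwh_conn, dwh_table)
    if source_count != dwh_count:
        raise RuntimeError(
            f"Quality gate failed: source={source_count}, dwh={dwh_count}")
    return source_count

# Demo with in-memory databases standing in for the real systems.
src = sqlite3.connect(":memory:")
dwh = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER)")
dwh.execute("CREATE TABLE stg_orders (id INTEGER)")
src.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(100)])
dwh.executemany("INSERT INTO stg_orders VALUES (?)", [(i,) for i in range(100)])
print(quality_gate(src, dwh, "orders", "stg_orders"))  # prints 100
```

The point is that the ETL blocks at each gate: the next step only runs once the check passes.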
After a few years of experience with the concept, I can say it has a major flaw: it takes too long. Every ETL process, especially when the data has to be up to date, should finish as quickly as possible. In every minute it runs, the process can fail for many reasons: the integration server can hit some limit (CPU, memory, number of parallel processes), the DB connection can drop, or the destination table can get locked. This is why we want the ETL to be as fast as possible, which contradicts the quality gates concept.
To solve the problem, I’m about to take the quality gates and rebuild them as a post-process that runs after the ETL finishes and the data has been deployed to the clients. If this process finds a problem, it will notify us immediately (or even send a mail to the clients in serious scenarios). Note that for this to work we need to keep the intermediate tables and the stage tables, because we’ll want to count them and compare them to the source system. This solution is not perfect: when a problem occurs (and it will, from time to time), there will be a window in which our clients see incomplete (or over-complete) data, but I think it’s the best we can do for now.
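The shape of that post-process might look something like this sketch. The check names, the count callables, and the notify hook are all assumptions of mine for illustration (in practice the notify hook would send a mail or page the team):

```python
# Hypothetical post-process sketch: instead of stopping mid-ETL, collect every
# count comparison, run them all after the data is already live, and send one
# alert listing everything that diverged. Names and hooks are illustrative.

def run_post_checks(checks, notify):
    """checks: list of (name, get_source_count, get_dwh_count) tuples,
    where the last two are zero-argument callables returning row counts.
    Runs every check, then calls notify once if any mismatch was found."""
    problems = []
    for name, get_source_count, get_dwh_count in checks:
        src, dwh = get_source_count(), get_dwh_count()
        if src != dwh:
            problems.append(f"{name}: source={src}, dwh={dwh}")
    if problems:
        notify("Post-ETL quality checks failed:\n" + "\n".join(problems))
    return problems

# Demo: one passing check and one failing check; alerts stands in for mail.
alerts = []
mismatches = run_post_checks(
    [("daily_orders", lambda: 100, lambda: 100),
     ("customers", lambda: 5000, lambda: 4987)],
    notify=alerts.append,
)
print(mismatches)  # ['customers: source=5000, dwh=4987']
```

Unlike the in-line gates, nothing here blocks the ETL: all the checks run once, after deployment, and a single notification covers every failure.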
We won’t sacrifice the speed of the ordinary daily process for the one-in-a-hundred times when we have problems.