From time to time, ETL will highlight data quality issues. There is often a choice between fixing the issue at the source or in the ETL processes.
In this blog I argue that the issue should be fixed at the source and, for the times when that isn’t practical, provide some guidelines to ensure the best outcome.
Correcting the issue at the source is the gold standard: it not only corrects the data itself, but may also address the root cause of the underlying problem (be it process, standards or a technical error). In other words … Go for Gold! It’s always worthwhile aiming for the problem to be fixed at the source, for the following reasons.
It stays fixed
If the issue is fixed at the source, it can prevent similar or related data quality issues from re-occurring. A one-off correction made within the receiving system is no guarantee that the data won’t be reloaded or re-submitted incorrectly in the future, at which point the issue will re-appear. Correct it at source and it stays fixed.
It stays fixed for everyone
If the issue is fixed at the source, it is fixed not only for the warehouse but also for the source system and any other downstream users of the data. Never underestimate the breadth of impact a data quality issue may have. Imagine a spider’s web with lots of tangled threads radiating out from the centre, and you’re looking at how data quality issues can thread through an organisation. Enforcing a solution at the source means everyone benefits from a consistent and correct view of the data.
It’s usually cheaper
Even though it may appear easier to just patch in the warehouse, appearances are often deceptive. It’s worthwhile keeping the following saying in mind:
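“If you want quick and dirty, we can guarantee the dirty but not the quick.” One example of a hidden cost is reconciling the source and the target: once the warehouse has been patched, the two no longer agree, and every difference has to be explained.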
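To make the reconciliation cost concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `reconcile` function, the `id` key and the `country` column are hypothetical, and rows are represented as plain dictionaries rather than real source or warehouse tables.

```python
def reconcile(source_rows, target_rows, key="id"):
    """Compare source and target rows by key and report the differences."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    shared_keys = source.keys() & target.keys()
    return {
        # Rows the source has but the warehouse never received.
        "missing_in_target": sorted(source.keys() - target.keys()),
        # Rows in the warehouse with no counterpart in the source.
        "unexpected_in_target": sorted(target.keys() - source.keys()),
        # Rows present on both sides whose values differ.
        "mismatched": sorted(k for k in shared_keys if source[k] != target[k]),
    }


# Hypothetical data: the warehouse has patched "UK" to "GB" for id 2.
source_rows = [
    {"id": 1, "country": "NZ"},
    {"id": 2, "country": "UK"},
    {"id": 3, "country": "AU"},
]
target_rows = [
    {"id": 1, "country": "NZ"},
    {"id": 2, "country": "GB"},
]

report = reconcile(source_rows, target_rows)
print(report)
```

In this sketch the patched row shows up as a mismatch on every run, alongside any real loading problem such as the missing row. Someone has to tell the two apart each time, which is part of what makes a warehouse patch more expensive than it first appears.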
The reality, though, is that sometimes we are forced to apply a patch in the data warehouse. The root cause may involve external systems over which we have little control. Co-ordinating the fix may be long and complex, and meanwhile we need to do something to keep the data flowing.
In this case we need to settle (at least temporarily) for silver, but here are some steps to ensure that the fix is as effective as possible:
Raise the Issue
Ensure the issue is communicated so that other users of the data are aware of it. Those responsible for the root cause should know about it so that, at the very least, its impact doesn’t grow. Highlighting the impact of data quality issues may be all that’s needed to get the ball rolling towards addressing them.
Isolate the fix
Minimize the flow-on effect of the issue. In other words, avoid becoming part of the problem. Keep the temporary fix self-contained so that, once the root cause is fixed, it is easy to remove. Don’t compromise on your standards for attaining the best data quality possible. Remember: go for gold!
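One way to isolate a fix is to keep it in a single, clearly named step that can be switched off or deleted without touching the rest of the ETL. The Python sketch below is only an illustration under assumed names: the `patch_country_code` function, the `COUNTRY_CODE_PATCH_ENABLED` flag, the "UK" to "GB" mapping and the `country_source_value` column are all hypothetical.

```python
import logging

logger = logging.getLogger("etl.patches")

# Hypothetical temporary patch: the source sends "UK" where "GB" is expected.
# Set the flag to False (or delete this step) once the source is fixed.
COUNTRY_CODE_PATCH_ENABLED = True
COUNTRY_CODE_FIXES = {"UK": "GB"}


def patch_country_code(row, enabled=COUNTRY_CODE_PATCH_ENABLED):
    """Return a corrected copy of the row; the input row is not modified."""
    if not enabled:
        return row
    original = row.get("country")
    fixed = COUNTRY_CODE_FIXES.get(original)
    if fixed is None:
        return row  # Nothing to patch for this row.
    # Log every correction so the size of the problem stays visible.
    logger.warning(
        "Temporary patch applied: id=%s country %r -> %r",
        row.get("id"), original, fixed,
    )
    # Keep the value as received so the change can be audited or undone.
    return {**row, "country": fixed, "country_source_value": original}


def transform(rows, patch_enabled=COUNTRY_CODE_PATCH_ENABLED):
    """Apply the isolated patch step to every row."""
    return [patch_country_code(row, patch_enabled) for row in rows]


rows = [{"id": 1, "country": "NZ"}, {"id": 2, "country": "UK"}]
patched = transform(rows)
unpatched = transform(rows, patch_enabled=False)
print(patched)
```

Because the patch lives in one place, preserves the value as received and logs each correction, removing it later is a small change, and the log gives a running count of affected rows to quote when chasing the root cause.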
Revisit the Issue
Keep an eye on progress towards fixing the root cause. Don’t drop the issue because the symptoms have disappeared for the moment.
Keep aiming for gold in data quality!