As data warehousing experts, we often have to deal with the less-than-perfect realities of the systems that we load data from. I started thinking about the various issues and put together a list of them.
The purpose of this list is to help you clearly explain to others in your organization why there may be issues with the data warehouse, and why many of those issues may not be under the control of the data warehousing team itself.
In a Platonically perfect world, source systems would be/have:
- Available 24×7 with no significant downtime, particularly unscheduled downtime; 15 minutes once a month or so would probably be acceptable, but not much more.
- Ability for the data warehousing team to have a seat at the table as a stakeholder for technical and, to some extent, functional input – particularly the ability to:
  - Suggest additions or modifications to structures such as indices and partitioning for improved query performance
  - Have a say in when database patches and platform upgrades take place
  - Request and receive prompt creation, update, and deletion of DB users as necessary for access and DW loads
  - Not have to worry about the performance of the nightly extract query load
- Co-located on the same database server with the same DB platform as the data warehouse
  - This is a bit complex and has pros and cons – read the complications below about what happens if you’re NOT co-located.
- Never change datatypes, columns, or data structures in general; if a change is absolutely necessary, the DW team would be informed at least one to two months in advance, with an exhaustive change list detailed down to the table, column, constraint, or other most granular level
  - In other words, enough confidence and trust in the vendor or team managing the source system to know that they will provide a high-quality, reliable changelog
- Always “correct” data-wise – that is, no overloaded columns, reliable data in any newly added columns, etc.
- Fully documented with a detailed data dictionary
- Have documented archival procedures
- Have a DBA team that is always on call and available when problems arise (which they invariably do), and that is cooperative and aligned from an incentive perspective with the DW team
- Be proactively managed by the DBA team (often separate from the DW team) to avoid any performance, data, or replication issues
- Basically, work exactly the way the DW team thinks it should 🙂
Note that I did NOT include general adherence to third normal form above – there are a number of reasons why, even in source (generally OLTP) systems, adhering to perfect normalization is not always necessary or desirable. If the source database deviates from third normal form simply because of poor design and planning, that is another matter: it probably needs to be addressed, and it will cause a lot of dirty data. But there are legitimate reasons for well-designed physical models to deviate to some reasonable extent from third (or some other) normal form.
In theory, theory is the same as practice, but in practice it usually isn’t. Compared to the list above, here is what usually happens in the real world.
- Unplanned or unannounced downtime, or at least downtime with little advance notice (e.g. a few hours), which does not give the DW team the ability to staff and respond appropriately
- The DW team often has little or no control over the source system data structures, and therefore ends up doing unnecessary troubleshooting or performance optimization against a less-than-optimally designed database. This is even worse if the source system is on a different DB platform, because platforms vary widely in how they handle indexing, partitioning, query execution plans, and performance optimization; the DW team may need to gain significant expertise in a new DB platform just to write extract queries that perform at an acceptable level.
- It usually ends up being the DW team’s responsibility to write queries that perform appropriately against the source database, regardless of deficiencies in the DB design and architecture.
- The source system is rarely co-located on the same DB server. This can cause problems from a performance perspective, and can even prohibit the use of certain features that are very useful with some ETL or DB engines – e.g. Informatica’s full Pushdown Optimization (PDO) requires all relevant entities to be on the same DB server or cluster.
- In the real world, changes happen to the source database on a regular basis due to patches, functionality changes, or the moon being full/other random reasons. The DW team is often given little or no notice (until nightly loads break), and there is often little or no formal documentation of the changes – meaning an automated analysis or database diff is needed, if the team is lucky enough to have another instance of the source DB at the older version to diff against (see the schema-diff sketch after this list).
- There is always dirty data in the source DB: overloaded columns, strange or badly used datatypes (BLOBs, RAW, etc.), or other problems (a quick value-profiling sketch also follows this list).
- Documentation is often poor or non-existent at the necessary level of detail.
- There are often few or no documented archival procedures.
- DBA teams for source systems often have many other priorities that, to them, are more important than dealing with DW loads. In many organizations the DW team is not part of the same group as the other DBAs, so it often gets the “short end of the stick” when it comes to troubleshooting load or performance issues.
- Related to the previous point, overloaded DBA teams are usually reactive rather than proactive in their responses to source DB issues; in many cases the first they hear of a problem is when users report query or system issues. Since the DW team is often writing the most demanding queries, covering the largest amount of data at one time, they frequently end up finding issues first – and those issues are often low priority for source-system DBAs if they do not affect a significant number of other users.
- Never happens!
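
If you are lucky enough to have that older instance, even a small script can turn “no changelog” into a usable change list. The sketch below is a minimal, hypothetical example rather than any particular vendor’s tool: it assumes you have already dumped (table, column) → data type rows from each instance’s information_schema.columns (or your platform’s equivalent catalog) into two Python dicts, and it simply reports what was added, dropped, or retyped.

```python
# Minimal schema-diff sketch: compare two snapshots of (table, column) -> data_type.
# In practice you would populate each snapshot from the platform's catalog, e.g.:
#   SELECT table_name, column_name, data_type
#   FROM information_schema.columns
#   WHERE table_schema = 'your_source_schema';
# The dicts below are hypothetical stand-ins for those query results.

def diff_schemas(old: dict, new: dict) -> dict:
    """Return columns added, dropped, or retyped between two schema snapshots."""
    old_cols, new_cols = set(old), set(new)
    return {
        "added":   sorted(new_cols - old_cols),
        "dropped": sorted(old_cols - new_cols),
        "retyped": sorted(
            (col, old[col], new[col])
            for col in old_cols & new_cols
            if old[col] != new[col]
        ),
    }

if __name__ == "__main__":
    old_snapshot = {
        ("orders", "order_id"):  "integer",
        ("orders", "status"):    "varchar(10)",
        ("orders", "ship_date"): "date",
    }
    new_snapshot = {
        ("orders", "order_id"):  "integer",
        ("orders", "status"):    "varchar(30)",  # widened without notice
        ("orders", "channel"):   "varchar(20)",  # added without notice
    }
    for kind, items in diff_schemas(old_snapshot, new_snapshot).items():
        print(kind, items)
```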
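Similarly, when arguing about dirty data and overloaded columns, a quick value profile is often the most convincing evidence. The snippet below is another rough sketch, not tied to any platform: it assumes the suspect column’s values have already been fetched into a Python list via whatever DB driver you use, and it simply buckets them by what they look like.

```python
# Rough profile of a suspected overloaded column: bucket values by apparent type.
# Assumes the column's values have already been fetched into a list of strings
# via your DB driver of choice; the sample list here is purely illustrative.
import re
from collections import Counter

def profile_values(values):
    """Count how many values look numeric, date-like, empty, or like free text."""
    counts = Counter()
    for v in values:
        if v is None or str(v).strip() == "":
            counts["empty"] += 1
        elif re.fullmatch(r"-?\d+(\.\d+)?", str(v)):
            counts["numeric-looking"] += 1
        elif re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(v)):
            counts["date-looking"] += 1
        else:
            counts["free text"] += 1
    return counts

if __name__ == "__main__":
    sample = ["42", "2021-07-01", "N/A", "", None, "17.5", "see notes", "2021-07-02"]
    for bucket, n in profile_values(sample).most_common():
        print(f"{bucket}: {n}")
```

A column whose values land in several of these buckets at once is usually the clearest sign of overloading you can show a source-system owner.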