Theory vs. Practice in Data Warehouse Interaction with Source Systems

As data warehousing experts, we often have to deal with the less-than-perfect realities of the systems that we load data from. I started thinking about the various issues and put together a list of them.

The purpose of this list is to be able to clearly explain to others in your organization why there may be issues with the data warehouse and how many of these issues may not be under the control of the data warehousing team themselves.

In a Platonically perfect world, source systems would be/have:

  1. Available continually 24×7 with no significant downtime, particularly unscheduled downtime. 15 minutes once a month or something similar would probably be acceptable, but not much more.
  2. Fully under control of the data warehousing team both from a technical and functional perspective, including the ability to:
    1. Modify structures such as indices and partitioning for improved query performance
    2. Determine when database patches and platform upgrades took place
    3. Create, update, and delete users as necessary for access and DW loads
  3. Not have to worry about nightly query performance load
  4. Co-located on the same database server with the same DB platform as the data warehouse
  5. Never change datatypes, columns, or data structures in general, and if this is absolutely necessary, the DW team would be informed at least one to two months in advance with an exhaustive change list with details down to the table, column, constraint, or other most granular level
    1. In other words, confidence and trust in the vendor or team managing the source system to be able to know that they provide a high-quality and reliable changelog
  6. Always “correct” data-wise – that is, no columns overloaded, if any new columns are added all data is reliable, etc.
  7. Fully documented with a detailed data dictionary
  8. Have documented archival procedures
  9. Have a DBA team which is always on-call and available when problems arise (which they invariably do) and which is cooperative and aligned from an incentive perspective with the DW team
  10. Be proactively managed by the DBA team (often separate from the DW team) to avoid any performance, data, or replication issues
  11. Basically, work exactly the way the DW team thinks it should :)

Note that I did NOT include general adherence to third normal form above – there are a number of reasons even in source (generally OLTP) systems where adhering to perfect normalization is not always necessary or desirable. If the source database is badly designed and deviates from third normal form simply because of poor design and planning, that is one thing and probably needs to be addressed – and will cause a lot of dirty data – but again, there are legitimate reasons in well-designed physical models to deviate to some reasonable extent from 3rd or some other level of normal forms.

In theory, theory is the same as practice, but in practice it usually isn’t. Compared to the above list of scenarios, here is usually what happens in the real world.

  1. Unplanned or unannounced downtime, or at least downtime with little announcement (e.g. a few hours) which does not provide the DW team with the ability to staff and respond appropriately
  2. The DW team often has little or no control over the source system data structures, and therefore ends up having to do unnecessary troubleshooting or performance optimization against a less-than-optimally designed database. This is even worse if the source system is on a different DB platform, because platforms vary widely in how they handle things like indexing, partitioning, query execution plans, and performance optimization. This can require the DW team to gain significant expertise in a new DB platform just to write the extract queries to perform at an acceptable level.
  3. Usually it ends up the DW team’s responsibility to have to write queries that perform appropriately against the source database regardless of deficiencies in the DB design and architecture.
  4. The source system is rarely co-located on the same DB server. This can cause problems from a performance perspective, and can even prohibit the use of certain features that are very useful for some ETL or DB engines – e.g. Informatica full Pushdown Optimization (PDO) requires all relevant entities to be on the same DB server or cluster.
  5. In the real world, changes happen to the source database on a regular basis for reasons of patches, functionality changes, or the moon being full/other random reasons. The DW team is often given little or no notice (until nightly loads break) and there is often little or no formal documentation available of the changes requiring an automated analysis or database diff if the team is lucky and has another instance of the source DB at an older version to do an automated diff.
  6. There is always dirty data, overloaded columns, strange or badly used datatypes (BLOBs, RAW, etc., etc.) or other problems in the data in the source DB.
  7. Documentation is often poor or non-existent at the necessarily detailed level.
  8. Often little or no documented archival procedures.
  9. DBA teams for source systems often have many other priorities which to them are more important than dealing with DW loads. In many organizations the DW team is not part of the same group as other DBAs. The DW team often therefore gets the “short end of the stick” when it comes to troubleshooting any load or performance issues.
  10. Relating to the previous point, overloaded DBA teams are usually reactive and not proactive in their responses to source DB issues. In many cases the first time they hear of problems is when users report query or systems issues. As the DW team is in many cases writing the most demanding queries covering the largest amount of data at one time, they often end up finding issues, and their issues are often of low priority to source system DBAs if the issues do not affect a significant number of other users.
  11. Never happens!

Why having an Operational Data Store in your Data Warehouse Is Often Helpful

When building a Data Warehouse, or any sort of “downstream” reporting solution, one of the major questions that often comes up is: “Should we replicate1 all of the data structures that exist in the source databases2 from which we a building a warehouse, whether the sources be:”

  1. A COTS ERP solution (e.g. Oracle E-Business Suite),
    2. A custom in-house developed software application,
  2. COTS domain-specific applications, for example, manufacturing-specific applications for a widget-manufacturing business, or a travel planning application for a travel agency,
  3. Other types of applications which have a relational SQL DB of some sort on their back-end

Also, by replication in this scenario I am specifically not implying one way or the other that replication in the sense of Oracle Streams or Goldengate must  be used to do real-time replication of data structures between databases. One could use a custom-designed nightly load job based on deltas, or if the source dataset is relatively small, perhaps even a full truncate-and-load every night (although I tend not to like full truncate-and-load scenarios because of the extra stress they put on the source, network, and target.) One could use Streams or Goldengate but it is not necessary – and real-time replication also tends to be a bit more fragile in my experience.

Now, why would we want to replicate source databases to the target? Many people would say that this:

  1. Wastes space in the target, and

  2. Unnecessarily complicates the loading process.

These are definitely cons of creating a replica. I would argue that the pros outweigh the cons, and that the copying/replication of source data is usually quite a bit less complicated than the transformation process to turn relational data into a star or snowflake schema. I believe that there are many good reasons to create a copy of the relational source databases at the target and will lay out a case as to why this is so. What is interesting is that many of the strongest arguments for this tend to be as much political as they are technological. An uncomfortable truth about much of reporting and data warehousing is that getting these kinds of systems up-and-running are as much about political challenges as they are about technical challenges.

Reasons to create a replica of source data in your warehouse:

  1. Unreliable source systems — source systems that have a significant amount of unplanned downtime or database availability issues can be a problem for the Data Warehousing team. If you have a copy of the source data, even if it is a day old, it is often a help for your nightly loads and for your developers versus not having access to the source system data at all.
  2. Vendors who do not permit access directly to application databases but do create an abstracted “replica”. I have encountered this scenario with several vendors in the past.
  3. Source systems which are too overloaded with queries related to transformation of data. Copying the source data row-for-row without transformations is usually a lot less costly from a SQL perspective than pushing significant amounts of transformation logic to the source database. It is nice to be able to write whatever SQL is needed (analytic functions, multiple WHERE clauses, aggregates, sliding windows, many joins) without having to worry if you are going to affect availability of critical business applications for users.

Are there any alternatives to fully replicating source databases? Here are the other major options that I can think of.

  1. Scanning source databases for delta/changes on a regular basis – often nightly but not always — and applying these changes directly to denormalized/star/snowflake or other reporting schemas which are derived or calculated representations of the original data. In this scenario the tables to which the changes are applied are directly visible to users via reporting tools and in some cases with direct SQL or table access.
  2. Scanning source databases for delta/changes on a regular basis – often nightly but not always – and creating a staging area which is still a rough representation of the source but with some data cleanup and possibly denormalization or row filtering applied. In this scenario the tables to which the changes are applied are not directly visible to users via reporting tools and changes from these tables would still need to be further applied downstream to denormalized/star/snowflake or other reporting schemas which are derived or calculated representations of the original data.

Option #1 is exactly what we are trying to avoid for the reasons previously discussed. Option #2 might be an acceptable hybrid option in certain scenarios.

  1. By “replicate” I mean make an exact table-for-table copy of everything in the source database with the same column datatypes, index definitions, primary and foreign key relationships*, and other items. In many cases your data warehouse will be on a different relational DB platform than your source applications, e.g. Oracle is the source and Teradata is the target. This would make exact replication impossible as different databases have different implementations of datatypes, etc. In this scenario of differing DB platforms my definition of “replicate” would still generally hold by trying to match the datatypes between platforms as closely as possible. Therefore one would match a NUMBER in Oracle to an INTEGER in Teradata.

    I am also not suggesting that primary and foreign key relationships have to be technically enforced via table-level constraints in the target DB as there are performance and other penalties for this – I am speaking more from a conceptual perspective that the modelled relationships should remain the same between the tables in this replication scenario even if the relationships are not technically enforced by constraints. Whether to enforce constraints on a target copy of the source is another involved discussion with what I believe are good points on both sides that we can continue in another blog post :).


  2. The above list is meant to reference source systems which are mainly of a relational nature. Discussions regarding flat files, Access databases (always frustrating to deal with), Excel files, other non-relational types of source data, or anything that is what we might call “unstructured” are separate and not very relevant to the conversation that I would like to have here. 

Census Data Modeling Project

While in between major projects at work, I have been looking for various things to investigate and learn. IT is a field which requires constant learning to keep one’s skills fresh and relevant!

I decided to look for large, publicly available data sets which would be useful for analysis, particularly with a mind to first fully normalizing the data set (if not already normalized into at least third normal form) and then subsequently denormalizing it into a star schema format for use in data warehousing.

There are various data sets out there, but the most interesting to me was the United States Census. I did some searching and did not find other data sets which had the same volume/size, and were easily imported into a relational database. I am sure that others could point out many data sets that I overlooked but this was a good place to start.

What surprised me as I looked into the Census data was how far away from third normal form or any sort of true relationalized format it was. It is really a bunch of scattered files of varying types. What I wanted was to look at an Entity Relationship Diagram and see all of the exact entities, columns, fields, etc., defined and the relationships between each entity. I then expected to be able to download each of these tables in at least a csv file or something similar and column-for-column, row- for-row, upload these tables into a database (Oracle was my preferred choice in this case) and already be pretty close to being able to denormalize selective portions of the data. I didn’t find anything even close to this!

Here is the closed thing to an ERD that I could find:

Now keep in mind for now I was only interested in the 2010 census data, not previous censuses (which in the US happen every ten years) nor the American Community Survey data which is taken on a yearly basis. So we are not even getting into the complexities of longitudinally comparing censuses from different years, just looking at the 2010 census.

After speaking with a friend of mine who is an economics professor, it seems likely to me that the data was organized and uploaded by people who are primarily economists and statisticians and not data modelers. I say this because my friend wasn’t aware of what data normalization was, never mind things like referential integrity or check constraints, etc. I think not being aware of at least what the normal forms are and why they are useful poses problems when you try to work with larger datasets.

This will be an ongoing project, and I plan to upload all documents and scripts that I have created in this process for others to use and verify.

First Post

As they say on Slashdot.

Welcome to my blog. The intention of this site is to discuss matters relating to information technology, particularly data warehousing, data mining, and from time to time related social and political issues.

Stay tuned for more to come soon.