When building a Data Warehouse, or any sort of “downstream” reporting solution, one of the major questions that often comes up is: “Should we replicate 1 all of the data structures that exist in the source databases 2 from which we a building a warehouse, whether the sources be:”
- A COTS ERP solution (e.g. Oracle E-Business Suite),
- A custom in-house developed software application,
- COTS domain-specific applications, for example, manufacturing-specific applications for a widget-manufacturing business, or a travel planning application for a travel agency,
- Other types of applications which have a relational SQL DB of some sort on their back-end
Also, by replication in this scenario I am specifically not implying one way or the other that replication in the sense of Oracle Streams or Goldengate must be used to do real-time replication of data structures between databases. One could use a custom-designed nightly load job based on deltas, or if the source dataset is relatively small, perhaps even a full truncate-and-load every night (although I tend not to like full truncate-and-load scenarios because of the extra stress they put on the source, network, and target.) One could use Streams or Goldengate but it is not necessary – and real-time replication also tends to be a bit more fragile in my experience.
Now, why would we want to replicate source databases to the target? Many people would say that this:
- Wastes space in the target, and
- Unnecessarily complicates the loading process.
These are definitely cons of creating a replica. I would argue that the pros outweigh the cons, and that the copying/replication of source data is usually quite a bit less complicated than the transformation process to turn relational data into a star or snowflake schema. I believe that there are many good reasons to create a copy of the relational source databases at the target and will lay out a case as to why this is so. What is interesting is that many of the strongest arguments for this tend to be as much political as they are technological. An uncomfortable truth about much of reporting and data warehousing is that getting these kinds of systems up-and-running are as much about political challenges as they are about technical challenges.
Reasons to create a replica of source data in your warehouse:
- Unreliable source systems — source systems that have a significant amount of unplanned downtime or database availability issues can be a problem for the Data Warehousing team. If you have a copy of the source data, even if it is a day old, it is often a help for your nightly loads and for your developers versus not having access to the source system data at all.
- Vendors who do not permit access directly to application databases but do create an abstracted “replica”. I have encountered this scenario with several vendors in the past.
- Source systems which are too overloaded with queries related to transformation of data. Copying the source data row-for-row without transformations is usually a lot less costly from a SQL perspective than pushing significant amounts of transformation logic to the source database. It is nice to be able to write whatever SQL is needed (analytic functions, multiple WHERE clauses, aggregates, sliding windows, many joins) without having to worry if you are going to affect availability of critical business applications for users.
Are there any alternatives to fully replicating source databases? Here are the other major options that I can think of.
- Scanning source databases for delta/changes on a regular basis – often nightly but not always — and applying these changes directly to denormalized/star/snowflake or other reporting schemas which are derived or calculated representations of the original data. In this scenario the tables to which the changes are applied are directly visible to users via reporting tools and in some cases with direct SQL or table access.
- Scanning source databases for delta/changes on a regular basis – often nightly but not always – and creating a staging area which is still a rough representation of the source but with some data cleanup and possibly denormalization or row filtering applied. In this scenario the tables to which the changes are applied are not directly visible to users via reporting tools and changes from these tables would still need to be further applied downstream to denormalized/star/snowflake or other reporting schemas which are derived or calculated representations of the original data.
Option #1 is exactly what we are trying to avoid for the reasons previously discussed. Option #2 might be an acceptable hybrid option in certain scenarios.
- By “replicate” I mean make an exact table-for-table copy of everything in the source database with the same column datatypes, index definitions, primary and foreign key relationships*, and other items. In many cases your data warehouse will be on a different relational DB platform than your source applications, e.g. Oracle is the source and Teradata is the target. This would make exact replication impossible as different databases have different implementations of datatypes, etc. In this scenario of differing DB platforms my definition of “replicate” would still generally hold by trying to match the datatypes between platforms as closely as possible. Therefore one would match a NUMBER in Oracle to an INTEGER in Teradata.
I am also not suggesting that primary and foreign key relationships have to be technically enforced via table-level constraints in the target DB as there are performance and other penalties for this – I am speaking more from a conceptual perspective that the modelled relationships should remain the same between the tables in this replication scenario even if the relationships are not technically enforced by constraints. Whether to enforce constraints on a target copy of the source is another involved discussion with what I believe are good points on both sides that we can continue in another blog post :).
- The above list is meant to reference source systems which are mainly of a relational nature. Discussions regarding flat files, Access databases (always frustrating to deal with), Excel files, other non-relational types of source data, or anything that is what we might call “unstructured” are separate and not very relevant to the conversation that I would like to have here.