There is a standard sort of process that goes with loading data into relational databases, including data warehouses. This process usually goes something like this:
1. Get source files, often flat text files that are zipped up (assume text files for the purpose of our discussion today.)
2. Unzip, prepare, and organize the text files into some sort of directory structure. This is often necessary because the files may come zipped in one kind of grouping (e.g. Census data, where everything for a state, such as county and census block records, arrives in one file) while you want to reorganize the data into a different hierarchy or set of tables (e.g. by the type of geographic entity you want to load first, such as state, then county, and so on).
3. Load the source files into staging tables in the database. In Oracle you would probably do this with SQL*Loader, although other tools such as Informatica can also do it. The staging tables usually have minimal or no validation, no relationships, and most or all of the columns are fairly large VARCHAR/VARCHAR2 (you are moving away from VARCHAR per Oracle's recommendations, right? :) ) datatypes. A minimal sketch of this step follows the list.
4. Validate the data with queries against the staging tables, and then create a data model once you have been able to validate and possibly cleanse the data in the staging tables.
5. Load the data from the staging tables into the “real” data tables. This can be done using Informatica, custom SQL scripts, Oracle Data Integrator, or other tools.
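For step 3, the staging objects are usually trivially simple. Here is a minimal sketch of what that traditional step might look like, with a hypothetical table name, source file name, and SQL*Loader control file (your real staging layout would of course vary):

CREATE TABLE GEO_HEADER_RAW_STG
( RAW_LINE VARCHAR2(500) );

-- geo_header.ctl: minimal SQL*Loader control file for a fixed-width source
LOAD DATA
INFILE 'stgeo2010.sf1'
APPEND
INTO TABLE GEO_HEADER_RAW_STG
(RAW_LINE POSITION(1:500) CHAR(500))

You would then run it with something like sqlldr census_loader control=geo_header.ctl.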
I have been working on some Census data modeling and importing, and along the way I found some rather convenient ways to shorten and simplify this sequence of steps.
First of all, why go through the whole staging step? One of the major reasons to stage data is the belief that the data has to live in a table inside the database before you can query it. With Oracle External Tables this is not true. You define a table, but point its source at a flat file on disk.
A simple definition of an external table would look like this:
CREATE OR REPLACE DIRECTORY CENSUS_VM_DIR AS '/media/sf_Census_Data_Files/';
CREATE TABLE GEO_HEADER_STG
( FILEID CHAR(500))
ORGANIZATION EXTERNAL
(
TYPE oracle_loader
DEFAULT DIRECTORY CENSUS_VM_DIR
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE CHARACTERSET WE8ISO8859P1
BADFILE 'CENSUS_BADFILE_DIR':'state_summary.bad'
DISCARDFILE 'CENSUS_DISCARD_DIR':'state_summary.dis'
LOGFILE 'CENSUS_LOGFILE_DIR':'state_summary.log_txt'
FIELDS LDRTRIM
REJECT ROWS WITH ALL NULL FIELDS
(DB_FIELD1 POSITION(1:500) CHAR(500))
)
LOCATION
(
'*geo2010.sf1'
)
)
PARALLEL 4
REJECT LIMIT 100;
What this does is tell Oracle to look in the directory /media/sf_Census_Data_Files/, aliased as the directory object CENSUS_VM_DIR, for all files matching the wildcard *geo2010.sf1. Wildcards in the External Table LOCATION clause are a handy feature added in Oracle 12c in the last year or so. I define one field named FILEID, tell Oracle that the source text files are fixed width and based on position rather than field delimiters, and that the field is 500 characters wide, starting at the first character on the line and ending at the 500th.
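One practical note: the directory object only aliases the path; the schema that queries the External Table also needs privileges on it (unless it owns the directory or has DBA rights). A sketch, where CENSUS_LOADER is a placeholder user name and the grant would typically be run by a DBA:

-- Allow the loading schema to read the source files and write bad/discard/log files
GRANT READ, WRITE ON DIRECTORY CENSUS_VM_DIR TO CENSUS_LOADER;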
A few other notes: Oracle supports a number of character sets. The most basic one is US7ASCII, and that will work for most American or British files. However, languages that use diacritical marks such as accents, diaereses, and umlauts will not come through correctly. Puerto Rico is part of the US and is included in Census data, and some townships and other places there have accents in their names, which caused a problem in my import. The rows would load into the database without error, but would contain a strange question-mark character instead of, for example, ó. So I changed the character set to WE8ISO8859P1, which is the Western European ISO 8859-1 standard, and then things worked fine. I am not working with any double-byte character sets, so I did not have to worry about Unicode.
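If you are not sure which character set your database itself uses (worth knowing when you set CHARACTERSET in the access parameters), a quick query against the NLS data dictionary view will tell you:

-- Show the database and national character sets currently in use
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');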
So where does this leave our process now? We can eliminate a major step: importing the text files into staging tables in the database just to be able to query them. The process now looks like this:
1. Get source files, often flat text files that are zipped up (assume text files for the purpose of our discussion today.)
2. Unzip, prepare, and organize the text files into some sort of directory structure. This is often necessary because the files may come zipped in one kind of grouping (e.g. Census data, where everything for a state, such as county and census block records, arrives in one file) while you want to reorganize the data into a different hierarchy or set of tables (e.g. by the type of geographic entity you want to load first, such as state, then county, and so on).
3. Validate the data with queries against the External Tables, and then create a data model once you have been able to validate and possibly cleanse the data in the External Tables (a couple of sample queries follow the list).
4. Load the data from the External Tables into the “real” data tables. This can be done using Informatica, custom SQL scripts, Oracle Data Integrator, or other tools.
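As an illustration of step 3, a couple of quick sanity checks against the External Table might look like this. The SUBSTR offsets are illustrative only; the real Census geo header layout has its own documented positions:

-- How many rows did we actually read across all the matched files?
SELECT COUNT(*) AS total_rows FROM GEO_HEADER_STG;

-- Row counts grouped by a (hypothetical) state-code slice of the fixed-width line
SELECT SUBSTR(FILEID, 7, 2) AS state_code, COUNT(*) AS row_count
FROM GEO_HEADER_STG
GROUP BY SUBSTR(FILEID, 7, 2)
ORDER BY state_code;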
Can we improve the process further? The answer is yes! Step 2 can be rather onerous if there are a lot of files to download, unzip, and organize. Having to unzip all of these files also takes up a lot of extra disk space. Many of the text files we deal with compress at ratios of 90 to 95%, which means that for every GB of compressed data we could be talking about 10 to 20 GB uncompressed. That is wasteful of disk space even while nothing is happening, and even more wasteful when the database scans these text files, because it has to perform that much more IO. Would it be possible to extract compressed files out of the zips dynamically for an external table read whenever we issue a SQL query? Yes, it would. The key is the PREPROCESSOR directive.
It looks something like this:
PREPROCESSOR 'CENSUS_EXEC_DIR':'census_geo_unzip.sh'
This tells Oracle that whenever it reads the External Table this directive is specified in, it should first execute the script named in the PREPROCESSOR call and use that script's standard output as the text feed the access parameters are applied to. Here is what the full External Table definition might look like in this context:
CREATE TABLE GEO_HEADER_STG
( FILEID CHAR(500))
ORGANIZATION EXTERNAL
(
TYPE oracle_loader
DEFAULT DIRECTORY CENSUS_VM_DIR
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE CHARACTERSET WE8ISO8859P1
BADFILE 'CENSUS_BADFILE_DIR':'state_summary.bad'
DISCARDFILE 'CENSUS_DISCARD_DIR':'state_summary.dis'
LOGFILE 'CENSUS_LOGFILE_DIR':'state_summary.log_txt'
PREPROCESSOR 'CENSUS_EXEC_DIR':'census_geo_unzip.sh'
FIELDS LDRTRIM
REJECT ROWS WITH ALL NULL FIELDS
(DB_FIELD1 POSITION(1:500) CHAR(500))
)
LOCATION
(
'*geo2010.sf1.zip'
)
)
PARALLEL 4
REJECT LIMIT 100;
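Note that CENSUS_EXEC_DIR (like the bad/discard/log directories) is another directory object that has to exist, and for the PREPROCESSOR to work the querying schema needs both READ and EXECUTE on the directory that holds the script; the script itself must also be executable. A sketch, with the path and user name as placeholders:

-- Directory object pointing at wherever the preprocessor script lives
CREATE OR REPLACE DIRECTORY CENSUS_EXEC_DIR AS '/media/sf_census_scripts/';
-- EXECUTE on the directory is what lets Oracle run the script during a query
GRANT READ, EXECUTE ON DIRECTORY CENSUS_EXEC_DIR TO CENSUS_LOADER;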
Your next question might be “What is in census_geo_unzip.sh”? Here are its contents:
#!/bin/bash
# $1 is the zip file path Oracle passes in; -p streams the extracted
# *geo*.sf1 member to standard output rather than writing it to disk.
/bin/unzip -p "$1" \*geo\*.sf1
This runs the /bin/unzip command against the first command-line parameter, which is the zip file location that Oracle passes in, and tells unzip to extract only the files inside the zip that match the wildcard *geo*.sf1. This matters because the zip files also contain other text and explanatory documents, which would cause errors and break the field position list we have given Oracle. The -p parameter is also critical: it tells unzip to write the results to standard output instead of extracting the files and storing them on the filesystem.
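It is also worth running the script by hand before wiring it into the table, passing it a single zip the same way Oracle would; the file name below is just an example:

# Make the script executable, then preview the first few lines it streams out
chmod +x census_geo_unzip.sh
./census_geo_unzip.sh /media/sf_Census_Data_Files/migeo2010.sf1.zip | head -5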
So, after all of this, running the command:
SELECT * FROM GEO_HEADER_STG
will transparently look for zip files in the filesystem location we have specified, dynamically unzip only the zips that match our wildcard specification, extract only the files inside those zips that also match our (other) wildcard specification, and feed the text back to Oracle, which will interpret and parse it into fields as we requested. Pretty slick!
Now, one issue I have had when downloading US Census data is that there are a ton of zip files scattered around a directory structure. I just want to point my FTP client at the US Census site, download all 50 GB (or whatever it is), then point Oracle at the root of what I have downloaded and have it figure it all out.
The procedure above only works for a set of zip files that are already placed in a single directory. Would it be possible for Oracle to recursively iterate over all subdirectories under a given directory, look for zip files matching one wildcard, and then extract only the files from those zips that match another wildcard? You already know the answer by this point :).
Updating the bash script file that we are calling via the PREPROCESSOR directive to this will do the trick:
#!/bin/bash
# Start from the root of the downloaded Census tree, find every *geo*.zip
# beneath it, and stream each one through unzip -p to standard output.
cd /media/sf_source_zips/2000/Summary_File_1
/bin/find . -name \*geo\*.zip | /bin/xargs -n 1 /bin/unzip -p
This is a slightly more involved bash script. It changes to the /media/sf_source_zips/2000/Summary_File_1 directory, finds all of the filenames matching the wildcard *geo*.zip, and then feeds them one at a time to xargs, which calls unzip for each file.
Here is the final, twice-improved process:
1. Get source files, often flat text files that are zipped up (assume text files for the purpose of our discussion today.)
2. Validate the data with queries against the External Tables, and then create a data model once you have been able to validate and possibly cleanse the data in the External Tables.
3. Load the data from the External Tables into the "real" data tables. This can be done using Informatica, custom SQL scripts, Oracle Data Integrator, or other tools (a minimal SQL sketch follows).
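Step 3 then becomes an ordinary INSERT ... SELECT that parses the fixed-width line into typed columns as it goes. A minimal sketch, with a hypothetical target table and illustrative SUBSTR offsets (the real geo header has many more fields):

-- Hypothetical, much-simplified target table
CREATE TABLE GEO_HEADER
( STATE_CODE VARCHAR2(2),
SUMMARY_LEVEL VARCHAR2(3),
LOGICAL_RECNO NUMBER );

-- Slice the 500-character line into real columns as part of the load
INSERT INTO GEO_HEADER (STATE_CODE, SUMMARY_LEVEL, LOGICAL_RECNO)
SELECT SUBSTR(FILEID, 7, 2),
SUBSTR(FILEID, 9, 3),
TO_NUMBER(SUBSTR(FILEID, 19, 7))
FROM GEO_HEADER_STG;
COMMIT;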
Pretty slick, I think!