At OnPrem Solution Partners, we have extensive experience with Amazon Redshift. If you are not familiar with it, Redshift is a column-store relational database designed as a decision support system (DSS). A DSS is a database intended to run queries that read and aggregate large volumes of data at once, usually filtering on a wide and dynamically changing group of columns scattered across various tables.
Redshift has many positive and powerful qualities:
- Can quickly scale to large amounts of storage space and compute power on demand
- Every operation available through the AWS GUI can also be done through the AWS command-line interface, as well as through mature Python and Java APIs.
- Drivers and connectivity available for a wide range of common tools such as Tableau, Microstrategy, Informatica, etc.
- Redshift Spectrum, which allows defining what are more-or-less external tables in S3 but querying them as a Redshift table. You can even join some tables which physically exist in Redshift with other external tables which are backed by files on S3.
- Deeply integrated with other AWS services
- Which might actually be a double-edged sword if you are worried about cloud vendor independence.
- Amazon handles running the servers, OS and DB installation, and a host of other details that you would have had to deal with in the bad old days.
- They also handle patching of Redshift server versions on a scheduled basis, and release new and updated features regularly, often every two to three weeks.
- List of feature updates.
However, there are a number of “gotchas” associated with using Redshift. In the interest of giving those who are evaluating database and analytics platforms more information, I’d like to share some of these items with my readers so you can make a more informed decision. This discussion is detailed enough to require multiple blog entries, so this will be the first of a multi-part blog series discussing Redshift.
In this first blog post I’d like to discuss some of the existing issues that I have found with the COPY command and some improvements that might be helpful. In fact, this will veer into “What could be done to make it easier to load data into Redshift?” and not just “How could COPY be improved?”
So, how does COPY work?
If you want to insert any significant volume of data into Redshift, the recommended way to do it is to write the data to a file, place the file in S3, and then issue a COPY command from Redshift to load the data from the file into a Redshift table. If you have a large amount of data to load, you are advised to split it into multiple files of roughly the same size and put them in a common folder in S3 that your Redshift cluster has access to. This brings us to the first issue with the COPY command:
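The workflow above boils down to a single statement per load. Here is a minimal sketch, built as a Python string so it can be run through any Redshift-compatible driver; the table name, S3 prefix, and IAM role ARN are placeholders, not real values:

```python
# Build a COPY statement that loads every file sharing a common S3 prefix.
# All identifiers below (table, bucket, prefix, role ARN) are hypothetical.
def build_copy_statement(table, s3_prefix, iam_role):
    """Return a Redshift COPY statement for CSV files under s3_prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS CSV;"
    )

# COPY treats the FROM value as a prefix, so split files like part_0000.csv,
# part_0001.csv, ... under it are all loaded, in parallel across slices.
sql = build_copy_statement(
    "schema1.table1",
    "s3://my-bucket/loads/2024-01-01/part_",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
```

You would then execute `sql` through your usual connection (psycopg2, JDBC, and so on).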
1. Clumsy to use to load datasets that aren’t already in files
For example, it’s not uncommon to be doing data processing in a language like Python with the pandas library. You have a large matrix of data that you’ve processed in pandas, and you’d like to write it to Redshift for other users to query. You can use JDBC/ODBC to insert/update/delete data in Redshift, but doing so for more than a handful of rows is really, really slow. For example, this statement:
INSERT INTO schema1.table1 (column1, column2) VALUES (1, 'some text')
Running 10,000 of these statements sequentially in a loop will take minutes, which is an eternity in database time. To be fair to Redshift, this is intentional: it’s a column store, and it writes data in immutable 1 MB blocks at a time. You could improve things a bit with a multi-row insert, such as this:
INSERT INTO schema1.table1 (column1, column2) VALUES (1, 'some text'), (2, 'more text'), (3, 'even more text')
Amazon even recommends here that you do multi-row inserts if COPY commands simply won’t work for your use case. However, the maximum size of a single SQL statement submitted to Redshift is 16 MB, so for data sets of any size you’ll have to write a loop that parses your data set, chunks it into SQL statements of at most 16 MB each, and runs them individually. Still clumsy.
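That chunking loop is straightforward but tedious to get right. A rough sketch of what it has to do follows; the literal-quoting here is deliberately naive, and in practice you would lean on your driver's escaping:

```python
def chunked_inserts(table, columns, rows, max_bytes=16 * 1024 * 1024):
    """Yield multi-row INSERT statements, each under max_bytes in size."""
    header = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "

    def literal(value):
        # Naive SQL literal rendering -- real code should use driver escaping.
        if value is None:
            return "NULL"
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value).replace("'", "''") + "'"

    tuples, size = [], len(header)
    for row in rows:
        rendered = "(" + ", ".join(literal(v) for v in row) + ")"
        # +2 accounts for the ", " separator between value tuples.
        if tuples and size + len(rendered) + 2 > max_bytes:
            yield header + ", ".join(tuples) + ";"
            tuples, size = [], len(header)
        tuples.append(rendered)
        size += len(rendered) + 2
    if tuples:
        yield header + ", ".join(tuples) + ";"
```

Each yielded statement can then be submitted individually over JDBC/ODBC.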
My suggestion: Amazon should provide a wrapper API that transparently takes various data structures from common languages such as Python and Java and handles the task of parsing the data structures into files to copy to S3, running a COPY command to load them to Redshift, and performs most necessary error handling and plumbing for the user. Parameters to this API could include an S3 bucket to use for the COPY, an access key & secret key or IAM role to use to access Redshift, and the data set to load. Spark already provides a similar API to what I am suggesting (using DataFrames, although it’s not written or supported by Amazon) so this is a totally doable thing.
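To make the gap concrete, here is the first step such a wrapper would have to perform: serializing an in-memory data set to CSV. The S3 upload and COPY stages are sketched only in comments, since the bucket and credentials involved are assumptions:

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize an in-memory data set to CSV text, ready for upload to S3."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

payload = rows_to_csv([(1, "some text"), (2, "more text")])

# A full wrapper (hypothetical) would then:
#   1. split `payload` into similarly sized parts and upload each to S3
#      (e.g. with boto3's put_object)
#   2. execute: COPY schema1.table1 FROM 's3://bucket/prefix'
#               IAM_ROLE '...' FORMAT AS CSV;
#   3. surface any load errors back to the caller
```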
By the way, one might respond that Kinesis Firehose is more or less what I am looking for in this API. However, setting up Kinesis Firehose requires developers to learn and implement a whole other Amazon product with its own set of “gotchas”; it requires defining the schema not just in the Redshift database itself but also in a second copy in the Firehose definition; it doesn’t work if your Redshift cluster is not publicly accessible; and finally (the coup de grâce) it costs more money. The public-accessibility requirement is what killed it for us at various clients who have sensitive/PII data in their clusters, and in any case, why should we pay more money to use a tool that we’re already paying for?
You can certainly get third-party ETL tools such as Informatica that handle direct inserts into Redshift, or at least mask/hide the whole process of loading from S3 via COPY, but many people don’t have such tools available and have to use what comes with base Redshift.
2. Very restricted in terms of transformation logic on data import
Loading data from flat files into relational databases has been going on for as long as there have been relational databases. Oracle’s sqlldr has been around for at least 20 years to do this (and probably longer). Every relational database has a tool to do this.
sqlldr allows many different things: you can do complex validation and transformation of data on the way in. With COPY, you’re much more limited. You really have to map straight from the source files to the target columns of the one database table you’re loading into. The only flexibility it offers is mapping between columns in the source file and the target table, and only if you’re lucky enough to have some sort of schema embedded in the source file (such as a header row in a CSV file or keys in JSON).
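The two mapping mechanisms COPY does offer can be seen side by side below, built here as Python strings; the bucket, file names, and role ARN are placeholders:

```python
# With CSV, mapping is positional: you can list target columns and skip a
# header row, but COPY never matches header names to columns by itself.
csv_copy = (
    "COPY schema1.table1 (column1, column2) "
    "FROM 's3://my-bucket/data/file.csv' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' "
    "FORMAT AS CSV IGNOREHEADER 1;"
)

# With JSON, the 'auto' option maps object keys to like-named table columns.
json_copy = (
    "COPY schema1.table1 "
    "FROM 's3://my-bucket/data/file.json' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' "
    "FORMAT AS JSON 'auto';"
)
```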
To be fair to COPY it does do some validation, in particular a) checking that imported data matches the specified datatype in the target column and b) verifying NOT NULL constraints if you’ve added them. Keep in mind that you can create Primary Key, Foreign Key, and Unique constraints in Redshift but Redshift won’t enforce them and therefore COPY will happily import data that doesn’t correspond to your constraints.
It would be really handy to be able to create conditional logic to say “if a row starts with a number, insert it into table1; otherwise insert it into table2.” Or “if the first column has a number in it, pass it through as an integer, but if it’s a string, NULL it out.” Or even just “add an extra column to the staging table recording the name of the source file each row was loaded from.” That last item in particular would be hugely helpful to us.
I’ve run into many data sources that are less than clean; for example, complex escaping scenarios in the text that can’t be handled by the ESCAPE parameter. We could simply drop unclean rows on import, but that’s not what we want: we want to import all the data cleanly. So it’s necessary to do time-consuming preprocessing outside of Redshift.
What I usually end up doing with Redshift is writing a multi-step process that involves a) creating and loading a staging table that matches the schema of the files being imported, b) validating the data in the staging table, and c) importing from the staging table into whatever schema holds the transformed data that end users will query. This is doable, but like many other things on this list, it means more work for you and more steps in your processing, each of which can potentially introduce errors.
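Sketched as the SQL you end up running, the stage-validate-promote pattern looks like this; the table and column names and the duplicate-removal rule used for validation are all illustrative:

```python
# The three-step load pattern, expressed as the SQL statements you run in order.
# Every identifier and the validation rule below are hypothetical examples.
steps = [
    # a) staging table mirroring the target schema
    "CREATE TEMP TABLE stage_events (LIKE analytics.events);",
    # a) bulk-load the raw files into staging
    "COPY stage_events FROM 's3://my-bucket/events/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' FORMAT AS CSV;",
    # b) validate: Redshift won't enforce uniqueness, so do it yourself
    "DELETE FROM stage_events WHERE event_id IN "
    "(SELECT event_id FROM stage_events GROUP BY event_id HAVING COUNT(*) > 1);",
    # c) promote the cleaned rows into the table end users query
    "INSERT INTO analytics.events SELECT * FROM stage_events;",
]
```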
My suggestion: COPY should allow much more extensive transformation and validation logic on import, and it should offer “meta-columns”: special keywords that let you populate a column with the name of the file each row was loaded from. It would also be helpful to have meta-columns for the row number within the file, the type of file, and so on.
3. You end up having to design your ETL process AND your database schema around the COPY command when loading external data into Redshift
If you want high-performance data loads, there are a lot of details to think about.
One of Amazon’s selling points for Redshift is that it takes away much of the complexity and difficulty of managing a database system: you can load and analyze data more quickly and easily with less training, you don’t need as many DBAs to maintain the system, and the transition from other platforms shouldn’t be difficult. However, if you look through the details of the Redshift documentation, particularly around cluster maintenance, table design, and data loading, it quickly gets complex.
My suggestion: Amazon should put more effort into simplifying database management for users of Redshift. There needs to be a more holistic focus on insulating users from the details of database platform management and on improving the existing tools to manage and load data. As Antoine de Saint-Exupéry said, “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”