This is the second in an ongoing series of posts about potential improvements to Amazon Redshift. Click here for the first post in the series, which was about the COPY command. This is based on work with clients of OnPrem Solution Partners.
Let’s move on to discuss some other ideas for improvements in Redshift.
When rows are deleted or updated in Redshift (an update is processed as a delete and then an insert), the original row is not actually removed from the underlying block storage in the database. You have to manually run a process called VACUUM to “clean” the row out from disk and reclaim the space.
Also, If your table is sorted using sortkeys, any new data inserted will not be sorted in the larger context of the table and therefore any queries run in the meantime will suffer some level of performance degradation until a VACUUM is run on the table to sort newly inserted rows. The level of performance degradation will depend upon the size of the unsorted area versus the overall size of the table. It is not always easy to insert data in sorted order and in some cases depending on the source data that is coming to you, you may not be able to insert in sorted order at all. Therefore, you will need to run VACUUM to sort the new data that has been inserted so that your queries can take advantage of the sortkey. This is doubly true if you are using interleaved sortkeys, which are not sorted by the default VACUUM command — the special VACUUM REINDEX syntax will need to be used.
Some challenges with VACUUM
- VACUUM can take hours or even days for really large tables, and there are a number of reports currently on the Redshift forums about VACUUM aborting or not ending properly. Unfortunately I can’t link to the forums from here as they require an AWS logon.
- In fact, you really should design your tables and your ETL ingest process entirely around accommodating the requirements of the VACUUM process.
- For a database system where one of the main selling points is “We (AWS) manage a lot of the dirty details that you managed before” you have to spend a surprising amount of effort dealing with VACUUMing tables to keep performance and space usage at reasonable levels.
- At one of our clients, we’ve hit some behavior recently where a table of only ~17000 rows was taking ~44 GB of space in Redshift. The response we received back from AWS support when we reporting it was “This is expected behavior when a table without a sortkey gets a small number of rows trickling in over time. You can fix it by making a copy of the table and adding a sortkey so the problem doesn’t happen again in the future.”
- That’s not a helpful response when we have thousands of tables to manage that are constantly being updated. It should be the job of the vendor, not the system user, to fix storage management so that data takes up a reasonable amount of space representative to the semantic meaning it represents at any given point in time.
Amazon should review how the VACUUM command works and assess how the needs of VACUUM start to drive the whole design of the warehouse. They should be more transparent about this with users up-front instead of selling Redshift as a no-DBA solution. It would be helpful if more parts of VACUUM could be automated during periods of low activity in the warehouse, and if VACUUM could always be running a “slow cleanup” in the background.
It would also be helpful for Amazon to acknowledge the current issues with table fragmentation related to tables without sortkeys and how this requires users to add sortkeys to tables and/or do deep copies to keep the tables from using up too much space unneccessarily. In fact, they could even add a new mode/scan type to VACUUM to compact tables and clean up/rewrite fragmented tables on disk transparently without users have to do deep copies.
If I had to guess, I would say that Amazon is probably struggling internally with the architecture that they’ve chosen — branching off of a combination of Paraccel and Postgresql — and VACUUM is a necessary part of that architecture and related to how the data storage/IO layer of Redshift works. Never the less, at the end of the day you and I are buying a service from Amazon, and these are significant inconveniences for users .
One of the major selling points of Redshift (and cloud computing in general) is the ability to dynamically scale hardware resources and therefore costs up and down based on demand at any given point in time.
This is achieved in Redshift by resizing a cluster. Resizing a cluster can involve a) changing from the current node type to a different node type but retaining the same number of nodes, b) changing the number of nodes but retaining the same node type, or c) both. When you resize a cluster, there are some limitations. The most important one is that only read-only queries can be run during the period the cluster is resizing, and they will experience reduced performance.
Here are some of the challenges around cluster resizing:
- Cluster resizing time-consuming and may require some downtime
- Let’s say you have a 10-node dc1.xlarge Redshift cluster, which would give you 160 GB of storage space per node for a total of 1.6TB of storage space for data. If your cluster is getting pretty full — say 75%, 1.2 TB, you might want to add some more servers to the cluster. Redshift allows resizing of an existing cluster — you can just go into the GUI and click “Resize” and tell Amazon how many nodes you’d like the cluster to have. You can even change node types between dc and ds.
- Behind the scenes, what Amazon actually does is create a whole new cluster with the requested # and type of nodes and then copy the data over from the existing cluster to the new cluster. This can take quite a while. I’ve seen it complete as quickly as 30 minutes for a mostly-empty cluster with up to 12 to 14 hours for larger clusters running on dense storage nodes (which use HDDs instead of SSDs and therefore data reading and writing is slower.)
- Inconvenient to interrupt ETL operations and writes and painful to communicate what is essentially a partial system outage to users
- While the resize is going on, the cluster is read-only and will probably experience reduced read performance. As noted previously, we have seen cluster resizes take anywhere from 30 minutes to up to 14 hours depending on the number and type of nodes in the source and destination clusters.
- We have many ETL jobs that run frequently — every hour to two hours — and so dealing with a no-write scenario for up to 14 hours (or potentially more as our data volumes grow) is painful. Our users expect data to be up-to-date 24×7 and to lag only 2 to 3 hours behind data in source systems.
- If we do want to resize our cluster, we have to:
- Send out an outage notice to our users that data updates to the system will be turned off for up to a day to allow for the various items below to be completed
- Turn off existing ETL jobs
- Initiate the cluster resize
- Monitor the cluster resize job for completion
- Process backlog for incoming data (usually files, unfortunately, from the vendors that we are importing data from)
- QA backlog processing
- Re-enable automated ETL ingest
- Notify users that the cluster resize is complete
- This was a lot of work…!
- Scaling is high-friction compared to other AWS functions and cloud computing expectations in general.
- Many other scaling functions, such as for EC2 instances, can be done in minutes or less in AWS. If you have a family of application processing servers/webservers which are running under heavy load, you can easily just add another server or servers to the the already existing machines without disturbing existing processing. Yes, you can technically spin up a new Redshift cluster in about 10 to 15 minutes in an automated fashion with the CLI, but modifying existing Redshift clusters with significant amounts of data is a much more involved and time-consuming proposition.
- it is hard to explain to users and non-technical executives why some parts of the AWS infrastructure can be so easily and dynamically scaled where other parts can take many hours to scale and require partial outages to do so.
- I’m not sure if there is a great solution to this particular issue given Redshift’s current architecture. Amazon would probably need to split compute from disk storage in the way that Snowflake has, but that would be a total re-architecture of the product.
- Redshift should have the ability to just add nodes (of the same type) to an existing cluster and then quietly, asynchronously re-distribute data between existing and added nodes. It might take longer then the current resize operation but it would be a lot easier for operations managers/ETL folks/DBAs.
- This would also be helpful for the removal of nodes of the same type from a cluster as long as a sufficient amount of storage space is available after the nodes are removed.
- This would obviously not work if switching to a different node type — in that scenario the current process of creating a new cluster in the background and copying data over to it would have to be maintained.
- Redshift should provide an estimate of how long a cluster resize is going to take before you execute the resize. It’s really hard to plan for cluster resizing when you don’t know if it is going to take 2 hours or 10 hours. Even if providing this estimate itself takes 15 or 30 minutes to scan the data in the source cluster.
- Amazon suggests here that you take a snapshot of your existing cluster and then restore that snapshot to get an estimate of how long a cluster resize will take. Yes, I could technically do that — but again, the overarching theme of my series of posts on Redshift is asking why I, as a user of the tool, have to put in extra work to determine or accomplish things that the tool should be capable of telling me itself?
I think this is enough for today in terms of potential improvements. I definitely have more ideas and this series of posts will continue.
Also, another interesting blog post here about Redshift Pitfalls.