Guest Blog: Jeremy Winters, www.full360.com
As an AWS Premier partner, and the only partner with expertise in columnar databases in the cloud, Full360 was given earliest preview access to Amazon Redshift, the new cloud-based columnar data warehouse offering which was announced at the Re:Invent conference in November. Here are some of my initial impressions and experiences working with it over the past month.
Setup and Configuration
Setting up a Redshift cluster is a quick and easy process. You just define a few high-level parameters such as cluster, database, and master user name, pick the size and quantity of nodes you want, and AWS does the rest. Within a few minutes your cluster will be built and ready for you to configure and build the database.
Cluster and database configuration is performed through the web UI, or via SQL statements over a JDBC/ODBC connection. You do not have access to the command line on any of the Redshift servers. This approach forces users to treat the database as a service (which is Amazon’s intent) and a data source, as opposed to mucking around with OS level software, settings, and related issues.
I had some issues with the JDBC driver in the first round, as it didn’t work with all my usual client software. Everything works fine if you use AWS’s recommendation of SQLWorkbenchJ. The driver also worked fine in Talend Open Studio, but only if I used the JDBC components, not the Postgres components.
If I were developing an application on top of Redshift right now, I would use APIs that are based on generic JDBC or ODBC drivers (such as DBI) as opposed to something like a Postgres rubygem. I suspect that Amazon will iron this stuff out in the future, easing the transition of an existing Postgres based application to Redshift.
The AWS console portion of Redshift is very impressive. One nice feature of these screens is the way the queries are graphed along with the performance metrics. You can hover over a specific query and the metrics for the query duration are automatically highlighted.
Another tab in the UI allows you to browse a list of all queries executed against the database. Any time you see a query ID in the interface, it is a hyperlink that will take you to the detail page for that query. In addition to the basic query info and related performance stats, this page tells you the explain plan for the query. If the SQL statement is a data load, it will provide a list of file names which were loaded during the query execution!
Not having direct access to the server back-end means that bulk data loads cannot be performed from the local file system. Redshift handles this by providing direct access to files stored in S3. Even better… if you have your data set broken down into smaller files, they will automatically be loaded in parallel. Loading my single-file, 1B row fact table into Redshift took 4 hours 35 minutes, but breaking the table into 40 smaller files resulted in a 7 minute load, which is nearly a linear improvement at 38X!
While developing my Redshift demo, I used a repeatable SQL script which rebuilt the entire database each time I ran it. This script involves DDL, data loading, and some data transformation with large insert/select queries. The end-to-end script execution typically takes around 25 minutes to complete.
Architecture and Tuning
Redshift, like other columnar databases, optimizes the disk storage to meet query performance in the following ways:
- Each column is stored in a separate set of files.
- Columns are compressed with the encoding scheme appropriate to the nature of the data stored.
- Sort order for the columns allows for faster predicate lookups, merge joins and aggregate grouping.
- Records in a single table are distributed across nodes. Fact and large dimension tables can be distributed on the foreign key fields in order to keep the joins local to the node.
In Redshift, you have to specify your encoding (compression) types, sort orders, and compression types in the CREATE TABLE statement. While Redshift does provide a function that allows you to analyze a table and determine the best encoding scheme (this can be performed automatically upon first data load), it does not currently have any mechanism that will define your sort orders or distribution keys for you.
A limitation here is that a single encoding/sorting/distribution scheme will not optimize for all cases. Other columnar databases provide the ability to have multiple projections of a single table, each with different encoding and sort orders, but Redshift requires you do this in a separate table.
After a few rounds of tweaking the encoding and sort orders in my fact table, I was able to get 1-2 second performance when querying for the use case I optimized for. In other columnar databases, I was able to get sub-second query times for the same data set. In the case of BI reporting, I think that 1-2 seconds is pretty reasonable… especially with a billion-row fact table!
In my tests, I purposefully ran queries which are outside of the sort order and encoding scheme to see how the database handled them. These queries tend to run in the 10-60 second range… which is comparable (and even a bit better) than other columnar options I have tested.
Note that keeping up good performance requires ongoing maintenance, such as updating the statistics and running the “vacuum” statement which makes sure that your newly loaded data is completely optimized for querying. These are simple tasks to perform, but at this point you will have to set up the automation yourself. A good time to run these functions is immediately after a data load is completed.
I expect to see a lot of companies jumping right into Redshift as soon as it’s released. The price is unbeatable, and getting started is easy to do. Companies that are currently throwing away detailed data will now be able to retain it inexpensively in S3 and report on it for as long as they desire.
BI and general reporting use cases should work well with Redshift, though I have yet to see it deliver the sub-second performance which may be critical for some analytic applications. Also, the fact that if any node in the cluster fails, the entire cluster fails might be cause for some concern. To be fair, AWS has indicated that it will only take a few minutes to automatically replace the failed node, but this may not be acceptable for some customer-facing applications. This same concern applies to the mandatory maintenance window which you are required to schedule each week.
The thing that I suspect will be the real hurdle for most companies is going to be database design and optimization. If you’re not familiar with columnar databases, you have a definite learning curve to get past. Approaches for optimizing row-based systems, such as normalization and indexing, do not really apply to columnar databases. You really need to understand your data and the way it is going to be used if you want to squeeze the magic out of the system.