AWS Redshift Case Study
The retail vertical is a challenging space to efficiently collect and manage data. Customer and household data are constantly in motion as deduping and other merging techniques affect downstream processes for recognizing sales to the correct customer. Tracking and reporting on marketing campaigns and their effectiveness for driving incremental sales adds an additional layer of data complexity that requires a fast and scalable architecture.
The Challenge
Our client needed a data warehouse platform with more compute than what an on-premise installation could provide. Reporting on marketing tactics and sales performance resulted in queries that ran for hours in some cases while consuming all resources on a dedicated box. SSD drives helped with I/O, however it was still a major bottleneck. The data warehouse design followed a traditional star-schema and leveraged data partitioning but failed to deliver due to performance issues.
Incoming files from the client arrived daily containing POS data for the previous day for all stores across the chain, new and existing customer related data to be updated or merged, promotion data, as well as other tables that tracked segmentation and other attributes across 40+MM customers. Loading and processing the daily incoming files took on average 2.5 hours to complete. PII data was not present.
The Implementation
Redshift’s scalable dense compute nodes were an easy choice. We knew Redshift’s column level compression algorithms would reduce the total data foot print compared to what was occupied on-premise. Rather than taking a lift and shift approach, table design was enhanced to allow optimal distribution of data across the cluster nodes.
S3 was leveraged as a data repository that provided low cost storage as well as a staging area that allowed parallel data loads into Redshift.
The ETL was replaced by table driven commands that were executed by an orchestration layer. All transform was done after data arrived into the data load schema.
The Benefits
Query times against data stored in Redshift saw as much as a 30x improvement in speed.
The ETL to ELT approach with parallel data loads also ran much faster so data was available sooner for reporting without taxing the entire system while new data loaded. Redshift’s Workload Management Queues allowed us to limit resources consumed by the ETL while maintaining fast query times for end users. Short query acceleration was enabled to prioritize queries the optimizer deemed as small and fast to return ahead of larger queries that consumed more cluster resources.
Redshift’s TCO was considerably lower than using Hadoop as well as on-premise per core licensing that the current platform was reliant on.