Data Lakes – An Agile Approach to Information Management
What is a data lake?
A data lake is a storage repository that holds large amounts of disparate data, which can be stored in its native or raw format. The popularity of data lakes has grown as more companies enter the big data arena and more tools become available to query autonomous data. By decoupling storage from compute, companies can store any data set, large or small, for current and future use at minimal cost. Additionally, metadata descriptors are indexed and stored alongside the data, enabling search-engine-like capabilities for locating data sets that may align with a business or project need.
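To make the metadata idea concrete, here is a minimal sketch of a searchable catalog. It is a toy, in-memory illustration only: the `DatasetEntry` and `MetadataCatalog` names, the file paths, and the tags are all hypothetical and do not correspond to any particular product's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical metadata descriptor for one raw data set in the lake."""
    path: str                                   # where the raw file is stored
    fmt: str                                    # native format, e.g. "csv" or "json"
    tags: set[str] = field(default_factory=set)
    description: str = ""


class MetadataCatalog:
    """Toy in-memory index over descriptors (an assumption, not a real API)."""

    def __init__(self) -> None:
        self._entries: list[DatasetEntry] = []

    def register(self, entry: DatasetEntry) -> None:
        # Only the descriptor is indexed; the raw data itself is left untouched.
        self._entries.append(entry)

    def search(self, keyword: str) -> list[DatasetEntry]:
        """Return entries whose tags or description mention the keyword."""
        needle = keyword.lower()
        return [
            e for e in self._entries
            if needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]


catalog = MetadataCatalog()
catalog.register(DatasetEntry("raw/web/clicks-2020.json", "json",
                              {"clickstream", "marketing"},
                              "Raw website click events"))
catalog.register(DatasetEntry("raw/erp/invoices.csv", "csv",
                              {"finance"},
                              "Invoice export from the ERP system"))

matches = [e.path for e in catalog.search("marketing")]
print(matches)
```

An analyst looking for marketing-related data would search the catalog first and open only the matching raw files, rather than scanning the whole lake.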
How are data lakes different from data warehouses?
Data residing in traditional data warehouses normally goes through multiple layers of hygiene, standardization algorithms, and validation against a master source. Data is typically structured using star or snowflake schemas and is bound by the data type assigned to each column. Changes can require considerable effort and testing to ensure downstream processes, such as reports, will not be adversely affected.
The following highlights some of the key differentiators between data warehouses and data lakes:
Data lakes should not be viewed as a replacement for data warehouses. Data in a lake is not held to the same stringent vetting and inspection processes required of warehouse data. Instead, data lakes are designed with innovation in mind, providing data scientists and analysts with a platform for data experiments.
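The contrast can be illustrated with a small sketch. It assumes a hypothetical two-column warehouse table and models the lake as a plain list of raw strings; the function names and sample records are invented for illustration. The warehouse path validates types when data is written, while the lake path stores the payload as received and interprets it only when it is read (an approach often called schema-on-read).

```python
import json

# Hypothetical warehouse table definition: each column is bound to one type.
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}


def load_into_warehouse(record: dict) -> dict:
    """Validate on write: reject a record that does not match the column types."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    return record


def store_in_lake(lake: list, raw_line: str) -> None:
    """Keep the payload exactly as received; nothing is validated at write time."""
    lake.append(raw_line)


def read_amounts(lake: list) -> list:
    """Interpret raw data at read time, skipping anything that cannot be parsed."""
    amounts = []
    for raw_line in lake:
        try:
            amounts.append(float(json.loads(raw_line)["amount"]))
        except (ValueError, KeyError, TypeError):
            continue
    return amounts


lake = []
store_in_lake(lake, '{"customer_id": 1, "amount": 19.99}')
store_in_lake(lake, '{"customer_id": "A-2", "amount": "5", "note": "free text"}')
store_in_lake(lake, 'not even json')

print(read_amounts(lake))
```

All three payloads are accepted by the lake, but only the first would pass the warehouse check. That flexibility suits experimentation, and it is also why unvetted lake data should not be treated as a substitute for warehouse data.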
Contact us to discuss how data lakes can transform the way your company manages its data assets.