A data lake is a central repository that stores both structured and unstructured data from many sources. The name comes from the analogy of a lake fed by multiple streams: data flows in and is stored as is, then flows out to the applications across an organization that need it.

Functions of a Data Lake

Data Ingestion

Tools

Data Storage and Retention

Tools

Data Processing

Tools

Data Access

Tools

Difference from Data Warehouse

A data warehouse is a database for relational data. Data is extracted, cleaned, and transformed before it is stored in a predefined schema, and the stored data is optimized for fast SQL queries.
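The warehouse pattern described above (clean and transform first, then store in a fixed schema) can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module as a stand-in for a warehouse; the `lab_result` table, the `RAW_ROWS` feed, and the `clean` and `load_warehouse` helpers are all hypothetical names chosen for the example, not part of any particular product.

```python
import sqlite3

# Hypothetical raw lab feed; field names and values are illustrative only.
RAW_ROWS = [
    {"patient_id": " 001 ", "test": "Glucose", "value": "5.4", "unit": "mmol/L"},
    {"patient_id": "002", "test": "glucose", "value": "n/a", "unit": "mmol/L"},
    {"patient_id": "003", "test": "GLUCOSE", "value": "6.1", "unit": "mmol/L"},
]


def clean(row):
    """Return a tuple matching the warehouse schema, or None if unusable."""
    try:
        value = float(row["value"])
    except ValueError:
        return None  # rows that cannot be coerced never reach the warehouse
    return (row["patient_id"].strip(), row["test"].strip().lower(), value, row["unit"])


def load_warehouse(rows):
    """Create the predefined schema, then load only the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lab_result ("
        "patient_id TEXT NOT NULL, test TEXT NOT NULL, "
        "value REAL NOT NULL, unit TEXT NOT NULL)"
    )
    cleaned = [c for c in (clean(r) for r in rows) if c is not None]
    conn.executemany("INSERT INTO lab_result VALUES (?, ?, ?, ?)", cleaned)
    conn.commit()
    return conn


conn = load_warehouse(RAW_ROWS)
row_count = conn.execute("SELECT COUNT(*) FROM lab_result").fetchone()[0]
avg_glucose = conn.execute(
    "SELECT AVG(value) FROM lab_result WHERE test = 'glucose'"
).fetchone()[0]
```

Because the schema is enforced at load time, the malformed row is dropped before storage and the SQL query can assume every stored value is numeric.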

A data lake stores raw data from both relational and non-relational sources without forcing it into a single database schema. The extract, transform, and load steps are performed within the data lake according to the analytics each area of the organization requires, and the results are delivered to each client according to its needs.^[1] In the clinical setting, this allows free-text progress notes, laboratory observations, and imaging data to be stored in the same central location and analyzed together.
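The idea of storing data as is and applying structure only when it is read can be sketched as follows. This is a minimal illustration that uses a temporary local directory as a stand-in for the lake; the `raw/<source>/<name>` layout, the `ingest`, `read_labs`, and `read_notes` helpers, and the sample records are assumptions made for the example rather than a standard layout or API.

```python
import json
import tempfile
from pathlib import Path


def ingest(lake_root, source, name, payload):
    """Write bytes unchanged under <lake_root>/raw/<source>/<name>."""
    target = Path(lake_root) / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_labs(lake_root):
    """Parse lab JSON into typed rows only when a consumer asks for it."""
    rows = []
    for path in sorted((Path(lake_root) / "raw" / "labs").glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        rows.append(
            {"patient_id": record["patient_id"], "value": float(record["value"])}
        )
    return rows


def read_notes(lake_root):
    """Return free-text notes keyed by file stem (here, the patient id)."""
    notes = {}
    for path in sorted((Path(lake_root) / "raw" / "notes").glob("*.txt")):
        notes[path.stem] = path.read_text(encoding="utf-8")
    return notes


# Ingest two differently shaped sources without transforming either one.
lake = tempfile.mkdtemp()
lab_payload = json.dumps({"patient_id": "001", "test": "glucose", "value": "5.4"})
ingest(lake, "labs", "001.json", lab_payload.encode("utf-8"))
ingest(lake, "notes", "001.txt", "Patient reports improved appetite.".encode("utf-8"))

# A consumer combines both sources at read time.
labs = read_labs(lake)
notes = read_notes(lake)
combined = [{**lab, "note": notes.get(lab["patient_id"], "")} for lab in labs]
```

Here the structured lab value and the free-text note sit side by side in the same store, and each consumer decides how to parse and join them when it reads.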

Data Swamp

A data lake that becomes unruly, for example because data accumulates without cataloging or governance and can no longer be found or trusted, is called a data swamp.

References

[1] Holmes DE. Big data [Internet]. Amazon. Oxford University Press; 2017 [cited 2020 Oct 26]. Available from: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Submitted by Tom Nahass
