Difference between revisions of "Data Lake"

From Clinfowiki

Revision as of 13:02, 27 October 2020

A data lake is a central repository that stores structured and unstructured data from many sources and lets it flow on to other systems. The concept is akin to a lake fed by multiple streams: data fills the reservoir and is stored as is, before it flows out to various applications within an organization.

Functions of a Data Lake

Data Ingestion

  • Tools

Data Storage and Retention

  • Tools

Data Processing

  • Tools

Data Access

  • Tools
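The four functions above can be pictured with a toy in-memory model. This is only a rough sketch: the `TinyLake` class, its method names, the object keys, and the sample payloads are all hypothetical and made up for illustration. A real data lake would sit on dedicated storage and processing tools.

```python
import json


class TinyLake:
    """Toy in-memory stand-in for a data lake (hypothetical, illustration only)."""

    def __init__(self):
        # key -> raw payload, stored exactly as received
        self._objects = {}

    def ingest(self, key, raw):
        """Data ingestion: accept a payload from any source without parsing it."""
        self._objects[key] = raw

    def expire(self, prefix):
        """Data storage and retention: drop every object whose key starts with prefix."""
        for key in [k for k in self._objects if k.startswith(prefix)]:
            del self._objects[key]

    def process(self, key, parser):
        """Data processing: parse a stored payload only when it is needed."""
        return parser(self._objects[key])

    def keys(self):
        """Data access: list the objects available to downstream applications."""
        return sorted(self._objects)


# Made-up sample data: one structured source, one unstructured source.
lake = TinyLake()
lake.ingest("labs/2020-10-27.json", '{"patient": 1, "glucose": 5.4}')
lake.ingest("notes/2020-10-27.txt", "Patient reports no symptoms.")

# The JSON payload is parsed only at processing time, not at ingestion.
record = lake.process("labs/2020-10-27.json", json.loads)
```

Both payloads are stored untouched. Only the call to `process` decides how a given object is interpreted.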


Difference from Data Warehouse

A typical data warehouse stores its data in structured formats with a predetermined schema. The extract, transform, load (ETL) steps are performed before the data is stored. A data lake takes data sources in various forms and stores them as is. This allows for cheaper storage, and the transformation steps are deferred until the data is sent to the end user for analysis or dashboards.
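The contrast can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: the two-column schema, the function names, and the sample records are assumptions invented for the example.

```python
import json

# Assumed, predetermined warehouse schema (hypothetical).
WAREHOUSE_SCHEMA = ("id", "name")


def load_into_warehouse(table, raw):
    """Warehouse: transform and validate before storage."""
    record = json.loads(raw)
    # Raises KeyError if the source lacks a column the schema requires.
    row = tuple(record[col] for col in WAREHOUSE_SCHEMA)
    table.append(row)


def load_into_lake(lake, raw):
    """Lake: store the source as is, with no parsing or validation."""
    lake.append(raw)


def read_from_lake(lake):
    """Lake: transform on the way out, when the data is sent to the end user."""
    rows = []
    for raw in lake:
        record = json.loads(raw)
        rows.append((record.get("id"), record.get("name")))
    return rows


# Made-up sources; the second one does not fit the warehouse schema.
sources = ['{"id": 1, "name": "a"}', '{"id": 2}']
warehouse, lake = [], []
rejected = 0
for raw in sources:
    load_into_lake(lake, raw)
    try:
        load_into_warehouse(warehouse, raw)
    except KeyError:
        rejected += 1
```

In this sketch the warehouse rejects the record that does not match its schema at load time. The lake keeps both records, and the missing field only surfaces when `read_from_lake` is called.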

Data Swamp

A data swamp is what a data lake becomes when it grows unruly.


References

https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Submitted by Tom Nahass