Data Lake

From Clinfowiki
Revision as of 15:56, 27 October 2020 by Nahata5

A data lake is a central repository that stores structured and unstructured data from many sources and lets that data flow onward. The concept is akin to a lake fed by multiple streams: data fills the reservoir and is stored as is before it flows out to various applications within an organization.

Functions of a Data Lake

Data Ingestion

  • Tools

Data Storage and Retention

  • Tools

Data Processing

  • Tools

Data Access

  • Tools


Difference from Data Warehouse

Depending on its requirements, a typical organization will need both a data warehouse and a data lake, as they serve different needs and use cases.

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, whose results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust.
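
As a minimal sketch of what "schema defined in advance" means, the following Python example uses the standard-library sqlite3 module as a small stand-in for a warehouse. The table name lab_results, its columns, and the sample rows are hypothetical and chosen only for illustration; a production warehouse would use a dedicated analytical database.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory database whose schema is fixed before any data arrives."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE lab_results (
            patient_id INTEGER NOT NULL,
            test_name  TEXT    NOT NULL,
            value      REAL    NOT NULL
        )
        """
    )
    return conn


def load_rows(conn, rows):
    """Insert rows that have already been cleaned to match the schema."""
    conn.executemany(
        "INSERT INTO lab_results (patient_id, test_name, value) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def average_by_test(conn):
    """Run a reporting-style SQL query against the predefined columns."""
    cur = conn.execute(
        "SELECT test_name, AVG(value) FROM lab_results "
        "GROUP BY test_name ORDER BY test_name"
    )
    return cur.fetchall()


# Hypothetical, already-cleaned sample rows.
conn = build_warehouse()
load_rows(conn, [(1, "glucose", 90.0), (2, "glucose", 110.0), (1, "sodium", 140.0)])
print(average_by_test(conn))
```

Because the columns and their constraints exist before loading, a row that does not fit (for example, one with a missing test name) is rejected at write time instead of surfacing later during analysis.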

A data lake is different because it stores relational data from line-of-business applications as well as non-relational data from mobile apps, IoT devices, and social media. The structure of the data, or schema, is not defined when data is captured. This means you can store all of your data without careful design or the need to know what questions you might need answered in the future. Different types of analytics, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, can then be used to uncover insights.
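
The idea that the schema is not defined at capture time can be sketched as follows. This is an illustrative Python example only: the record contents, the field names, and the helper functions ingest and read_with_schema are hypothetical, and a plain in-memory list stands in for the lake's storage layer.

```python
import json

# Hypothetical raw records from different sources, kept as-is (one JSON document per line).
RAW_LINES = [
    '{"source": "ehr", "patient_id": 1, "test_name": "glucose", "value": 90.0}',
    '{"source": "wearable", "device_id": "w-17", "heart_rate": 72}',
    '{"source": "ehr", "patient_id": 2, "test_name": "glucose", "value": 110.0}',
]


def ingest(lines):
    """Store records without validating them against any schema."""
    return list(lines)


def read_with_schema(store, required_fields):
    """Apply a schema at read time: keep only records that carry every required field."""
    selected = []
    for line in store:
        record = json.loads(line)
        if all(field in record for field in required_fields):
            selected.append({field: record[field] for field in required_fields})
    return selected


lake = ingest(RAW_LINES)

# Two different questions asked of the same raw data, each with its own schema.
glucose = read_with_schema(lake, ["patient_id", "value"])
heart = read_with_schema(lake, ["device_id", "heart_rate"])
print(glucose)
print(heart)
```

Every record is accepted at ingestion regardless of its shape, and each consumer decides which fields it needs only when it reads the data.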

Data Swamp

A data lake that becomes unruly is referred to as a data swamp.


References

https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Submitted by Tom Nahass