Data Warehousing


A data warehouse (DW), sometimes also known as an enterprise data warehouse (EDW), is a large store of accumulated data from many sources. The system has become critical in supporting large enterprises because of its utility in reporting and data analysis, and it is a core component of business intelligence today.

As personal computers and data systems spread, system architectures became very complex. An example of this is a “Spider Web Environment” showing the linkages between all the different computers. Studying and maintaining data from such an unwieldy system became stressful. Computers’ roles grew from handling automation, to handling large volumes of data, to finally supporting managerial and strategic decision making; they are now at the core of business development. Since data management is so central to a successful business, a new way of managing data was needed: an accessible central repository became necessary.

The world is projected to have created forty zettabytes of data by 2020; one zettabyte is equal to one trillion gigabytes. How can we organize and store this data effectively?



History of Data Warehousing

1970 – First use of the term and development of founding principles

1970s – ACNielsen provided clients with something called a “data mart” to support sales efforts

1988 – An IBM Systems Journal article, “An Architecture for a Business Information System,” coined the term “business data warehouse”

1990 – Bill Inmon, named in 2007 by Computerworld as one of the “Ten IT people who mattered in the last 40 years,” formed his first company, Prism Solutions. One of Prism’s main products was the Prism Warehouse Manager, one of the first industry tools for creating and managing a data warehouse.

1992 – Inmon published Building the Data Warehouse, a seminal volume in the industry. Inmon advocated for a large, centralized data warehouse built on solid relational design.

1996 – Ralph Kimball published The Data Warehouse Toolkit, which provided the theoretical background for the concepts surrounding data warehousing. Kimball advocated for individual data marts at the departmental level, integrated using an information bus architecture.

2012 – Bill Inmon developed and made public technology known as “textual disambiguation.” Textual disambiguation applies context to raw text and reformats the raw text and context into a standard database format. It is useful wherever raw text is found, such as in documents, email, and so forth.




History of Data Warehousing in Healthcare

1972 – HELP (Health Evaluation through Logical Processing) was the first hospital information system to collect patient data needed for clinical decision-making and, at the same time, incorporate a medical knowledge base and inference engine to assist the clinician in making decisions. The HELP hospital information system has been operational at LDS Hospital since 1967. The system initially supported a heart catheterization laboratory and a post-open-heart intensive care unit. Since the initial installation, the system has been expanded into an integrated hospital information system providing services with sophisticated clinical decision-support capabilities.

1987 – Computer Programs to Support Clinical Decision Making, by Edward H. Shortliffe, MD, PhD, was published.

1994 – VOSER, a controlled medical vocabulary server, was created to support the exchange of patient data and medical decision logic.



The Four V's of Big Data

- Volume – Scale of data. How much data? 90% of today’s data has been created in the last two years.

- Variety – Different forms and views; the diversity of data. What kind of data? 90% of generated data is “unstructured.”

- Veracity – Certainty of data. How true is this data? Poor data quality costs the US economy around $3.1 trillion a year.

- Velocity – Speed of ingestion. How fast can we manipulate data?

- Value – The fifth “V”: the ability to achieve value through insights from superior analytics. How much value is in this data?

[1]

Key Tenets of Data Quality

Accuracy – Data should be precise for its intended use. If possible, data should be captured only once, although it may have multiple uses. Data that is not accurate is worse than no data at all.

Completeness – There should be clear data requirements specified for the information needs of the organization. Incomplete data is not useful.

Timeliness – Data should be captured as quickly as possible and made available in a timely manner, quickly and frequently enough to support information needs and to influence service or management decisions. Data that is not timely is less than desirable.

Single Source / Reliability – There should be processes for stable and consistent data collection across collection points and over time, so that progress toward performance targets reflects real changes rather than variations in data collection approaches or methods. When multiple versions of data exist, relying on the wrong value can lead to bad decisions; there must be one source of truth for data.

Validity / Documentation – Data should be recorded and used in compliance with relevant requirements, including the correct application of any rules or definitions. This ensures consistency between periods and with similar organizations, measuring what is intended to be measured. Data without documentation is of questionable value.



Components of a data warehouse

- Subject oriented – Data is organized so that related data is linked together. Warehouses are curated to help you analyze data: if you need to learn more about patients, you can build a warehouse that concentrates on patient data and answers questions about it.

- Nonvolatile – Once stored and committed, data is read-only and is never deleted to make room for newer data. Put simply, data does not change in your data warehouse: everything going in is the same coming out.

- Time Variant – The time at which data changes is recorded and tracked in the warehouse. A data warehouse should be able to show any changes that occurred over time, which enables users to visualize trends and changes over time (a sketch of one way to store this follows the list).

- Integrated – Instead of looking at data parochially, the warehouse allows analysts to look at it collectively; it allows for a corporation-wide view. We can use Google as an example: when searching for a key phrase, you don’t have to trawl through hundreds of web pages; rather, Google pulls it all together and displays it for you. You only need one stop: Google.
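Below is a minimal SQL sketch of the time-variant idea above; the patient_address_history table, its columns, and the dates are hypothetical and shown in generic ANSI-style SQL only for illustration.

 -- Time-variant storage: old rows are never overwritten; every change adds a
 -- new row together with the period during which it was current.
 CREATE TABLE patient_address_history (
     patient_id  INT,
     address     VARCHAR(200),
     valid_from  DATE,
     valid_to    DATE            -- NULL means "still current"
 );
 -- Recover the address that was current on a given date:
 SELECT address
 FROM   patient_address_history
 WHERE  patient_id = 42
   AND  valid_from <= DATE '2016-07-01'
   AND  (valid_to IS NULL OR valid_to > DATE '2016-07-01');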




A Single View of Truth

A single source of truth can be constructed as a composite view of record that spans multiple data sources. If there is a difference of opinion, the single source can settle it.
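As a minimal sketch of such a composite view, the generic SQL below assumes two hypothetical source tables, reg_patients (registration) and bill_patients (billing), and picks one authoritative value per field; every name here is illustrative.

 -- A composite "single view of truth" spanning two source systems.
 CREATE VIEW patient_master AS
 SELECT r.patient_id,
        COALESCE(r.patient_name, b.patient_name)       AS patient_name,     -- registration is authoritative
        COALESCE(b.mailing_address, r.mailing_address) AS mailing_address   -- billing is authoritative
 FROM   reg_patients r
 LEFT JOIN bill_patients b ON b.patient_id = r.patient_id;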




Unified Data Architecture

- Data is organized in an architecture that is searchable, organized, and able to have analytic capabilities applied to it. “A unified data architecture is any architecture which seeks to combine analytics capabilities across a plethora of data types” (Ford, Teradata). A UDA should have high-performance data access: data needs to be restructured and indexed, with a cost-based optimizer, to be fast. A unified data architecture must also be robust, with high availability; high performance means nothing if the data is not available. A database must be as close to 100% available as possible to be effective.




ACID (Atomicity, Consistency, Isolation, Durability) properties

ACID is a set of properties of transactions with a database; you need to be able to trust the data as you transact with the database. If a change is made to an account balance, ACID ensures that all of the changes are applied or none are (atomicity), that the changes are exact (consistency), that no one else can change the data at the same time you do (isolation), and that the changes are 100% recoverable across any kind of failure (durability).
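The following generic SQL sketch illustrates atomicity for a hypothetical accounts(account_id, balance) table; the table and amounts are assumptions for illustration, and some dialects write BEGIN TRANSACTION instead of START TRANSACTION.

 -- Transfer 100 from account 1 to account 2 as one atomic unit of work.
 START TRANSACTION;
 UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
 UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
 -- COMMIT makes both updates durable together; a failure or ROLLBACK before
 -- this point leaves neither update applied.
 COMMIT;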



Building a Data Warehouse: Growing pains

Integrating data – Many corporations have legacy systems that are intractable, and people are reluctant to make changes in their old systems. So a huge step is going to the legacy systems, figuring out what data you have, and figuring out how to turn application-specific data into corporate, integrated data.

- “The value of integrated data is worth the pain of dealing with unintegrated data.”

Volumes – Most IT departments never have to deal with the volume of data that goes into a data warehouse until they must build one. Under old data storage systems, old data was considered bad: it clogs the system and slows everything down, so to make the system more efficient, old data had to be dumped. But there is value in old data, and in some cases longer-term, longitudinal analysis is invaluable.

Data construction – For most development projects, the developer gets the requirements and builds the system. For data warehouses, it isn’t until after the system starts being built (and people can see what it does) that businesses can figure out what they need and what they want the system to do. The act of building the system opens the mind to the possibilities.




ETL (Extract, Transform and Load)

ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:

Extracting the data from source systems (SAP, ERP, and other operational systems). Data from different source systems is converted into one consolidated data warehouse format that is ready for transformation processing. During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often it is not possible to identify the specific subset of interest, so more data than necessary must be extracted and the identification of the relevant data is done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process.

Transforming the data may involve the following tasks:

- Applying business rules (so-called derivations, e.g., calculating new measures and dimensions)

- Cleaning (e.g., mapping NULL to 0, or "Male" to "M" and "Female" to "F")

- Filtering (e.g., selecting only certain columns to load)

- Splitting a column into multiple columns, and vice versa

- Joining together data from multiple sources (e.g., lookup, merge)

- Transposing rows and columns

- Applying any kind of simple or complex data validation (e.g., if the first three columns in a row are empty, reject the row from processing)

The emphasis is on scalability. Many long-time users of the Oracle Database are experts in programming complex data transformation logic using PL/SQL.

Loading the data into a data warehouse, data repository, or other reporting applications.
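As a rough illustration of the transform-and-load steps described above, the following generic SQL assumes a hypothetical staging table stg_patients (the raw extract) and a warehouse table dw_patient; all names and rules are illustrative.

 -- Transform and load in one pass: clean, map, derive, filter, then insert.
 INSERT INTO dw_patient (patient_id, patient_name, sex, admit_year)
 SELECT src.patient_id,
        UPPER(TRIM(src.pt_nm)),                      -- cleaning
        CASE src.sex WHEN 'Male'   THEN 'M'
                     WHEN 'Female' THEN 'F'
                     ELSE 'U' END,                   -- business-rule mapping
        EXTRACT(YEAR FROM src.admit_date)            -- derived measure
 FROM   stg_patients src
 WHERE  src.patient_id IS NOT NULL;                  -- validation / filtering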



Data Warehouse Architectures

Enterprise Data Warehouse:

Operational Data Store (ODS) – A hybrid environment in which application data is transformed into an integrated format (through Extract, Transform, Load). An ODS allows for high-performance processing, such as update processing, and protects the data warehouse from update processing.

Exploration warehouse – Where end users who want to do analytics and discovery processing go. Exploration warehouses support business heuristics, tend to hold data on a project basis, and do the heavy processing of the data (protecting the data integrity of the data warehouse).

Data mart – Customized to end-user preferences for how data should be presented, often for a group within a department; for example, finance, marketing, and accounting may each have their own individual data mart.


Data Mart Data Warehouse

Instead of building one big system, the data mart approach focuses on creating many small systems. A data mart is a specialty- or provider-specific collection of data, customized to end-user preferences for how data should be presented, often for a group within a department; for example, the finance and marketing departments might each have their own individual data mart. Because each mart is configured for its own end users, it is difficult to reconcile different departmental answers. As more data marts are added, more legacy data extracts must be added, and any change must echo throughout all of the marts. Most new data marts must be built from scratch, as there is no realistic way for a new mart to leverage previous work on other data marts.
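As a minimal sketch of a departmental mart, the generic SQL below assumes a hypothetical warehouse table dw_charges(charge_date, department, amount); a finance mart might keep only the monthly summary it cares about. (CREATE TABLE ... AS SELECT is not supported by every database.)

 -- Build a small, finance-specific mart from the central warehouse.
 CREATE TABLE finance_mart_monthly AS
 SELECT EXTRACT(YEAR FROM charge_date)  AS charge_year,
        EXTRACT(MONTH FROM charge_date) AS charge_month,
        SUM(amount)                     AS total_charges
 FROM   dw_charges
 GROUP BY EXTRACT(YEAR FROM charge_date), EXTRACT(MONTH FROM charge_date);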


Late Binding Data Warehouse

A newer model built to cater to healthcare’s unique needs; the previous two examples are considered “early-binding” models. By binding data later, this model delays the application of rules attached to the data for as long as possible, which provides health systems with the advantage of time: enterprises can revise the data, create hypotheses, and return optimal analytics. Data brought from the source system into the data warehouse arrives in its raw form, and very little transformation occurs (only things like mapping “PT_NAME” and “PT_NM” to “patient name”). Data from the source marts are extracted to create subject-area data marts for patient populations (diabetics or asthmatics) or populations of events (OR workflow, admission workflow, discharge workflow). Binding only occurs when a specific business driver or use case calls for it. This affords data modeling flexibility; a record of all changes to vocabulary and rule bindings is kept; an iterative approach allows analysts to break detailed, technical work into manageable chunks; and the architecture allows for granular security.
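The following generic SQL sketch shows the late-binding idea: a hypothetical raw source-mart table src_encounters, loaded with almost no transformation, is bound to a vocabulary and a registry rule only inside a view, so the rule can be revised without reloading the warehouse. The diagnosis-code pattern and all names are assumptions for illustration.

 -- Bind vocabulary and the registry rule at view/query time, not at load time.
 CREATE VIEW diabetic_registry AS
 SELECT e.pt_nm        AS patient_name,    -- late vocabulary binding
        e.encounter_dt AS encounter_date,
        e.dx_code
 FROM   src_encounters e
 WHERE  e.dx_code LIKE 'E11%';             -- late rule binding (revisable)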


The Active Data Warehouse

An active data warehouse is a repository of any form of captured transactional data. There is difficulty with data integrity: locating data that needs to be plucked out when a transaction does not execute properly can be a problem. There must also be enough capacity to ensure that system resources will be available during peak processing periods, so active data warehouses may have a massive amount of capacity sitting unused for long periods of time, and heavy statistical processing can conflict with system resource utilization. Despite the cost of this unused capacity, all detail should be stored in the data warehouse.



The “Federated” data warehouse

In this scheme there is no independent data warehouse; rather, all the old legacy systems are glued together. You can imagine this leads to poor performance: problems arise if any of the legacy systems do not work with the others, which happens in most cases. With no single source of truth, problems also occur if data is queried twice, and the data can essentially be valueless if it is not integrated. Federation creates an environment where multiple vendors’ software must sit on top of each other and work together. Oftentimes old historical data is discarded for better performance. Users are stuck with whatever level of granularity already exists within the legacy systems; if they want a different level of granularity, they will not be able to refine (or coarsen) it to one specific standard.



The Star Schema


The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center: the center of the star consists of a fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are denormalized. Even though the star schema is the simplest architecture, it is the most commonly used today and is recommended by Oracle. One could make a quasi-data warehouse out of star schemas and data marts linked together; if needs stay the same over time this works, but as requirements change, new data tables must be developed. A star schema is built for a given set of requirements and is optimized for only one audience, so other audiences find it less than optimal. Because of this, star schemas proliferate with different levels of granularity, and to solve the problem of multiple granularities, each must be brought to the lowest level.
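The sketch below shows a small star schema in generic SQL; all table and column names are hypothetical. The fact table sits at the center, surrounded by denormalized dimension tables, and a typical query joins them in a star join.

 -- Dimension tables (the points of the star).
 CREATE TABLE dim_patient  (patient_key  INT PRIMARY KEY, patient_name VARCHAR(100), sex CHAR(1));
 CREATE TABLE dim_provider (provider_key INT PRIMARY KEY, provider_name VARCHAR(100), specialty VARCHAR(50));
 CREATE TABLE dim_date     (date_key     INT PRIMARY KEY, full_date DATE, calendar_year INT, calendar_month INT);
 -- Fact table (the center of the star).
 CREATE TABLE fact_encounter (
     patient_key   INT REFERENCES dim_patient (patient_key),
     provider_key  INT REFERENCES dim_provider (provider_key),
     date_key      INT REFERENCES dim_date (date_key),
     charge_amount DECIMAL(10, 2)
 );
 -- A typical star-join query: total charges by specialty and year.
 SELECT d.calendar_year, p.specialty, SUM(f.charge_amount) AS total_charges
 FROM   fact_encounter f
 JOIN   dim_date     d ON f.date_key     = d.date_key
 JOIN   dim_provider p ON f.provider_key = p.provider_key
 GROUP BY d.calendar_year, p.specialty;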



UDA vendors: Teradata, Oracle, and Microsoft

Teradata – A relational database management system (RDBMS) storing information (relations) in columns and rows. Virtually all relational database systems use SQL. In 2012 Teradata announced several Hadoop offerings, ranging from a straightforward Hadoop appliance to downloadable database software. With the introduction of Hadoop (open source), Teradata has had to move past its previous model of consolidated data storage, pushed by customers who need the flexibility to deal with unstructured data; this creates a more modular architecture.


Oracle – An object-relational database (ORD), or object-relational database management system (ORDBMS). It is like a relational database but also incorporates features of an object-oriented database, with objects, classes, and inheritance directly supported in the database and query language. The goal is to bridge the gap between relational and object-oriented modeling techniques, often using object-relational mapping (ORM) software. Characteristic properties of these systems are complex data, type inheritance, and object behavior.

Both Oracle and IBM offer customers the Cloudera distribution, and both have turnkey Hadoop appliances.


Microsoft SQL Server – A relational database management system. There are many versions of the product, tailored for different organizations and workloads.


Hadoop – An integrated RDBMS/HDFS combination is an emerging architecture for the modern data warehouse. For example, an emerging best practice among data warehouse professionals with Hadoop experience is to manage non-relational data in HDFS (i.e., creating a “data lake”) but to process it and move the results (via queries, ELT, or PolyBase) to RDBMSs elsewhere in the data warehouse architecture that are more conducive to SQL-based analytics. HDFS thus serves as a massive data staging area for the data warehouse: Hadoop acts as the ingestion point for the data, which is then processed and shipped to the warehouse, instead of everything being put directly into the data warehouse.
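As a rough sketch of this staging pattern, the SQL below uses SQL Server PolyBase-style external tables; the data source location, file path, and column names are assumptions for illustration, and other platforms (Hive, Spark SQL, etc.) express the same idea differently.

 -- Describe where the Hadoop cluster and raw files live; the files themselves
 -- stay in the HDFS "data lake."
 CREATE EXTERNAL DATA SOURCE hdfs_lake
 WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');
 CREATE EXTERNAL FILE FORMAT csv_format
 WITH (FORMAT_TYPE = DELIMITEDTEXT);
 CREATE EXTERNAL TABLE staged_vitals (
     patient_id INT,
     reading_ts VARCHAR(30),
     heart_rate INT
 )
 WITH (LOCATION = '/lake/vitals/', DATA_SOURCE = hdfs_lake, FILE_FORMAT = csv_format);
 -- Process the staged data and ship only the results into the relational warehouse.
 INSERT INTO dw_vitals (patient_id, reading_ts, heart_rate)
 SELECT patient_id, reading_ts, heart_rate
 FROM   staged_vitals
 WHERE  heart_rate IS NOT NULL;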


References

Inmon, Bill. A Brief History of Data Warehousing: From the Vendors Perspective (Part 1). EIM Institute, Vol. 1:3, May 2007.

Kempe, Shannon. A Short History of Data Warehousing. Dataversity, Aug. 23, 2012. Accessed Mar. 15, 2017.

Murphy, Ian. The Evolution of the Enterprise Data Warehouse, Starring Hadoop. DataInformed, July 24, 2013. Accessed Mar. 17, 2017. http://data-informed.com/the-evolutionof-the-enterprise-data-warehouse-starring-hadoop/

Teradata. How Unified Data Architecture Can Revolutionize Analytics in Government. Industry White Paper, GovLoop. Accessed Mar. 21, 2017.

Inmon, W.H., Derek Strauss, and Genia Neushloss. DW 2.0: The Architecture for the Next Generation of Data Warehousing. Elsevier, 2008.


Submitted by Kenan Katranji