Text data extraction for a prospective, research-focused data mart: implementation and validation

From Clinfowiki


Electronic health records generate a large amount of data used for both clinical and research purposes. Data is stored in both structured and unstructured formats; the unstructured data is free text such as clinical notes and clinical assessments. This textual data has been difficult to aggregate in the enterprise data warehouse for research or for studies aimed at improving patient outcomes, and most of it is mined through labor-intensive manual chart review. This study created a pulmonary function test data mart from textual data and compared the quality of its data with that of a manual chart review.[1]


The study was limited to pulmonary function test parameters within a single diagnosis and registry. One data analyst entered the data for the data mart. Enterprise Data Warehouse (EDW) report writers extracted the reports and delivered them using SQL Server Reporting Services.


The pulmonary function test data mart was built using an extract, transform, and load (ETL) process implemented in SQL Server Integration Services. Reports from the data mart were validated against manual chart abstraction. Validation found 6 mistakes in 1100, and those mistakes were made during data entry.
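The validation step can be illustrated with a short Python sketch. It is not the study's procedure (the study used SQL Server tooling). It only shows one way to compare data mart values with manually abstracted values and to compute a mismatch rate. The function name `mismatch_rate`, the patient identifiers, the parameter labels, and the sample values are all hypothetical. The only figures taken from the summary are the 6 mistakes in 1100.

```python
def mismatch_rate(extracted, abstracted):
    """Compare two {key: value} mappings on the keys they share.

    Returns (sorted list of mismatching keys, number of shared keys, rate).
    """
    shared = extracted.keys() & abstracted.keys()
    mismatches = sorted(k for k in shared if extracted[k] != abstracted[k])
    total = len(shared)
    rate = len(mismatches) / total if total else 0.0
    return mismatches, total, rate


# Hypothetical values keyed by (patient id, parameter); not study data.
data_mart = {("pt1", "FVC"): 3.10, ("pt1", "FEV1"): 2.45, ("pt2", "FVC"): 2.80}
chart_review = {("pt1", "FVC"): 3.10, ("pt1", "FEV1"): 2.54, ("pt2", "FVC"): 2.80}

mismatches, total, rate = mismatch_rate(data_mart, chart_review)
print(mismatches, total)

# The figure reported in the summary, expressed as a rate.
reported_rate = 6 / 1100
print(f"{reported_rate:.2%}")  # prints 0.55%
```

Expressed this way, the reported 6 mistakes in 1100 correspond to a rate of roughly 0.55%.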


Collaboration between the data warehouse architects and the clinical researchers produced a process for abstracting structured data from the textual data in an electronic medical record. The resulting tool, which assists in translating and creating data, was more efficient and scalable than manual chart extraction.

Area of interest and comments

An SQL Server Integration Services package can be used to abstract discrete data from textual reports produced by a number of computer-generated sources. This raises further questions about how far data can be extracted and mined to improve patient outcomes and the quality of patient care. The study reported high reliability and validity. The SQL Server Integration Services package has also been used to create cardiac catheterization and echocardiography data marts.
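The idea of abstracting discrete values from a computer-generated textual report can be sketched in a few lines of Python. This is an illustration only and does not reproduce the study's Integration Services package. The report layout (one "LABEL: value unit" entry per line), the chosen labels (FVC, FEV1, DLCO), and the function name `extract_parameters` are assumptions made for the example.

```python
import re

# Assumed layout: each parameter sits on its own line as "LABEL: value unit".
PATTERN = re.compile(
    r"^[ \t]*(FVC|FEV1|DLCO)[ \t]*:[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*(\S+)?[ \t]*$",
    re.MULTILINE,
)


def extract_parameters(report_text):
    """Return {label: (numeric value, unit)} for each recognised line."""
    result = {}
    for label, value, unit in PATTERN.findall(report_text):
        result[label] = (float(value), unit)
    return result


# Hypothetical report text; not taken from the study.
sample_report = """PULMONARY FUNCTION REPORT
FVC: 3.10 L
FEV1: 2.45 L
DLCO: 18.2 mL/min/mmHg
Interpretation: mild restriction
"""

params = extract_parameters(sample_report)
for label, (value, unit) in params.items():
    print(label, value, unit)
```

Pattern matching of this kind works because computer-generated reports tend to have a fixed layout. Free-text lines such as the interpretation are simply skipped.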


  1. Hinchcliff, M., Just, E., Podlusky, S., Varga, J., et al. (2012) Text data extraction for a prospective, research-focused data mart: implementation and validation. BMC Medical Informatics and Decision Making. 12(106). http://www.ncbi.nlm.nih.gov/pubmed/22970696/