EHR Data Quality
Assessing the quality of electronic health record (EHR) data helps establish its “fitness for use” in secondary research.
Electronic health records collect clinical data to support patient care. EHR data also have many potential secondary uses, such as billing, clinical quality assessment, and research. Because EHR data are not collected as systematically as research data, there are concerns about the validity and reproducibility of research that relies on them.
Hierarchy of Data Quality Assessment
In general, individual EHR data quality constructs, or dimensions, are assessed for veracity (verification) and validity (validation). Veracity concerns internal validity: whether the data conform to metadata constraints and local knowledge. Validity concerns external validity: whether the data agree with external gold standards or knowledge.
Wang and Strong proposed 4 categories to classify data quality constructs: intrinsic, contextual, representational, and accessibility. Applied to EHR data, intrinsic data quality includes concordance, correctness, and plausibility. Contextual data quality consists of completeness, currency, granularity, and signal-to-noise. Representational data quality comprises fragmentation and structuredness.
Data Quality Construct Categories
- Intrinsic data quality: quality of data independent of its intended use, such as accuracy, believability, objectivity, and reputation of data
- Contextual data quality: quality of data for its intended task
- Representational data quality: format (concise and consistent representation) and meaning (interpretability) of data
- Accessibility: data accessible (available, retrievable, up-to-date) to data consumers
Data Quality Constructs/Dimensions
- Completeness: complete representation of a patient in the EHR, with sufficient data for each patient, each health concept, and each measurement time
- Correctness: accuracy of EHR data elements in representing the patient
- Concordance (consistency): agreement between EHR data elements
- Plausibility (validity): agreement between EHR data elements and general medical knowledge. Plausibility can be further categorized into uniqueness, atemporal, and temporal plausibility:
  - Uniqueness plausibility: each individual patient is represented by a single, non-duplicated record
  - Atemporal plausibility: data values and distributions agree with internal and external knowledge or reference standards
  - Temporal plausibility: the sequence of values conforms to expected temporal properties
- Currency: measurements are recent enough to represent the patient's state at a given point in time
- Granularity: sufficient information in representing a health concept
- Fragmentation: the EHR data elements that make up a complete concept are stored in multiple locations
- Signal-to-noise: “information of interest can be distinguished from irrelevant data”
- Structuredness: “data recorded in a format that enables reliable extraction”
- Conformance: compliance of EHR data elements with “internal or external formatting, relational, or computational definitions.” Conformance can be further categorized into value, relational, and computational conformance:
  - Value conformance: agreement of EHR data elements with the constraints of the data architecture (e.g., data type, acceptable range of values, etc.)
  - Relational conformance: agreement of EHR data with the structural constraints of the database architecture (e.g., primary and foreign key relationships)
  - Computational conformance: agreement between a calculated value and the corresponding value entered elsewhere (e.g., BMI calculated from weight and height agrees with the entered BMI)
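The three kinds of conformance can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Vitals` record, its field names, the acceptable ranges, and the BMI tolerance are hypothetical and do not come from any particular EHR system or from the cited frameworks.

```python
from dataclasses import dataclass
from typing import Set

# Assumed (hypothetical) acceptable ranges; a real project would take
# these from its own data dictionary.
WEIGHT_KG_RANGE = (0.5, 500.0)
HEIGHT_M_RANGE = (0.3, 2.5)


@dataclass
class Vitals:
    """Hypothetical vitals row; field names are illustrative only."""
    patient_id: str
    weight_kg: float
    height_m: float
    bmi_entered: float


def value_conforms(row: Vitals) -> bool:
    """Value conformance: expected data type and acceptable range."""
    checks = (
        (row.weight_kg, WEIGHT_KG_RANGE),
        (row.height_m, HEIGHT_M_RANGE),
    )
    return all(
        isinstance(value, (int, float)) and low <= value <= high
        for value, (low, high) in checks
    )


def relationally_conforms(row: Vitals, known_patient_ids: Set[str]) -> bool:
    """Relational conformance: the foreign key refers to an existing patient."""
    return row.patient_id in known_patient_ids


def computationally_conforms(row: Vitals, tolerance: float = 0.5) -> bool:
    """Computational conformance: BMI recomputed from weight and height
    agrees with the entered BMI within an assumed tolerance."""
    computed = row.weight_kg / row.height_m ** 2
    return abs(computed - row.bmi_entered) <= tolerance


# Made-up example data.
patients = {"p1", "p2"}
good = Vitals("p1", weight_kg=70.0, height_m=1.75, bmi_entered=22.9)
# Height entered in centimetres by mistake, and an unknown patient ID.
bad = Vitals("p9", weight_kg=70.0, height_m=175.0, bmi_entered=30.0)
```

In this sketch the `good` row passes all three checks, while the `bad` row fails each of them: its height is out of range, its patient ID is not in the patient table, and its entered BMI does not match the recomputed value.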
Methods of Data Quality Assessment
Data quality is task-dependent: how constructs are operationalized, and which method(s) are used to assess them, should depend on the dataset and the intended research. A literature review by Weiskopf and Weng identified 7 general methods used to assess data quality:
- Gold standard: comparison of an EHR data element to a gold standard (e.g., data from one or multiple alternative sources)
- Data element agreement: concordance between data elements within EHR
- Element presence: presence of desired or expected data
- Data source agreement: concordance between an EHR data element and data from another source
- Distribution comparison: comparison of aggregated data to external sources (e.g., disease prevalence)
- Validity check: face validity (plausibility) of value
- Log review: review of metadata
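Several of these methods reduce to simple checks over a set of records. The following Python sketch shows one possible reading of element presence, data element agreement, validity check, and distribution comparison. The records, field names, age limit, and reference prevalence are all made up for illustration, and the rules are deliberately simplistic; they are not the procedures used in the cited review.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

# Made-up records; field names are hypothetical.
records: List[Record] = [
    {"patient_id": "p1", "age": 34, "billing_dx": "E11",
     "problem_list_dx": "E11", "hba1c": 8.1},
    {"patient_id": "p2", "age": 210, "billing_dx": "E11",
     "problem_list_dx": "I10", "hba1c": None},
    {"patient_id": "p3", "age": 61, "billing_dx": "I10",
     "problem_list_dx": "I10", "hba1c": 5.4},
]


def element_presence(rows: List[Record], field: str) -> float:
    """Element presence: fraction of records in which the field is recorded."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)


def data_element_agreement(row: Record) -> bool:
    """Data element agreement: two elements within the same record agree.
    Here, a simplified rule that the billing diagnosis matches the problem list."""
    return row["billing_dx"] == row["problem_list_dx"]


def validity_check(row: Record, max_age: int = 120) -> bool:
    """Validity check: the value is plausible on its face (assumed age limit)."""
    return 0 <= row["age"] <= max_age


def distribution_gap(rows: List[Record], field: str, value: Any,
                     reference_prevalence: float) -> float:
    """Distribution comparison: observed proportion minus an external
    reference proportion (supplied by the caller)."""
    observed = sum(1 for r in rows if r[field] == value) / len(rows)
    return observed - reference_prevalence


presence = element_presence(records, "hba1c")
agreement = [data_element_agreement(r) for r in records]
validity = [validity_check(r) for r in records]
# 0.10 is a placeholder reference prevalence, not a real statistic.
gap = distribution_gap(records, "billing_dx", "E11", 0.10)
```

On these made-up records, the second record fails both the agreement check and the validity check, and it is also the one record with no HbA1c value. In practice, the thresholds and the choice of which elements to compare would follow from the dataset and the research question.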
References
- Kahn MG, Callahan TJ, Barnard J, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC). 2016;4(1):1244. Published 2016 Sep 11. doi:10.13063/2327-9214.1244.
- Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst. 1996;12(4):5-34. doi:10.1080/07421222.1996.11518099.
- Weiskopf NG, Bakken S, Hripcsak G, Weng C. A data quality assessment guideline for electronic health record data reuse. EGEMS (Wash DC). 2017;5(1):14. Published 2017 Sep 4. doi:10.5334/egems.218.
- Lee K, Weiskopf N, Pathak J. A framework for data quality assessment in clinical research datasets. AMIA Annu Symp Proc. 2018;2017:1080-1089. Published 2018 Apr 16.
- Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144-151. doi:10.1136/amiajnl-2011-000681.
Submitted by Frances Hsu.