http://clinfowiki.org/wiki/api.php?action=feedcontributions&user=Jindala&feedformat=atomClinfowiki - User contributions [en]2024-03-29T04:44:48ZUser contributionsMediaWiki 1.22.4http://clinfowiki.org/wiki/index.php/Mining_Electronic_Health_Record_DataMining Electronic Health Record Data2020-05-06T22:02:25Z<p>Jindala: </p>
<hr />
<div>==Introduction==<br />
Data mining is a means of leveraging very large data sets (Big Data) to discover meaningful information. In general, it is used to find patterns that are difficult to see using traditional data analysis techniques. Data mining uses techniques from artificial intelligence and statistics, especially machine learning. <br />
<br />
Traditional database reports or queries only report what is already in the database. Online analytical processing (OLAP) can be used by an analyst to test a hypothesis; however, the analyst must first formulate that hypothesis (3). The advantage of data mining techniques over OLAP is the ability to find patterns in the data without formulating an initial hypothesis, possibly revealing relationships that an analyst would never have considered.<br />
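<br />
As a rough illustration of hypothesis-free pattern finding, the sketch below counts how often pairs of diagnosis codes co-occur across patients and keeps the pairs that appear often enough. The records, the function name <code>frequent_pairs</code>, and the <code>min_support</code> threshold are all hypothetical; real data mining tools use far more scalable algorithms.<br />

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(records, min_support):
    """Return code pairs that co-occur in at least `min_support` records."""
    counts = Counter()
    for codes in records:
        # Sort and de-duplicate so each unordered pair is counted under one key.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical diagnosis codes per patient (illustrative only).
records = [
    ["E11", "I10", "N18"],
    ["E11", "I10"],
    ["I10", "J45"],
    ["E11", "N18"],
]
pairs = frequent_pairs(records, min_support=2)
```

No hypothesis about which codes are related is supplied up front; the pairs that survive the threshold are simply whatever the data contains.<br />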
<br />
Data mining typically draws on large databases or data warehouses, which often rely on newer storage technologies such as Hadoop, NoSQL, and NewSQL.<br />
<br />
In electronic health records, data mining has been used to determine the effectiveness of treatment, find patterns among similar patients, interpret genomic data, and optimize clinical decision support (2,4,6,7).<br />
<br />
==Types of Data==<br />
Generally speaking, data mining requires a "flat file" or vector data format rather than relational tables or objects (4). This means that all the data for each case (observation) appears as a single record rather than being spread across normalized relational tables. Data can be numerical or categorical (either ordinal or nominal).<br />
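<br />
The following minimal sketch shows one way normalized tables might be flattened into one record per patient. The table layouts, column names, and the <code>flatten</code> helper are assumptions made for illustration and do not correspond to any particular EHR schema.<br />

```python
def flatten(patients, labs, diagnoses):
    """Build one flat record per patient from three normalized tables."""
    rows = {
        p["patient_id"]: {"patient_id": p["patient_id"], "age": p["age"]}
        for p in patients
    }
    for lab in labs:
        # Each lab test becomes its own numerical column, e.g. lab_hba1c.
        rows[lab["patient_id"]]["lab_" + lab["test"]] = lab["value"]
    for dx in diagnoses:
        # Each diagnosis code becomes a 0/1 indicator (categorical) column.
        rows[dx["patient_id"]]["dx_" + dx["code"]] = 1
    columns = sorted({c for r in rows.values() for c in r})
    # Give every record the same columns: 0 for absent diagnoses, None otherwise.
    return [
        {c: r.get(c, 0 if c.startswith("dx_") else None) for c in columns}
        for r in rows.values()
    ]


# Hypothetical normalized tables (illustrative only).
patients = [{"patient_id": 1, "age": 54}, {"patient_id": 2, "age": 61}]
labs = [
    {"patient_id": 1, "test": "hba1c", "value": 7.2},
    {"patient_id": 2, "test": "hba1c", "value": 5.6},
    {"patient_id": 1, "test": "ldl", "value": 130},
]
diagnoses = [{"patient_id": 1, "code": "E11"}]

flat = flatten(patients, labs, diagnoses)
```

Each element of <code>flat</code> is a single vector-like record with identical columns, which is the shape most mining algorithms expect.<br />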
<br />
The Meaningful Use program has led to the collection of a Common Clinical Data Set (CCDS) across most providers, and this data is generally available in the EHR. Common clinical information includes demographics, diagnoses, problem lists, family history, allergies, immunizations, medications, procedures, lab values and orders, vital signs, radiology and other reports, cost and billing, genetic information, and social data, among many others. Data may be free-form text (non-semantic data) in continuity of care documents or notes, or it may be structured and semantic, such as ICD diagnostic codes, lab data, drug information, and digital images with radiology data. Data can also be administrative, including scheduling or billing information.<br />
<br />
Key data format standards exist for some of these data, but not all.<br />
<br />
Patient identifiers: generally there is a unique patient ID, such as a medical record number (MRN), which may be linked to a statewide or health-network-wide health information exchange.<br />
<br />
Diagnoses: International Classification of Diseases (ICD), Systematized Nomenclature of Medicine (SNOMED)<br />
<br />
Medications: National Drug Codes (NDCs), RxNorm, Systematized Nomenclature of Medicine (SNOMED)<br />
<br />
Procedures: International Classification of Diseases, Clinical Modification (ICD-CM), Current Procedural Terminology (CPT), Healthcare Common Procedure Coding System (HCPCS)<br />
<br />
Lab data: Logical Observation Identifiers Names and Codes (LOINC), Systematized Nomenclature of Medicine (SNOMED), Current Procedural Terminology (CPT)<br />
<br />
Vital signs: Logical Observation Identifiers Names and Codes (LOINC)<br />
<br />
<br />
==Potential Applications==<br />
<br />
The ultimate purpose of data mining is to provide new information which can turn into new knowledge. This knowledge can be used to gain a competitive advantage, earn a greater profit, improve healthcare outcomes, provide better services, advance scientific knowledge, or increase efficiency of system processes.<br />
<br />
It can be used to '''assess risk'''. Berrouiguet et al. (2019) analysed a database of health records from 2802 suicide attempters to identify risk factors. They used clustering methods to identify similar patients and regression trees to estimate the number of suicide attempts. They then used this information to propose '''clinical decision support'''. <br />
<br />
Integrative genomics, that is, genomics viewed in the context of physiology, ecology, and evolution to better understand phenotype, has been a prominent application of data mining. Here, data mining combines molecular data with deep learning, cluster analysis, gene-set enrichment analysis, and random forests, among other approaches. It has also been used to discover new functions by mining genomic sequence data. <br />
<br />
It can be used to '''identify''' an event. For example, Sundermann et al. validated a data mining approach by identifying the correct route of transmission and assessing preventable cases in hospital outbreaks. They concluded that "Correct routes were identified for all outbreaks at the second patient, except for one outbreak involving >1 transmission route that was detected at the eighth patient. Up to 40 or 34 infections (78% or 66% of possible preventable infections, respectively) could have been prevented if data mining had been implemented in real time, assuming the initiation of an effective intervention within 7 or 14 days of identification of the transmission route, respectively." (Sundermann et al. 2019). <br />
<br />
It has been used to fill in '''missing information''' in the record by analyzing patterns. Groenhof et al. used data mining techniques to identify 1661 participants who did not have a smoking status recorded in the EHR. They compared the results with a questionnaire that all of these patients had received and found that "Diagnostic accuracy for current smoking was sensitivity 88%, specificity 92%, NPV 98%, and PPV 63%. From false positives, 85% reported they had quit smoking at the time of the UCC" (Groenhof et al. 2019).<br />
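<br />
The accuracy measures quoted above follow from a standard 2x2 comparison of predicted status against a reference such as a questionnaire. The sketch below shows the usual definitions; the counts are hypothetical and are not taken from the study.<br />

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Compute standard accuracy measures from 2x2 confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # predicted positives that are correct
        "npv": tn / (tn + fn),          # predicted negatives that are correct
    }


# Hypothetical counts (illustrative only, not the study's data).
metrics = diagnostic_accuracy(tp=80, fp=20, fn=20, tn=380)
```

Sensitivity and specificity describe the method itself, whereas PPV and NPV also depend on how common the condition is in the sample, which is one reason a high NPV can sit alongside a much lower PPV.<br />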
<br />
'''Mining using CCDs versus EHR Tables'''<br />
CCDs (Continuity of Care Documents) are vendor-supplied, highly structured documents that offer a compact and convenient way to exchange information. They are generally based on the HL7 and ASTM International standards. They are useful for the explicit purposes they were designed for, such as exchanging medication lists or problem lists. EHR tables, by contrast, allow direct-access data mining, which opens up the full set of vitals, lab orders, notes, observations, diagnoses, and other data for analysis. Furthermore, a CCD may be mapped only to a single template of information, whereas EHR tables can capture all variations of the data. (HealthITAnalytics 2014)<br />
<br />
<br />
==Challenges==<br />
<br />
The greatest challenges in data mining revolve around the interaction between people and information systems. Ownership of information is an evolving discussion that is not limited to data mining. Information that is collected and analyzed is not usually directly owned by the informatics departments that analyze it (3). Depending on context, ownership claims can be made by the patient, the EHR system, the clinician, and the data miner. Subsystems that are integrated into the EHR can also have claims. <br />
<br />
Privacy is a major concern in this context. As massive amounts of data become available to study, the potential for massive data exposure increases. With great power comes great responsibility. Ethical concerns are ever present. Data generated from critically ill patients during clinical care is almost always collected without explicit consent. Data mining itself requires data preparation, which can compromise patient confidentiality and privacy (3). A common way of mitigating this problem is data aggregation. Researchers are often caught between wanting the flexibility to analyze information and the need for information security.<br />
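<br />
One simple form of aggregation is to release only group counts and to withhold groups that are too small to hide an individual. The sketch below illustrates this under stated assumptions: the field names, the <code>aggregate_counts</code> helper, and the minimum cell size of 5 are hypothetical choices, not requirements drawn from the sources cited here.<br />

```python
from collections import Counter


def aggregate_counts(records, keys, min_cell_size=5):
    """Count records per group; suppress groups smaller than min_cell_size."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    # Small groups are reported as None because they could single out a patient.
    return {
        group: (n if n >= min_cell_size else None)
        for group, n in counts.items()
    }


# Hypothetical patient-level records (illustrative only).
records = (
    [{"age_band": "60-69", "dx": "E11"}] * 6
    + [{"age_band": "30-39", "dx": "E11"}] * 2
)
table = aggregate_counts(records, keys=("age_band", "dx"))
```

The trade-off described above is visible here: the aggregated table is safer to share, but the patient-level detail needed for flexible analysis is gone.<br />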
<br />
Challenges in data mining can also be technical. Data collection formats can vary widely between vendors, states, and specialties (3). Health data standards vary, and for some forms of data they do not exist. Clinical context is often lost, which limits interpretation of the data values. Unstructured data such as progress note prose, still valued by clinicians, is difficult to store semantically. <br />
<br />
==Recent Advances in EHR Big Data Mining==<br />
Spatio-temporal Data Mining: as mobile devices with GPS and body position sensors become widely available, spatio-temporal data becomes more prevalent. This data can be used in public health applications, including studying the effects of population movements during an epidemic (7).<br />
<br />
mHealth: Patient-provided data in the EHR is gaining traction. Patients can enter information including home glucose readings, blood pressure measurements, problem lists, medications, demographic and social information, and communications data.<br />
<br />
==References==<br />
<br />
1. Berrouiguet, Sofian, et al. “An Approach for Data Mining of Electronic Health Record Data for Suicide Risk Management: Database Analysis for Clinical Decision Support.” JMIR Mental Health, vol. 6, no. 5, 7 May 2019, p. e9766, 10.2196/mental.9766. Accessed 25 Mar. 2020.<br />
<br />
2. Darst, Burcu, et al. “Data Mining and Machine Learning Approaches for the Integration of Genome-Wide Association and Methylation Data: Methodology and Main Conclusions from GAW20.” BMC Genetics, vol. 19, no. S1, Sept. 2018, 10.1186/s12863-018-0646-3. Accessed 24 Apr. 2020.<br />
<br />
3. Denaxas, Spiros C., et al. “The Tip of the Iceberg: Challenges of Accessing Hospital Electronic Health Record Data for Biological Data Mining.” BioData Mining, vol. 9, no. 1, 22 Sept. 2016, 10.1186/s13040-016-0109-1. Accessed 7 Dec. 2019.<br />
<br />
4. Groenhof, T. Katrien J., et al. “Data Mining Information from EHRs Produced High Yield and Accuracy for Current Smoking Status.” Journal of Clinical Epidemiology, Nov. 2019, 10.1016/j.jclinepi.2019.11.006. Accessed 19 Nov. 2019.<br />
<br />
5. HealthITAnalytics. “Using the EHR to Dive into Data Mining, Clinical Analytics.” HealthITAnalytics, 17 Sept. 2014, healthitanalytics.com/news/using-the-ehr-to-dive-into-data-mining-clinical-analytics. Accessed 24 Apr. 2020.<br />
<br />
6. Lobb, Briallen, and Andrew C Doxey. “Novel Function Discovery through Sequence and Structural Data Mining.” Current Opinion in Structural Biology, vol. 38, June 2016, pp. 53–61, 10.1016/j.sbi.2016.05.017. Accessed 12 Mar. 2020.<br />
<br />
7. Sundermann, Alexander J., et al. “Automated Data Mining of the Electronic Health Record for Investigation of Healthcare-Associated Outbreaks.” Infection Control & Hospital Epidemiology, vol. 40, no. 3, 18 Feb. 2019, pp. 314–319, 10.1017/ice.2018.343. Accessed 24 Apr. 2020.<br />
<br />
<br />
Submitted by Atin Jindal<br />
[[Category:BMI512-SPRING-20]]<br />
<br />
--[[User:Jindala|Jindala]] ([[User talk:Jindala|talk]]) 19:12, 24 April 2020 (UTC)</div>Jindala
http://clinfowiki.org/wiki/index.php/Basic_statistical_conceptsBasic statistical concepts2020-04-21T18:22:52Z<p>Jindala: </p>
<hr />
<div>==Statistical principles==<br />
In his autobiography, Mark Twain identified three types of lies: “lies, damned lies, and statistics.” (Twain, 2012) Statistical misrepresentation has continued into modern medicine as well. In an interview with [http://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-science/308269/ ''The Atlantic''], Dr John Ioannidis had this to say about current scientific rigor and bias: “At every step in the process, there is room to distort results, a way to make a stronger claim or to select what is going to be concluded.” (Freedman, 2010) In September 2014, JAMA published a review of re-analyses of randomized clinical trial data; 35% of the re-analyses reached conclusions that differed from the original publication. (Ebrahim et al., 2014)<br />
With medicine’s current focus on delivering improved quality of care through evidence-based medicine ([[EBM]]), clinicians should have a basic understanding of key statistical concepts.<br />
<br />
==Sensitivity==<br />
At its most basic, sensitivity is a measure of how well a test finds the people who have the disease among everyone tested. In a test with low sensitivity, there will be people in the screened population who have a negative test despite having the disease. In a highly sensitive test, nearly everyone in the tested population who has the disease will have a positive test.<br />
<br />
{| class="wikitable"<br />
|+2 x 2 table<br />
|-<br />
|<br />
|Disease +<br />
|Disease -<br />
|-<br />
|Test +<br />
|True + (A)<br />
|False + (B)<br />
|-<br />
|Test -<br />
|False - (C)<br />
|True - (D)<br />
|}<br />
<br />
Using the 2 x 2 table above, you can calculate the sensitivity of a test by dividing the number of true positives by the total number of people with the disease. Expressed mathematically, the sensitivity is A/(A+C).<br />
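<br />
As a minimal sketch of the A/(A+C) calculation, the function below takes the true positive and false negative counts from the 2 x 2 table. The function name and the counts are hypothetical and chosen only for illustration.<br />
<br />
```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Sensitivity = A / (A + C): share of diseased people the test flags."""
    diseased = true_pos + false_neg
    if diseased == 0:
        raise ValueError("the table contains no one with the disease")
    return true_pos / diseased


# Hypothetical counts: 90 true positives (A) and 10 false negatives (C).
print(sensitivity(true_pos=90, false_neg=10))  # 0.9
```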
<br />
Applying this to clinical research, one could look at prostate-specific antigen (PSA). PSA goes up in cases of prostate cancer, so it has been used as a screening test. Initially, a PSA level of 4 ng/ml or higher was considered a positive screen. Because cases of prostate cancer were missed with this cutoff, there was a discussion of lowering the threshold to 2.5 ng/ml. This would decrease the number of missed patients with prostate cancer, thereby increasing the sensitivity of the test. (Welch, Schwartz, & Woloshin, 2005) Not only was the level not lowered to 2.5 ng/ml, the [http://www.uspreventiveservicestaskforce.org/Page/Document/RecommendationStatementFinal/prostate-cancer-screening#citation16 US Preventive Service Task Force] recommended against PSA-based screening altogether. This has to do in part with the “specificity” of the PSA screen.<br />
<br />
==Specificity==<br />
Whereas sensitivity focused on not missing cases of the disease, specificity focuses on not reporting that a patient has the disease when he does not. A test with low specificity, such as PSA, has many positive test results where the individual does not have the disease. The newborn screen for phenylketonuria is over 99% specific. (Kwon & Farrell, 2000)<br />
Specificity is calculated as true negatives divided by the sum of true negatives and false positives. Referring back to the 2 x 2 table in the Sensitivity section, specificity is D/(B+D).<br />
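<br />
The D/(B+D) calculation can be sketched the same way. The counts below are hypothetical; the false positive rate used later for predictive values is simply 1 minus the specificity.<br />
<br />
```python
def specificity(true_neg: int, false_pos: int) -> float:
    """Specificity = D / (B + D): share of disease-free people the test clears."""
    disease_free = true_neg + false_pos
    if disease_free == 0:
        raise ValueError("the table contains no one without the disease")
    return true_neg / disease_free


# Hypothetical counts: 950 true negatives (D) and 50 false positives (B).
spec = specificity(true_neg=950, false_pos=50)
print(spec)  # 0.95
print(f"false positive rate: {1 - spec:.2f}")
```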
==Bayes’ Theorem==<br />
In the 18th century, Rev. Thomas Bayes developed a theory regarding conditional probabilities. Among other applications, this has been used to interpret the sensitivity and specificity of a diagnostic test. The two most commonly derived values from it are the positive predictive value (PPV) and negative predictive value (NPV). Putting the positive predictive value in clinical terms; what is the probability that the patient has the disease given a positive test? The negative predictive value represents the probability that the patient does not have the disease given a negative test.<br />
==Positive predictive value==<br />
Representing this in plain language, the PPV is the prevalence of the disease multiplied by the sensitivity of the test; you then divide that number by itself plus the proportion of patients not having the disease multiplied by the false positive rate. In the following formula, T represents the test and D represents the disease; the + and – indicate whether each is positive or negative. The notation P(T-|D+) is read as the probability that the test is negative given that the disease is present; this is equal to 1 – sensitivity.<br />
<br />
PPV = (P(D+) * P(T+|D+)) / (P(D+) * P(T+|D+) + P(D-) * P(T+|D-))<br />
<br />
To simplify this even further, the positive predictive value goes up when the test is more sensitive, the disease is more prevalent, and the false positive rate is low.<br />
<br />
Returning to our example of the highly specific PKU newborn screen, although the test is highly specific, the prevalence is low. This leads to a relatively low positive predictive value despite the high specificity.<br />
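<br />
A minimal sketch of the PPV formula follows. The function name and the prevalence, sensitivity, and specificity values are assumptions chosen for illustration, not published figures for any real screen. It shows how a rare disease gives a low PPV even with a very specific test.<br />
<br />
```python
def positive_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """PPV = P(D+)P(T+|D+) / (P(D+)P(T+|D+) + P(D-)P(T+|D-))."""
    true_pos = prevalence * sensitivity                # P(D+) * P(T+|D+)
    false_pos = (1 - prevalence) * (1 - specificity)   # P(D-) * P(T+|D-)
    return true_pos / (true_pos + false_pos)


# Hypothetical rare disease: 1 in 10,000, test 99% sensitive and 99% specific.
ppv = positive_predictive_value(0.0001, 0.99, 0.99)
print(f"PPV: {ppv:.2%}")
```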
==Negative predictive value==<br />
The negative predictive value is calculated similarly.<br />
<br />
NPV = (P(D-) * P(T-|D-)) / (P(D-) * P(T-|D-) + P(D+) * P(T-|D+))<br />
<br />
The negative predictive value increases when the prevalence of the disease is low, when the test is highly sensitive (few false negatives), and when the test is highly specific.<br />
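<br />
The NPV formula can be sketched in the same style. The prevalence values and the 90% sensitivity and specificity below are hypothetical; the loop shows the NPV falling as the disease becomes more common.<br />
<br />
```python
def negative_predictive_value(prevalence: float,
                              sensitivity: float,
                              specificity: float) -> float:
    """NPV = P(D-)P(T-|D-) / (P(D-)P(T-|D-) + P(D+)P(T-|D+))."""
    true_neg = (1 - prevalence) * specificity    # P(D-) * P(T-|D-)
    false_neg = prevalence * (1 - sensitivity)   # P(D+) * P(T-|D+)
    return true_neg / (true_neg + false_neg)


# Hypothetical test (90% sensitive, 90% specific) at three prevalence levels.
for prevalence in (0.01, 0.10, 0.50):
    npv = negative_predictive_value(prevalence, sensitivity=0.9, specificity=0.9)
    print(f"prevalence {prevalence:.0%}: NPV {npv:.1%}")
```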
<br />
==Likelihood Ratio== <br />
<br />
In plain language, the likelihood ratio is the likelihood that a given test result is expected in a patient with a variable (disease/risk), compared to the likelihood of the same result in a patient without the variable. In clinical practice, likelihood ratios are used to assess the utility of a diagnostic test and to gauge how likely it is that a patient has a disease given a particular risk or previous test result (for example, to update a pre-test probability). Both a positive and a negative likelihood ratio are calculated, corresponding to a positive and a negative test result.<br />
<br />
LR+ = sensitivity/(1-specificity)<br />
LR- = (1-sensitivity)/specificity<br />
<br />
LR+ > 10 or LR- < 0.1 are indicative of a useful test.<br />
LRs can be multiplied with pre-test odds to estimate post-test odds.<br />
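<br />
As a sketch of how the two ratios and the odds update fit together, the functions below compute LR+ and LR- and then convert a pre-test probability to a post-test probability by way of odds. The function names and the 90%/90% test with a 20% pre-test probability are hypothetical.<br />
<br />
```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, multiply by the LR, convert back."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical test: 90% sensitive, 90% specific; pre-test probability 20%.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.9, specificity=0.9)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"after a positive test: {post_test_probability(0.2, lr_pos):.1%}")
print(f"after a negative test: {post_test_probability(0.2, lr_neg):.1%}")
```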
<br />
--[[User:Jindala|Jindala]] ([[User talk:Jindala|talk]]) 18:22, 21 April 2020 (UTC)<br />
==Risk Quantification==<br />
<br />
In basic statistics, risk is quantified based on an exposure/risk/intervention x outcome contingency table.<br />
{| class="wikitable"<br />
|+2 x 2 table<br />
|-<br />
|<br />
|Outcome +<br />
|Outcome -<br />
|-<br />
|Exposure +<br />
|A<br />
|B<br />
|-<br />
|Exposure -<br />
|C<br />
|D<br />
|}<br />
<br />
'''Odds ratio'''<br />
The odds of an outcome given an exposure versus the odds of the same outcome without that exposure. Typically used in case-control studies where the outcome is known.<br />
<br />
OR = ad/bc<br />
<br />
Example: In a case-control study, 50/100 of postmenopausal women and 25/100 of premenopausal women developed osteoporosis. The OR is thus (50)(75)/(50)(25) = 3. Thus the odds of a postmenopausal woman getting osteoporosis are 3 times the odds of a premenopausal woman.<br />
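<br />
The example above can be checked with a short sketch. The function name is hypothetical; the cell labels follow the 2 x 2 exposure/outcome table.<br />
<br />
```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (a * d) / (b * c) for a 2 x 2 exposure/outcome table."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Postmenopausal: 50 with osteoporosis (a), 50 without (b).
# Premenopausal:  25 with osteoporosis (c), 75 without (d).
print(odds_ratio(a=50, b=50, c=25, d=75))  # 3.0
```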
<br />
'''Relative risk'''<br />
The risk of the outcome in an exposed group divided by the risk of the outcome in the unexposed group. Typically used in a cohort study where the exposure is known. <br />
<br />
RR = (a/(a+b)) / (c/(c+d))<br />
<br />
Example: In a cohort study, 25/100 women given estrogen vs 50/100 women not given estrogen developed osteoporosis. The RR is thus (25/(25+75)) / (50/(50+50)) = (25/100) / (50/100) = 0.5. Thus women given estrogen have a 50% relative risk reduction in developing osteoporosis.<br />
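<br />
The same cohort example as a minimal sketch. The function name is hypothetical; the cell labels follow the 2 x 2 exposure/outcome table.<br />
<br />
```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """RR = risk in the exposed group / risk in the unexposed group."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Estrogen:    25 with osteoporosis (a), 75 without (b).
# No estrogen: 50 with osteoporosis (c), 50 without (d).
print(relative_risk(a=25, b=75, c=50, d=50))  # 0.5
```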
<br />
'''Relative risk reduction'''<br />
The risk reduction that is attributable to the exposure or intervention.<br />
<br />
RRR = 1 - RR<br />
<br />
'''Absolute risk reduction'''<br />
The absolute difference in risk between exposed and unexposed groups.<br />
<br />
ARR = c/(c+d) - a/(a+b)<br />
<br />
'''Number needed to treat'''<br />
Number of patients who need to be treated for 1 patient to benefit. <br />
<br />
NNT = 1/ARR<br />
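<br />
A sketch that derives RRR, ARR, and NNT from one table, using the estrogen cohort figures from the relative risk example (25/100 vs 50/100). The function name and the returned dictionary are assumptions for illustration. NNT is rounded up here, which is a common convention because a fraction of a patient cannot be treated.<br />
<br />
```python
import math


def risk_summary(a: int, b: int, c: int, d: int) -> dict[str, float]:
    """Summarise a 2 x 2 table where row 1 is the treated (exposed) group."""
    risk_treated = a / (a + b)
    risk_control = c / (c + d)
    arr = risk_control - risk_treated          # absolute risk reduction
    if arr <= 0:
        raise ValueError("no risk reduction; NNT is not defined")
    return {
        "RR": risk_treated / risk_control,
        "RRR": 1 - risk_treated / risk_control,
        "ARR": arr,
        "NNT": math.ceil(1 / arr),
    }


summary = risk_summary(a=25, b=75, c=50, d=50)
print(summary)  # {'RR': 0.5, 'RRR': 0.5, 'ARR': 0.25, 'NNT': 4}
```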
<br />
'''Number needed to harm'''<br />
Number of patients who need to be exposed for 1 patient to be harmed.<br />
<br />
NNH = 1/AR, where AR is the attributable risk (absolute risk increase), a/(a+b) - c/(c+d)<br />
<br />
--[[User:Jindala|Jindala]] ([[User talk:Jindala|talk]]) 18:22, 21 April 2020 (UTC)<br />
==Incidence and Prevalence==<br />
Incidence is the number of new cases per unit of time.<br />
Incidence = # new cases in given time frame / population at risk<br />
<br />
Prevalence is a cross section of all current cases in the population at a given point in time.<br />
Prevalence = # total cases / total population<br />
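<br />
A small sketch of both measures with hypothetical numbers: a town of 10,000 people with 200 existing cases at the start of the year and 50 new cases during the year. It assumes that nobody recovers, dies, or moves during the year, and that existing cases are excluded from the population at risk for incidence while prevalence is taken over the whole population.<br />
<br />
```python
def incidence(new_cases: int, population_at_risk: int) -> float:
    """New cases in the time frame divided by the population at risk."""
    return new_cases / population_at_risk


def prevalence(total_cases: int, population: int) -> float:
    """All current cases divided by the population."""
    return total_cases / population


# Hypothetical town (all figures invented for illustration).
population = 10_000
existing_cases = 200
new_cases = 50

at_risk = population - existing_cases   # people who could still become a case
yearly_incidence = incidence(new_cases, at_risk)
year_end_prevalence = prevalence(existing_cases + new_cases, population)

print(f"incidence: {yearly_incidence * 1000:.1f} per 1,000 per year")
print(f"prevalence at year end: {year_end_prevalence:.1%}")
```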
<br />
--[[User:Jindala|Jindala]] ([[User talk:Jindala|talk]]) 18:22, 21 April 2020 (UTC)<br />
==Precision and Accuracy==<br />
Precision is the reliability or consistency of a test. Increased precision means less '''random''' variation, lower standard deviation.<br />
Accuracy is the trueness or validity of the test. Increased accuracy means less '''systematic''' error (bias).<br />
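<br />
One way to see the difference is to measure a sample of known value repeatedly. In the sketch below the readings and the true value of 100 are hypothetical; the bias (mean minus true value) reflects accuracy and the standard deviation reflects precision.<br />
<br />
```python
import statistics


def bias_and_spread(measurements: list[float], true_value: float) -> tuple[float, float]:
    """Return (bias, sample standard deviation) for repeated measurements."""
    bias = statistics.mean(measurements) - true_value
    spread = statistics.stdev(measurements)
    return bias, spread


# Hypothetical repeat readings of a control sample whose true value is 100.
true_value = 100.0
precise_but_inaccurate = [110.1, 109.9, 110.0, 110.2, 109.8]
accurate_but_imprecise = [92.0, 108.0, 100.0, 95.0, 105.0]

for label, readings in (("precise, inaccurate", precise_but_inaccurate),
                        ("accurate, imprecise", accurate_but_imprecise)):
    bias, spread = bias_and_spread(readings, true_value)
    print(f"{label}: bias {bias:+.1f}, standard deviation {spread:.2f}")
```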
<br />
--[[User:Jindala|Jindala]] ([[User talk:Jindala|talk]]) 18:22, 21 April 2020 (UTC)<br />
<br />
==Practical applications==<br />
'''High vs low prevalence'''<br />
<br />
For the following two situations, we’ll be looking at a hypothetical disease called [http://www.youtube.com/watch?v=_cOlJdObNlc Kirk Syndrome] which causes a dyspraxic, halting speech. Prior to 1966, the prevalence of the disease was 1 per 100,000. From 1966 on, there was a dramatic increase in prevalence likely secondary to some unknown environmental factor; the rates skyrocketed to 1 in every 1000. The screening test was 90% sensitive and 90% specific.<br />
<br />
Using the Bayes theorem with pre-1966 data:<br />
<br />
PPV = (0.9 * 0.00001) / (0.00001 * 0.9 + 0.99999 * 0.1) = 0.009%<br />
<br />
NPV = (0.9 * (1 – 0.00001)) / (0.9 * (1 – 0.00001) + 0.00001 * (1 – 0.9)) = 99.9999% <br />
<br />
When we use post-1966 data:<br />
<br />
PPV = (0.9 * 0.001) / (0.001 * 0.9 + 0.999 * 0.1) = 0.893%<br />
<br />
NPV = (0.9 * (1 – 0.001)) / (0.9 * (1 – 0.001) + 0.001 * (1 – 0.9)) = 99.989%<br />
<br />
Although there was only a slight change in the negative predictive value for this test, there was a dramatic change in the positive predictive value. In the pre-1966 data only 0.009% of people with a positive test would have the disease, but after the disease became more prevalent the figure was 0.893%, a nearly 100-fold increase.<br />
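<br />
The two scenarios can be recomputed with a short sketch. The function name is hypothetical; the prevalence, sensitivity, and specificity values are the ones given for Kirk Syndrome above.<br />
<br />
```python
def predictive_values(prevalence: float,
                      sensitivity: float,
                      specificity: float) -> tuple[float, float]:
    """Return (PPV, NPV) from Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1 - sensitivity)
    true_neg = (1 - prevalence) * specificity
    false_pos = (1 - prevalence) * (1 - specificity)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same 90% sensitive, 90% specific test at the two prevalence levels.
for label, prevalence in (("pre-1966", 1 / 100_000), ("post-1966", 1 / 1_000)):
    ppv, npv = predictive_values(prevalence, sensitivity=0.9, specificity=0.9)
    print(f"{label}: PPV {ppv:.3%}, NPV {npv:.4%}")
```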
<br />
'''High vs low sensitivity or specificity'''<br />
<br />
As time has gone forward, there has been significant pressure for early diagnosis and management of Kirk Syndrome. It turns out halting speech has increased the spread of sexually-communicable, alien diseases after warp technology was developed. The prevalence has stayed at one per 1000, but the test was tweaked to increase sensitivity to 99% with a concomitant decrease in specificity to 80%.<br />
<br />
PPV = (0.99 * 0.001) / (0.001 * 0.99 + 0.999 * 0.2) = 0.493%<br />
<br />
NPV = (0.8 * (1 – 0.001)) / (0.8 * (1 – 0.001) + 0.001 * (1 – 0.99)) = 99.999%<br />
<br />
This increase in sensitivity raised the negative predictive value by roughly 0.01 percentage points, so a negative test became slightly more reassuring, but it left only a 0.493% chance that a positive test meant that the person had the disease. The increased sensitivity raised the negative predictive value, while the accompanying loss of specificity lowered the positive predictive value.<br />
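<br />
The same result can be read off as expected counts in a screened population. The sketch below assumes a hypothetical cohort of 100,000 people and uses exact fractions so the cell counts come out as whole numbers; the function name is an assumption for illustration.<br />
<br />
```python
from fractions import Fraction


def expected_counts(population, prevalence, sensitivity, specificity):
    """Return expected (TP, FP, FN, TN) counts for a screened population."""
    diseased = population * prevalence
    disease_free = population - diseased
    true_pos = diseased * sensitivity
    false_neg = diseased - true_pos
    true_neg = disease_free * specificity
    false_pos = disease_free - true_neg
    return true_pos, false_pos, false_neg, true_neg


# Tweaked test: prevalence 1 per 1000, sensitivity 99%, specificity 80%.
tp, fp, fn, tn = expected_counts(
    100_000, Fraction(1, 1000), Fraction(99, 100), Fraction(80, 100)
)
print(f"TP {tp}, FP {fp}, FN {fn}, TN {tn}")
print(f"PPV {float(tp / (tp + fp)):.3%}")
print(f"NPV {float(tn / (tn + fn)):.4%}")
```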
<br />
<br />
==References==<br />
Ebrahim, S., Sohani, Z., Montoya, L., Agarwal, A., Thorlund, K., Mills, E., & Ioannidis, J. (2014). Reanalyses of randomized clinical trial data. JAMA, 312(10), 1024–32. doi:10.1001/jama.2014.9646<br />
<br />
Freedman, D. H. (2010, November). Lies, Damned Lies, and Medical Science. The Atlantic. Retrieved from http://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-science/308269/<br />
<br />
Kwon, C., & Farrell, P. (2000). The magnitude and challenge of false-positive newborn screening test results. Archives of Pediatrics & Adolescent Medicine, 154(7), 714–8.<br />
<br />
Twain, M. (2012). Autobiography of Mark Twain: Volume 1, Reader’s Edition. (H. E. Smith, B. Griffin, V. Fischer, M. B. Frank, S. Goetz, & L. D. Myrick, Eds.) (Reprint edition.). Berkeley, Calif.; London: University of California Press.<br />
<br />
Pagano, M. (2000). Principles of Biostatistics, Second edition. Cengage Learning.<br />
<br />
==Related posts==<br />
<br />
Evidence-based medicine: [[EBM]]<br />
<br />
Analysis of variation: [[ANOVA]]<br />
<br />
[[Category:BMI512-FALL-14]]<br />
<br />
Submitted by Benj Barsotti<br />
<br />
<br />
Edited by Atin Jindal --[[User:Jindala|Jindala]] ([[User talk:Jindala|talk]]) 18:22, 21 April 2020 (UTC)<br />
[[Category:BMI512-SPRING-20]]</div>Jindala