Big Data



Observers believe we are in the midst of an unprecedented data boom in healthcare (1, 2). Sources of such data include genomic databases, digital patient records (encounters, images in the EMR, etc.), vital sign monitors, wearable devices, administrative claims and social network posts (1). Such voluminous data is thought to be “resistant” to analysis using conventional software, hardware and statistical techniques (3, 4), and this resistance has played a role in defining ‘big data’. Experts have called for innovative thinking and new analytic strategies, grouped under the umbrella of ‘advanced analytics’, to harness big data (2). This introductory post outlines the theoretical framework, practical techniques and challenges of big data and advanced analytics in the context of healthcare.

Materials and Methods:

A search was performed using bibliographic and citation databases: Google Scholar and MEDLINE (PubMed). The key words used were ‘big data’, ‘advanced analytics’, ‘predictive analytics’, ‘medicine’ and ‘healthcare’.

Theoretical framework of big data and advanced analytics:

Big data is popularly described in terms of the 4 V’s: velocity, volume, variety and veracity (5). Advanced analytics of big data differs from the conventional tools of clinical epidemiology (2, 6). In epidemiology, priority is given to defining "causal mechanisms" that could explain known events in the wider population. Big data is geared towards identifying correlations at the present moment (‘present truth’) in a “locally stable” sample (for example, the ICU population of a particular healthcare system) (6). Big data also calls for a shift from the epidemiological emphasis on hypotheses to a more data-centric approach, giving “inductive reasoning” equal weight alongside “deductive reasoning” in research methodologies (2). Such an approach lends itself to prognostic questions (What is going to happen?) and patterning questions (What are the signs and symptoms that constitute a disease diagnosis?) (6). Big data can fill the void created by epidemiology’s inability to answer predictive questions (What will happen if it is done differently?) and prescriptive questions (Should I do something different for this particular patient?) in studies not built on causal models (6). Clinical outcomes are likely emergent phenomena arising from complex interactions among individual agents (genes, environment, diet and occupational risk factors), the so-called ‘complex adaptive systems’, as opposed to simpler, reductionist cause-effect models (7). Traditional techniques should therefore be supplemented by innovations in data science such as big data and advanced analytics (2).

Tools of advanced analytics:

a) Visual analytics: A brief introduction to visual analytics has been posted in this wiki [[1]].

b) Machine learning: Machine learning is a necessary prerequisite for the success of big data analytics (2). Considering the volume of data that can accumulate even in a small setting such as a 12-bed ICU, it is important that data be curated automatically by machines themselves (3). This has been likened to the evolution of neural networks in a human brain, hence the name.

c) Data mining: Data mining methodologies have been described in detail in the literature. An example is the “collaborative filtering methodology”, an instrument used to predict individual preferences from the preference data of a large cohort (8, 9). This has been reported as the theory behind Netflix and Amazon’s very successful personalized predictions (8).

d) Queries, Reports and Online Analytical Processing (OLAP): These are similar to the well-known queries and reports generated using database applications. However, the scale and methodology differ in big data methods. One popular data processing platform, Hadoop, is described here. Hadoop is an open source technology that stores and retrieves data using non-tabular means (1), which allows it to work with both structured and unstructured data. Hadoop breaks a computation down into smaller sub-problems and distributes them across the computing resources in a network (1). The partial results are then merged and presented to the user (1).
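The split, process and merge pattern described for Hadoop in item d) can be illustrated with a small sketch. This is plain Python, not Hadoop's own API: the function names (`split_into_chunks`, `map_chunk`, `reduce_counts`, `word_count`), the sample notes and the use of a local thread pool in place of networked machines are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split the input into roughly equal chunks, one per (hypothetical) worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count the words found in one chunk of free-text notes."""
    counts = Counter()
    for note in chunk:
        counts.update(note.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(records, n_workers=3):
    """Run the map step on each chunk in parallel, then merge the results."""
    chunks = split_into_chunks(records, n_workers)
    # Threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


# Made-up unstructured notes, used only to exercise the sketch.
notes = [
    "asthma exacerbation wheeze",
    "sepsis shock",
    "asthma wheeze",
    "shock hypotension",
]
totals = word_count(notes)
```

Because each chunk is processed independently, the map step can in principle be spread over as many machines as there are chunks; only the final merge needs to see every partial result.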
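The collaborative filtering idea mentioned in item c), predicting one individual's preference from the preferences of a larger cohort, can also be sketched in a few lines. The example below is a simple user-based neighbourhood approach with cosine similarity; it is not the method of references 8 or 9, nor the system used by Netflix or Amazon. The `ratings` data and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: each user has rated some items on a 1-5 scale.
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "ben":  {"A": 4, "B": 5, "D": 4},
    "cara": {"A": 1, "C": 5, "D": 2},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[item]
        den += sim
    # None signals that no other user has rated this item.
    return num / den if den else None


# Estimate how "ann" would rate item "D", which she has not rated.
estimate = predict(ratings, "ann", "D")
```

In this toy data "ann" rates items much as "ben" does, so the estimate is pulled towards ben's rating of item "D" rather than cara's.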


Big data tools can crunch unstructured datasets at a reduced cost, something previously unheard of in traditional epidemiological and statistical disciplines (2, 6). They have an intrinsic cost advantage because, once established, data-gathering devices can potentially gather data on a ‘permanent’ basis. Advanced analyses of big data can help us achieve the cherished goals of modern medicine, “predictive, preventive, personalized and participatory” (10), in a timely fashion (9). Big data output can help the healthcare system transition into a “learning healthcare system” characterized by constant and unimpeded exchange of information between clinical research and operations (6).


A few examples are provided below to give the reader a bird's-eye view of big data applications:

a) Predictive models in clinical care: Authors have developed a model to predict shock in the ICU based on continuous analysis of ECG waveforms recorded via telemetry (12).

b) Detecting fraud: A brief introduction to the potential for predicting and detecting fraud has been posted in this wiki [[2]].

c) Asthmapolis: A GPS-enabled tracker that monitors inhaler usage by asthmatics (13). Information is sent to a central database and used to identify individual, group and population-based trends. The data is overlaid on other data streams such as pollen count maps. This helps care providers practice personalized medicine as well as predict an upcoming exacerbation (13).
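The overlay described for Asthmapolis in item c), where inhaler-use events are combined with another data stream such as pollen counts, amounts to a join on time and place. The sketch below shows that idea only; it does not represent the actual Asthmapolis system. The record layout, grid cell names, sample values and cut-offs are assumptions chosen for illustration and have no clinical meaning.

```python
from collections import defaultdict
from datetime import date

# Hypothetical inhaler-use events: (patient id, date, map grid cell).
events = [
    ("p1", date(2015, 5, 1), "cell-7"),
    ("p1", date(2015, 5, 2), "cell-7"),
    ("p2", date(2015, 5, 2), "cell-7"),
    ("p2", date(2015, 5, 2), "cell-3"),
]

# Hypothetical pollen counts keyed by (date, grid cell).
pollen = {
    (date(2015, 5, 1), "cell-7"): 20,
    (date(2015, 5, 2), "cell-7"): 140,
    (date(2015, 5, 2), "cell-3"): 35,
}


def overlay(events, pollen):
    """Count inhaler uses per (date, cell) and attach the pollen reading."""
    uses = defaultdict(int)
    for _patient, day, cell in events:
        uses[(day, cell)] += 1
    # pollen.get returns None where no reading exists for that key.
    return {key: {"uses": n, "pollen": pollen.get(key)} for key, n in uses.items()}


def flag_hotspots(summary, min_uses=2, min_pollen=100):
    """Return keys where inhaler use and pollen both reach illustrative cut-offs."""
    return sorted(
        key for key, row in summary.items()
        if row["uses"] >= min_uses
        and row["pollen"] is not None
        and row["pollen"] >= min_pollen
    )


summary = overlay(events, pollen)
hotspots = flag_hotspots(summary)
```

Grouping by patient instead of by grid cell would give the individual-level view, while the grouping used here gives the population-level view.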


‘Big’ does not necessarily mean ‘superior’ (6). The research community currently favors epidemiological causal data over the correlative data generated by big data analytics (6). Data artifacts (“noise”) can easily creep in given the magnitude of data that can accrue over time (3). Privacy and ethical concerns are well founded given that big data sometimes comes from seemingly non-clinical sources such as social media posts, smartphones and wearable devices (9). The validity of output from big data analytic methods is also a concern, as evidenced by Google Flu’s failure to reasonably predict flu prevalence in 2013 (14). Big data tools currently require intensive programming expertise to run. Such analytical tools need to evolve towards the pull-down menu type of interface familiar to users of common operating systems and statistical software packages (1).


Big data-centric health systems are likely to benefit in terms of outcomes, improved care and reduced costs. Big data will likely not replace the time-tested traditions of clinical epidemiology. Instead, big data and advanced analytics are likely to augment epidemiological methods and fill the void where questions cannot be answered using existing epidemiological methods (6). They are likely to find their niche in the pyramid of evidence. As is true for any emerging discipline, big data can suffer from inflated expectations (4). Big data cannot solve the basic challenges of human behavior (6). There will be a need for informatics professionals trained in big data.


References:

1. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):3.

2. Krumholz HM. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff (Millwood). 2014;33(7):1163-70.

3. Simpao AF, Ahumada LM, Rehman MA. Big data and visual analytics in anaesthesia and health care. Br J Anaesth. 2015;115(3):350-6.

4. Buhl HU, Röglinger M, Moser F, Heidemann J. Big data. Business & Information Systems Engineering. 2013;5(2):65-9.

5. Beyer MA, Laney D. The importance of ‘big data’: a definition. Stamford, CT: Gartner. 2012:2014-8.

6. Iwashyna TJ, Liu V. What’s So Different about Big Data? A Primer for Clinicians Trained to Think Epidemiologically. Annals of the American Thoracic Society. 2014;11(7):1130-5.

7. Plsek PE, Greenhalgh T. Complexity science: The challenge of complexity in health care. BMJ. 2001;323(7313):625-8.

8. Bell RM, Koren Y. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. Seventh IEEE International Conference on Data Mining (ICDM 2007); 2007. IEEE.

9. Chawla NV, Davis DA. Bringing big data to personalized healthcare: a patient-centered framework. J Gen Intern Med. 2013;28(Suppl 3):S660-5.

10. Hood L, Flores M. A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory. N Biotechnol. 2012;29(6):613-24.

11. Thorpe JH, Gray EA. Big data and ambulatory care: breaking down legal barriers to support effective use. The Journal of Ambulatory Care Management. 2015;38(1):29.

12. Celi LA, Mark RG, Stone DJ, Montgomery RA. “Big data” in the intensive care unit. Closing the data loop. American Journal of Respiratory and Critical Care Medicine. 2013;187(11):1157-60.

13. Dolan B. Asthmapolis Secures FDA Clearance for Inhaler Sensor. MobiHealthNews.com. 2012.

14. Lazer D, Kennedy R, King G, Vespignani A. The parable of Google Flu: traps in big data analysis. Science. 2014;343(6176):1203-5.