Hadoop. How to handle all that data

From Clinfowiki
Revision as of 02:03, 20 May 2010 by Hoshovsk@ohsu.edu (Talk | contribs)



What is Hadoop:
Hadoop is a framework based on the concept of map-reduce [1]. It is used for "data-intensive distributed applications" [2]. The basic concept is that data can be stored easily on larger and larger hard disks, but transferring all of it to a central location to run statistical or other information-processing code (such as image processing) is extremely time-consuming. The idea, therefore, is to move the processing algorithms to the computers that hold the storage. Image processing is a good example. Imagine attempting to move all the data from every PACS to one computer and then analyzing all lung X-rays for patterns. The time needed just to transfer the images over the network would be excessive. Instead, the pattern-analysis algorithm can be sent to each PACS (or to a computer local to it), and the processing can occur there. The analyzed data (the patterns you are looking for) can then be sent back to a central spot for comparison across the images.
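The idea of running the analysis where the data lives and sending back only small results can be sketched in a few lines of plain Python. This is an illustration of the map-reduce pattern only, not Hadoop's own API: the site names, the sample pixel values, the BRIGHT threshold, and the functions map_site, reduce_counts and run_job are all hypothetical stand-ins invented for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each "site" stands in for one PACS holding its own images.
# Each image is reduced here to a (study_id, list of pixel intensities) pair.
SITES = {
    "pacs_a": [("a1", [10, 200, 220]), ("a2", [5, 8, 12])],
    "pacs_b": [("b1", [230, 240, 250]), ("b2", [15, 210, 20])],
}

BRIGHT = 200  # assumed threshold standing in for a real pattern detector


def map_site(images):
    """Map step: runs where the data lives and returns only a small summary."""
    counts = Counter()
    for _study_id, pixels in images:
        has_pattern = any(p >= BRIGHT for p in pixels)
        counts["with_pattern" if has_pattern else "without_pattern"] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two per-site summaries at the central location."""
    return left + right


def run_job(sites):
    # In a real cluster each map_site call would execute on a different machine;
    # here they run in sequence purely to show the data flow.
    partials = [map_site(images) for images in sites.values()]
    return reduce(reduce_counts, partials, Counter())


totals = run_job(SITES)
print(dict(totals))
```

Only the per-site counts cross the network in this sketch; the images themselves never leave the site that stores them, which is the point of the approach described above.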

Why do we need it:
Data, and massive amounts of it. It is estimated that we will have 1.8 zettabytes of data by 2011 [3]. In the above example, we probably already have too much data in PACS to move it to a central location for large statistical image-processing studies. One example of such image processing is medical image categorization, where the individual image computations are quite complex [4]. With the introduction of EHRs, data that is now inaccessible will become accessible, and the amount of data that we could analyze will become overwhelming. Even though the computations may not be as complex as image processing, simply running statistics on that volume of data will be time-consuming but important work for public health.

What is Hadoop being used for in the bioinformatics area:
The list of Hadoop projects is already getting quite long. It ranges from behavioral social studies like 'Large-Scale Behavioral Targeting' to studies of 'prescription compatibility', down to genomics and studies like 'Multiple Alignment of protein sequences...' [5].

The Hadoop framework is already being used by many organizations (Amazon, Google and OHSU [6]), and its use is likely to grow and move into the health informatics area.

[1] Hadoop: The Definitive Guide. MapReduce for the Cloud. By Tom White. Publisher: O'Reilly Media. Released: May 2009.
[3] World's Data to Reach 1.8 Zettabytes by 2011. http://www.dailytech.com/Worlds+Data+to+Reach+18+Zettabytes+by+2011/article11055.htm
[4] Medical Image Categorization and Retrieval For PACS Using the GMM-KL Framework. http://www.almaden.ibm.com/cs/projects/aalim/papers/IEEE_Medical_Categorization.pdf
[5] Mapreduce & Hadoop Algorithms in Academic Papers (3rd update). http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/

Submitted by Gregg Hoshovsky