Design and Development of a Medical Big Data Processing System Based on Hadoop
J Med Syst


Qin Yao, Yu Tian, Peng-Fei Li, Li-Li Tian, Yang-Ming Qian, Jing-Song Li
Medicine (miscellaneous) / Information Systems / Health Information Management / Health Informatics






Received: 24 December 2014 / Accepted: 26 January 2015 / © Springer Science+Business Media New York 2015

Abstract Secondary use of medical big data is increasingly popular in healthcare services and clinical research. Understanding the logic behind medical big data reveals trends in hospital information technology and is of great significance for hospital information systems that are designing and expanding their services. Big data has four characteristics, Volume, Variety, Velocity and Value (the 4 Vs), that make traditional standalone systems incapable of processing these data. Apache Hadoop MapReduce is a promising software framework for developing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. With the Hadoop framework and the MapReduce application program interface (API), we can more easily develop our own MapReduce applications that run on a Hadoop cluster and scale up from a single node to thousands of machines. This paper investigates a practical case of a Hadoop-based medical big data processing system. We developed this system to intelligently process medical big data and uncover some features of hospital information system user behaviors. This paper studies user behaviors regarding various data produced by different hospital information systems in daily work. We also built a five-node Hadoop cluster to execute distributed MapReduce algorithms. Compared with a single node, our distributed algorithms show promise in facilitating efficient processing of medical big data in healthcare services and clinical research. Additionally, with medical big data analytics, we can design our hospital information systems to be much more intelligent and easier to use by making personalized recommendations.

Keywords Medical big data · Hadoop · MapReduce · User-generated content · Sqoop · Mahout


Health information technologies (HITs) have been pushing healthcare services and clinical research to improve quickly, but the medical data that are produced still have inherent features and complexities that are difficult to address.

Medical data sets are continuously becoming larger, making it increasingly difficult for standalone systems to process them. Understanding the logic behind medical big data (MBD) has great significance for designing hospital information systems (HISs) that can be used for recommending services, and it may help overcome some barriers to HIT adoption [1–4] by exploring better functions of those systems.

This article is part of the Topical Collection on Transactional Processing

Q. Yao · Y. Tian · P.-F. Li · J.-S. Li (*)
Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
e-mail:

L.-L. Tian · Y.-M. Qian
Navy General Hospital, Beijing, China

J Med Syst (2015) 39:23
DOI 10.1007/s10916-015-0220-8

Recent developments in open-source software (OSS), namely Apache Hadoop and related projects, provide a framework for petabyte-scale data warehouses on Linux clusters, thus enabling fault-tolerant parallelized analysis of MBD using the MapReduce [5, 6] programming model. A wide variety of organizations and researchers have used Hadoop for healthcare services and clinical research projects [7–9]. Taylor (2010) gave a detailed introduction to how Hadoop is used in bioinformatics [10], and Schatz (2009) developed an OSS package named CloudBurst that provides a model for parallelizing algorithms using Hadoop MapReduce [11].

Recently, analysis of user behaviors has become a popular approach to understand users, and Hadoop performs well at analyzing user behaviors by mining large-scale data sets [12, 13].
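To make the MapReduce programming model mentioned above more concrete, the following is a minimal single-process sketch in plain Python. It does not use the Hadoop API and is not the system described in this paper; it only imitates the map, shuffle, and reduce steps on a tiny in-memory log. The record layout `(user_id, his_module)`, the sample values in `LOG`, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical HIS access log: (user_id, his_module). Values are invented.
LOG = [
    ("dr_a", "EMR"),
    ("dr_a", "CPOE"),
    ("dr_b", "PACS"),
    ("dr_a", "EMR"),
    ("dr_b", "EMR"),
]


def map_phase(record):
    """Emit one ((user, module), 1) pair for each log record."""
    user, module = record
    yield (user, module), 1


def shuffle(pairs):
    """Group emitted values by key, imitating the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum all counts that were emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially in a single process."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    for key, count in sorted(run_job(LOG).items()):
        print(key, count)
```

On a real cluster the map and reduce functions would run on different nodes and the framework would handle grouping and fault tolerance; here everything runs in one process so that the data flow is easy to follow.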

Although studies of human behaviors have long been performed in the fields of healthcare services and clinical research, quantitative analyses based on large-scale data have not been common because of a lack of methods for big data analytics. Some of these studies have used traditional methodologies such as survey questionnaires and have focused primarily on individuals or small groups of respondents [14–18]. Other studies have performed quantitative analyses but with methods that could only analyze small-scale data sets and required pre-established hypotheses and purposeful sampling methods [19–23]. In this new era of “Big Data”, researchers have the opportunity and the ability to use the wealth of available MBD, which promises to provide a comprehensive view of complex human behaviors. With the assistance of cloud computing technologies, studies of human behaviors have advanced rapidly. One objective of these studies has been to quantitatively uncover the inherent features of human behaviors and determine how these behaviors have evolved using big data analytics.

The widespread use of HISs, such as electronic medical records (EMRs), computerized physician order entry (CPOE), picture archiving and communication systems (PACS), and other clinical and administrative systems, yields ultra-large volumes of data that are closely related to and connected with users’ behaviors. Secondary uses of such MBD are becoming increasingly popular in healthcare services and clinical research [24–26]. One use of MBD is user behavior targeting (BT) [27, 28], which helps us to target medical practitioners’ habits and interests regarding how they use the HISs.
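As a rough illustration of what behavior targeting could look like on HIS usage data, the sketch below turns a list of usage events into a per-user interest profile (the share of each user's activity spent in each system) and picks the dominant system. This is an assumed, simplified formulation, not the method used in the cited work; the event layout, the sample `EVENTS`, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical usage events: (user_id, his_module). Values are invented.
EVENTS = [
    ("dr_a", "EMR"),
    ("dr_a", "EMR"),
    ("dr_a", "EMR"),
    ("dr_a", "CPOE"),
    ("dr_b", "PACS"),
    ("dr_b", "EMR"),
]


def build_profiles(events):
    """Return {user: {module: share of that user's events}}."""
    counts = defaultdict(Counter)
    for user, module in events:
        counts[user][module] += 1
    profiles = {}
    for user, counter in counts.items():
        total = sum(counter.values())
        profiles[user] = {module: n / total for module, n in counter.items()}
    return profiles


def top_module(profile):
    """Return the module with the largest share; ties resolve alphabetically."""
    return min(profile, key=lambda module: (-profile[module], module))


if __name__ == "__main__":
    for user, profile in sorted(build_profiles(EVENTS).items()):
        print(user, top_module(profile), profile)
```

A profile of this kind is only a starting point; a production system would likely weight events by recency or task type, which is outside the scope of this sketch.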

Harnessing collective intelligence and its related algorithms has been demonstrated to be a trustworthy method for studying medical data. Kim et al. (2014) used collective intelligence to reduce the potential risk of misleading online information and the accompanying safety issues [29]. Alor-Hernández et al. (2012) developed a content-based image retrieval system that uses collective intelligence; their system supports the medical community in providing differential diagnoses related to diseases of the breast [30].