International Journal of Bio-Science and Bio-Technology

Vol.7, No.2 (2015), pp.37-46

ISSN: 2233-7849 IJBSBT

Copyright ⓒ 2015 SERSC

Breast Cancer Classification: Comparative Performance Analysis of Image Shape-Based Features and Microarray Gene Expression Data


Ahmed Fawzi Otoom 1, Emad E. Abdallah 2 and Maen Hammad 3

1,2,3 Faculty of Prince Al-Hussein Bin Abdullah II for Information Technology

The Hashemite University, Zarqa, Jordan { 1 aotoom, 2 emad, 3 mhammad},


Recently, classifier systems have received growing attention in medical diagnosis. Medical diagnostic tools provide automated procedures for objective decisions by making use of quantitative measures and machine learning techniques, and they help medical experts diagnose diseases. One such disease is breast cancer, which is the second largest cause of cancer deaths among women. Building an intelligent diagnostic tool requires an effective set of features. Two types of feature sets have been widely used for breast cancer diagnosis: image shape-based features and microarray gene expression data. However, there has been no work that directly compares the classification performance of these two feature sets. In this paper, we review related works that used both types of feature sets, along with the machine learning algorithms they applied. Moreover, we run extensive experiments to compare the classification performance of the two feature sets. Our results show that the image shape-based features are more discriminative for breast cancer classification when tested with ten-fold cross validation. To check the robustness of the best performing feature set, we further examine it with five-fold cross validation and with a variety of generative classification algorithms.

Keywords: Breast cancer, tumor image shape-based features, microarray gene expression data, generative classifiers, discriminative classifiers, machine learning.

1. Introduction and Related Work

Cancer is a general name for a group of more than 100 diseases. It starts when cells in a part of the body begin to grow out of control. Breast cancer is a widespread type of cancer. According to the American Cancer Society, around 39,620 women were expected to die from it in 2013 in the US alone, placing it second after lung cancer as a cause of cancer deaths in women [5]. Breast cancer is a malignant tumor that starts in the cells of the breast. A malignant tumor is a group of cancer cells that can invade surrounding tissues or spread to distant areas of the body. It is important to note that not all tumors are cancer; tumors that are not cancer are called benign [5]. Early detection of breast cancer can save thousands of lives each year and can help in treating it successfully. This was our major motive for studying intelligent systems that can provide early diagnosis of the disease.

Recently, there has been growing attention to the areas of medical informatics and machine learning. Much research is directed towards the development of diagnostic tools that are designed to support medical experts (e.g., [2], [9-11], [13-14], [18-19]). Machine learning techniques are used to provide a second view in the analysis of medical data with the incorporation of knowledge-based approaches [11]. They can help medical experts examine medical data in a shorter time. Moreover, they can provide early diagnosis of diseases and prompt early action to save patients' lives. Early detection is an important issue in many medical problems and, in particular, in breast cancer, as early discovery of the disease improves the chance of successful treatment.

In this work, we target the problem of breast cancer diagnosis, which is a classification problem. Any classification problem has two main stages: feature set extraction and classification. Two commonly applied types of feature sets for breast cancer classification are image shape-based features and microarray gene expression features.
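To make the classification stage and its evaluation concrete, the following is a minimal sketch of k-fold cross validation around a simple generative classifier (Gaussian naive Bayes), written with the Python standard library only. It is an illustration under stated assumptions, not the experimental setup of this paper: the function names (`k_fold_indices`, `fit_gaussian_nb`, `predict`, `cross_validate`, `make_toy_data`) are hypothetical, the folds are not stratified, and the data are synthetic two-dimensional points rather than breast cancer features.

```python
import math
import random


def make_toy_data(n_per_class=100, seed=0):
    """Synthetic two-class data (assumption: NOT real breast cancer data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    return X, y


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_gaussian_nb(X, y):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(y)), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


def cross_validate(X, y, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = fit_gaussian_nb(train_X, train_y)
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Calling `cross_validate(X, y, k=10)` on the output of `make_toy_data()` gives a ten-fold estimate, and `k=5` gives the five-fold variant. Replacing the toy data with a real feature matrix and label vector is the only change needed to evaluate either feature set under the same protocol.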

Generally, image-based features are extracted from images based on edge information, local regions, or measurements of the pixel intensities of the object(s) in the image [3]. For breast cancer data sets, features are usually computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. FNA is a diagnostic procedure used to investigate lumps or masses under the skin: fluid is extracted from the breast mass with a small needle and then visually inspected under a microscope [11]. The features describe the shape of the cell nuclei present in the image. A common source for these features is the Wisconsin Diagnostic Breast Cancer (WDBC) data set, which is publicly available from the UCI repository [1]. In this data set, thirty features describe the cell nuclei in each image: ten measurements are computed for each cell nucleus, and the mean, standard error, and worst (largest) value of each measurement are recorded per image. The features are calculated using an active contour model, also known as the snakes framework. One example is the radius, which is measured by averaging the lengths of the radial line segments defined by the centroid of the snake and the individual snake points. Another example is compactness, which is calculated by dividing the square of the perimeter by the area. The nucleus perimeter is the total distance between consecutive snake points, and the area is measured by counting the number of pixels in the interior of the snake and adding one-half of the pixels on the perimeter [10]. WDBC has been widely used in the pattern recognition community (e.g., [8-11]).
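The radius, perimeter, and compactness definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build WDBC. It assumes that the snake is given as an ordered list of (x, y) points around a closed contour and that the centroid is the mean of those points. Because no pixel image is available here, the area is computed with the shoelace polygon formula rather than by the pixel-counting rule described in the text. The function name `shape_features` is hypothetical.

```python
import math


def shape_features(points):
    """Shape descriptors for a closed contour given as ordered (x, y) points."""
    n = len(points)

    # Assumption: the snake centroid is the mean of the snake points.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Radius: average length of the segments from the centroid to each point.
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the contour
        # Perimeter: total distance between consecutive snake points.
        perimeter += math.hypot(x1 - x0, y1 - y0)
        # Shoelace term (stand-in for the pixel-counting area in the text).
        twice_area += x0 * y1 - x1 * y0
    area = abs(twice_area) / 2.0

    # Compactness: square of the perimeter divided by the area.
    compactness = perimeter ** 2 / area

    return {
        "radius": radius,
        "perimeter": perimeter,
        "area": area,
        "compactness": compactness,
    }
```

For a square contour such as `[(0, 0), (2, 0), (2, 2), (0, 2)]` the perimeter is 8 and the area is 4, so the compactness is 16. A circle has the smallest possible value of this measure, 4π, and more irregular outlines give larger values, which is why compactness serves as a descriptor of nuclear shape.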

Another commonly applied type of features is gene expression features (e.g., [2], [14], [16], [18]). Gene expression patterns are captured using cDNA microarrays, a novel biotechnology that is increasingly used in cancer research. They allow the expression levels of thousands of genes to be monitored concurrently. Therefore, they can lead to a better understanding of the molecular variations among tumors and to a more accurate classification [11]. cDNA microarrays consist of thousands of individual