Predicting Breast Cancer Survivability

A Comparison of Three Data Mining Methods

  • Omead I. Hussain, Department of Banking and Financial Science, Cihan University-Erbil, Kurdistan Region, Iraq
Keywords: Predicting Breast Cancer, Data mining, SEER database, Artificial Neural Network

Abstract

This study concentrates on predicting breast cancer survivability using data mining and compares three main predictive modeling tools. Specifically, we used three popular data mining methods: two from machine learning (artificial neural networks and decision trees) and one from statistics (logistic regression). We aimed to choose the best model based on the efficiency of each model, the variables most effective for these models, and the most important common predictors. We defined the three main modeling aims and uses by demonstrating the purpose of the modeling. Data mining makes it possible to characterize and describe the trends and patterns that reside in data and information. The preprocessed data set consisted of 87 variables and 457,389 records, which became 93 variables and 90,308 records; these data came from the SEER database. We investigated more than three data mining techniques and ultimately chose to focus on artificial neural networks, decision trees, and logistic regression, implemented in SAS Enterprise Miner 5.2, which in our view is a suitable system given the facilities it offers and the results it produced. Several experiments were conducted using these algorithms, and the resulting prediction implementations were evaluated by comparing the techniques against one another. We found that the neural network performed much better than the other two techniques. Finally, the model we chose has the highest accuracy, and specialists in the breast cancer field can use and depend on it.
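A comparison of this kind, in which several classifiers are scored on the same held-out records and the one with the highest accuracy is kept, might be sketched as follows. This is only an illustration in Python, not the SAS Enterprise Miner workflow used in the study. The labels (1 = survived, 0 = did not survive), the placeholder names `model_a`, `model_b`, and `model_c`, and all prediction values are invented toy data, not results from the SEER experiments.

```python
from typing import Dict, List


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Return accuracy, sensitivity, and specificity for binary labels (1/0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy held-out labels (assumption: 1 = survived, 0 = did not survive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# Hypothetical predictions from three placeholder models (made-up values).
predictions = {
    "model_a": [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "model_b": [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "model_c": [1, 1, 0, 0, 1, 0, 1, 0, 1, 1],
}

# Score every model on the same records, then keep the most accurate one.
scores = {name: evaluate(y_true, y_pred) for name, y_pred in predictions.items()}
best_name = max(scores, key=lambda name: scores[name]["accuracy"])

for name, metrics in scores.items():
    print(name, metrics)
print("best by accuracy:", best_name)
```

Sensitivity and specificity are reported alongside accuracy here because accuracy alone can be misleading when the two outcome classes are unevenly represented; whether the study used these additional measures is not stated in the abstract.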

Published
2020-02-10
How to Cite
Hussain, O. (2020). Predicting Breast Cancer Survivability. Cihan University-Erbil Journal of Humanities and Social Sciences, 4(1), 17-30. https://doi.org/10.24086/cuejhss.v4n1y2020.pp17-30
Section
Articles