An Elastic Net Approach to Logistic Regression for Genetic Selection in High-Dimensional Brain Cancer Data

Authors

Nozad H. Mahmood
Dler H. Kadir

DOI:

https://doi.org/10.24086/cuesj.v9n1y2025.pp14-23

Keywords:

Brain Cancer, Elastic Net, Regularization Techniques, Gene Selection, Multinomial Logistic Model, High-Dimensional Data

Abstract

This study addresses the difficulty of treating brain cancer that arises from the heterogeneous nature of the different brain tumor variants. Its objective was to identify essential genes present in multiple types of brain cancer using high-dimensional gene expression data from the Curated Microarray Database (CuMiDa). The dataset comprised 130 samples belonging to 4 subtypes of brain cancer and 16384 gene expression variables. Because the number of variables far exceeds the number of samples, the penalized elastic net method was used in conjunction with multinomial logistic regression (MLR) to cope with the curse of dimensionality. Model performance was evaluated with accuracy, the Kappa statistic, the area under the curve (AUC), and the F1-score. Elastic net proved effective at handling the large number of variables in the analysis: it restricted the set of genes carried forward to further analysis and highlighted subtype-specific expression signatures. The model achieved high precision and AUC values, indicating a good overall ability to distinguish the subtypes, with some AUC values close to a perfect score. Robust parameter estimation was supported by cross-validation and other statistical techniques for predictive model validation, implemented in R. These findings suggest that the best model for evaluating large-scale gene expression data from brain cancers is MLR with elastic net regularization. There is ample evidence that the selected genes contribute to brain cancer and serve as targets for therapy, which makes this study a good starting point for further investigation of their biological roles. The model should also be applied to other datasets of a quite different nature to test its validity. This, in turn, may point to improved diagnostic, prognostic, and therapeutic options for brain tumors.
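The abstract does not state the objective function, so the following is only a sketch of how elastic net penalized MLR is commonly written, using the parameterization popularized by the glmnet software. All symbols are notation assumed here for illustration, not taken from the paper: N samples, K subtypes, x_i the gene expression vector of sample i, y_{ik} an indicator that sample i belongs to subtype k, intercepts \beta_{0k}, coefficient vectors \beta_k, overall penalty strength \lambda \ge 0, and mixing parameter \alpha \in [0, 1].

```latex
\min_{\{\beta_{0k},\,\beta_k\}_{k=1}^{K}}
\; -\frac{1}{N}\sum_{i=1}^{N}\left[
      \sum_{k=1}^{K} y_{ik}\left(\beta_{0k} + x_i^{\top}\beta_k\right)
      - \log\sum_{l=1}^{K}\exp\left(\beta_{0l} + x_i^{\top}\beta_l\right)
   \right]
\; + \; \lambda\sum_{k=1}^{K}\left[
      \frac{1-\alpha}{2}\,\lVert\beta_k\rVert_2^{2}
      + \alpha\,\lVert\beta_k\rVert_1
   \right]
```

Under this form, \alpha = 1 gives the lasso penalty and \alpha = 0 the ridge penalty; intermediate values combine sparsity (gene selection) with shrinkage of correlated genes. For the dataset described above, N = 130 and K = 4, with each \beta_k having 16384 entries. The values of \lambda and \alpha are typically chosen by cross-validation; which values the authors used is not given here.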

Author Biographies

Nozad H. Mahmood, Department of Statistics and Information, College of Administration and Economics, Salahaddin University-Erbil, Erbil, Iraq

Nozad H. Mahmood is currently a full-time lecturer and the Head of the Business Administration Department at Cihan University-Sulaimaniya. He is also the director of statistical consulting for data analysis and training. He received his MS in Statistical Computing from the University of Central Florida in the US and a B.Sc. in Statistics from Salahaddin University-Erbil. His scholarly interests and expertise include variable selection, regularized regression, clustering, data mining and classification, dimension reduction, categorical data analysis, and experimental design.

Dler H. Kadir, Department of Statistics and Information, College of Administration and Economics, Salahaddin University-Erbil, Iraq

Dler H. Kadir is currently a full-time Assistant Professor in the Department of Statistics and Information, College of Administration and Economics, at Salahaddin University-Erbil. His research interests include Bayesian inference, MCMC, and statistical modeling.

Published

2025-01-20

How to Cite

Mahmood NH, Kadir DH. An Elastic Net Approach to Logistic Regression for Genetic Selection in High-Dimensional Brain Cancer Data. Cihan U Erbil SCI J [Internet]. 2025 Jan. 20 [cited 2026 Jun. 23];9(1):14-23. Available from: https://journals.cihanuniversity.edu.iq/index.php/cuesj/article/view/1338

Section

Research Article
Received 2024-11-21
Accepted 2024-12-20
