Abstract
The heterogeneous nature of breast cancer makes it necessary to study its molecular subtypes for the early prognosis and treatment of patients. Recent advances in genomics have enabled the use of gene expression data in breast cancer research as an alternative to traditional methods. In this regard, projects such as The Cancer Genome Atlas (TCGA) provide easy access to vast high-throughput sequencing gene expression data, including data for breast cancer. However, finding evidence that a set of genes is involved in a particular breast cancer subtype from such a large gene expression dataset is a demanding task. Here, we propose a machine learning classification model to uncover the significant genes associated with the breast cancer subtypes Basal, human epidermal growth factor receptor 2 (HER2), luminal A, and luminal B. RNA-Seq gene expression data from TCGA are used to classify tumor and normal samples and to identify an optimal subtype-specific set of genes. Experimental results show that the average classification accuracy for different gene subsets ranges from 75.36% to 77.74%, depending on the breast cancer subtype and feature selection method. Additionally, the feature scoring mechanism introduced in our model ranks the Feature Importance genes as three*, four*, five*, and six*. Kaplan–Meier survival analysis, composite network analysis, and Gene Ontology analysis are also conducted to highlight the biological significance of the Feature Importance genes. Given the classification results and the biological insight, we may conclude that the proposed model extracts a set of informative genes involved in breast cancer development, particularly in the Basal, HER2, luminal A, and luminal B subtypes.
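The abstract does not spell out how the star ranking is computed. As a rough illustration only, the sketch below assumes that a gene's star count is the number of feature selection methods that selected it, and that only genes with three or more stars are reported. That reading, the function names, the method names, and the gene symbols are all hypothetical placeholders, not the authors' procedure or data.

```python
from collections import Counter


def star_scores(selected_by_method):
    """Count, for each gene, how many feature selection methods picked it.

    Assumption: one star per method that selected the gene.
    """
    counts = Counter()
    for genes in selected_by_method.values():
        counts.update(set(genes))  # set() so each method votes at most once per gene
    return dict(counts)


def group_by_stars(scores, min_stars=3):
    """Group genes by star count, keeping only genes with at least min_stars."""
    groups = {}
    for gene, stars in scores.items():
        if stars >= min_stars:
            groups.setdefault(stars, []).append(gene)
    # Highest star count first, genes in alphabetical order within a group
    return {stars: sorted(genes) for stars, genes in sorted(groups.items(), reverse=True)}


# Hypothetical toy input: six placeholder methods, four placeholder genes
selected_by_method = {
    "method_1": ["GENE_A", "GENE_B", "GENE_C"],
    "method_2": ["GENE_A", "GENE_B"],
    "method_3": ["GENE_A", "GENE_B", "GENE_D"],
    "method_4": ["GENE_A", "GENE_C"],
    "method_5": ["GENE_A", "GENE_C"],
    "method_6": ["GENE_A", "GENE_D"],
}

scores = star_scores(selected_by_method)
ranked = group_by_stars(scores)
print(ranked)
```

In this toy input, GENE_A is picked by all six methods, GENE_B and GENE_C by three each, and GENE_D by only two, so GENE_D falls below the three-star cutoff and is not reported.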
Funding
This research has not received any funding or research grants in the course of study, research, or assembly of the manuscript.
Ethics declarations
The authors declare that they have no conflicts of interest. This article does not contain any studies involving animals or human participants performed by any of the authors.
About this article
Cite this article
Bhowmick, S.S., Bhattacharjee, D. Feature Importance Genes from Breast Cancer Subtypes Classification Employing Machine Learning. Russ J Genet 59 (Suppl 1), 110–122 (2023). https://doi.org/10.1134/S1022795423130021
DOI: https://doi.org/10.1134/S1022795423130021