Feature Importance Genes from Breast Cancer Subtypes Classification Employing Machine Learning

Russian Journal of Genetics

Abstract

The heterogeneous nature of breast cancer makes it necessary to study its molecular subtypes for the early prognosis and treatment of patients. Recent advances in genomics have enabled the use of gene expression data in breast cancer research as an alternative to traditional methods. In this regard, projects such as The Cancer Genome Atlas (TCGA) have provided easy access to vast amounts of high-throughput sequencing gene expression data, including data for breast cancer. However, finding evidence that a set of genes is involved in a particular breast cancer subtype within such a large gene expression dataset is a demanding task. Here, we propose a machine learning classification model to uncover the significant genes associated with the basal, human epidermal growth factor receptor 2 (HER2), luminal A, and luminal B breast cancer subtypes. RNA sequencing gene expression data from TCGA are used to classify tumor and normal samples and to identify an optimal, subtype-specific set of genes. Experimental results show that the average classification accuracy for different gene subsets ranges from 75.36% to 77.74%, depending on the breast cancer subtype and the feature selection method. Additionally, the feature scoring mechanism introduced in our model ranks the Feature Importance genes as three*, four*, five*, and six*. Kaplan–Meier survival analysis, composite network analysis, and Gene Ontology analysis are also conducted to highlight the biological significance of the Feature Importance genes. Given the classification results and the biological insight, we conclude that the proposed model extracts a set of informative genes involved in breast cancer development, particularly in the basal, HER2, luminal A, and luminal B subtypes.
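The Kaplan–Meier analysis mentioned in the abstract rests on the standard product-limit estimator. The sketch below is a minimal illustration of that estimator only, not the authors' pipeline: the function name `kaplan_meier` and the toy follow-up times are assumptions made for this example, and none of the numbers come from TCGA or from the paper.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Product-limit estimate of the survival curve.

    durations: follow-up time of each patient
    events:    1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # patients leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t
        at_risk -= leaving[t]
    return curve


# Hypothetical toy cohort (illustration only): six patients, two censored
toy_times = [5, 8, 12, 12, 20, 25]
toy_events = [1, 1, 0, 1, 0, 1]
toy_curve = kaplan_meier(toy_times, toy_events)
for time, prob in toy_curve:
    print(f"t={time}: S(t)={prob:.3f}")
```

In a study like this one, such a curve would typically be estimated separately for patients with high and low expression of a candidate gene and the two curves compared, for example with a log-rank test; that comparison step is not shown here.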

Funding

This research has not received any funding or research grants in the course of study, research, or assembly of the manuscript.

Author information

Corresponding author

Correspondence to S. S. Bhowmick.

Ethics declarations

The authors declare that they have no conflicts of interest. This article does not contain any studies involving animals or human participants performed by any of the authors.

About this article

Cite this article

Bhowmick, S.S., Bhattacharjee, D. Feature Importance Genes from Breast Cancer Subtypes Classification Employing Machine Learning. Russ J Genet 59 (Suppl 1), 110–122 (2023). https://doi.org/10.1134/S1022795423130021

  • DOI: https://doi.org/10.1134/S1022795423130021
