Breast Cancer Data Mining
Author: fluff26 • January 22, 2018 • Research Paper
Summary
Breast cancer is one of the leading causes of cancer death in women, and it is the most commonly diagnosed cancer among women in developed countries. Because the recurrence rate is high, accurate and early diagnosis is important for reducing fatalities. Many studies have analysed breast cancer data. In this project, we explore the applicability of the k-Nearest Neighbours (kNN) algorithm to determining whether a tumour is benign or malignant.
Problem statement
The objective of our project is to implement a classification model for breast cancer data. Based on the feature set, the model must assign each patient to either a ‘benign’ group, which does not have breast cancer, or a ‘malignant’ group, which shows strong evidence of the cancer.
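To make the classification task concrete, the following is a minimal sketch of a k-nearest-neighbour classifier in plain Python. It is not the implementation used in this project: the function name `knn_predict`, the choice of Euclidean distance, the value k = 3, and the toy samples are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample (assumed metric).
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_X, train_y)
    ]
    # Keep the k closest samples, comparing on distance only.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among the neighbours is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy samples with three graded features each (not real patient data).
train_X = [[1, 1, 1], [2, 1, 2], [1, 2, 1], [8, 9, 7], [10, 8, 9], [7, 10, 8]]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

low_grades = knn_predict(train_X, train_y, [2, 2, 1])
high_grades = knn_predict(train_X, train_y, [9, 9, 9])
```

With an odd k and two classes, the vote cannot tie, which is one reason an odd value is a common choice for binary problems.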
Motivation and novelty
Although cancer research is generally clinical and/or biological in nature, data-driven statistical research is becoming a common complement. This has given rise to automated diagnostic systems, whose aim is to assist doctors in making diagnostic decisions. Predicting the outcome of a disease is one of the most interesting and challenging tasks to which data mining techniques are being applied. Survival analysis is a field of medical prognosis that applies various methods to historical data in order to predict the survival of a particular patient suffering from a disease over a particular time period.
Our work concentrates on the Breast Cancer Wisconsin dataset. Prior work on the same dataset is reported in [1, 2, 3]. These papers evaluated the performance of classification models such as Naïve Bayes, logistic regression, ID3, C4.5, and neural networks, and report an average accuracy of around 95%. The sensitive nature and serious effects of breast cancer, together with the promising results of this related research, have motivated us to investigate the suitability of the kNN (k-Nearest Neighbours) algorithm in the breast cancer domain.
Approach
Dataset description
In this project, we have taken the Breast Cancer Wisconsin (Original) dataset [6] from the UCI machine learning repository as our input data. The dataset has 699 instances with 10 attributes (excluding the ID attribute). The dependent class attribute has two possible values: benign (represented by a value of 2) and malignant (represented by a value of 4). The remaining variables are graded on an interval scale of 1 to 10, where 1 denotes the most normal state (minimum likelihood of breast cancer) and 10 denotes the most abnormal state (maximum likelihood of cancer).
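The record layout described above can be sketched as a small parser. This is an illustrative sketch, not the project's own preprocessing code. It assumes comma-separated rows with the ID first, the nine graded features next, and the class code last, and it assumes that a missing value is marked with `?` and that such rows are simply skipped. The helper `parse_rows` and the sample rows are hypothetical.

```python
import csv
import io

# Class codes as described in the text: 2 = benign, 4 = malignant.
CLASS_NAMES = {2: "benign", 4: "malignant"}


def parse_rows(text):
    """Return (features, labels) from rows of: ID, nine graded features, class code."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # ignore blank lines
        values = row[1:]  # drop the ID attribute, which carries no diagnostic signal
        if "?" in values:
            continue  # assumption: "?" marks a missing value; skip the whole row
        *graded, class_code = (int(v) for v in values)
        features.append(graded)
        labels.append(CLASS_NAMES[class_code])
    return features, labels


# Invented rows in the assumed layout (not actual records from the dataset).
SAMPLE = """\
1001,5,1,1,1,2,1,3,1,1,2
1002,8,10,10,8,7,10,9,7,1,4
1003,4,1,1,3,2,?,3,1,1,2
"""

X, y = parse_rows(SAMPLE)
```

Dropping incomplete rows is only one option. Imputing a value such as the column median would keep more instances, at the cost of introducing estimated grades.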
...