Simple Breast Cancer Classification using different classifiers in Julia with Python integration for beginners
With Julia gaining popularity in the data science and machine learning community, I decided to build a simple project to show that switching from Python to Julia is not that hard.
What is Cancer?
Cancer is a disease in which cells in the body grow out of control. Except for skin cancer, breast cancer is the most common cancer in women in the United States. Deaths from breast cancer have declined over time, but it remains the second leading cause of cancer death among women overall and the leading cause of cancer death among Hispanic women [1].
Tumors can be of two types: malignant (cancerous) or benign (non-cancerous). In this report we will look at different machine learning algorithms and compare how they perform on the famous breast cancer dataset.
Who is at higher risk?
Breast cancer is not a transmissible or infectious disease. Unlike some cancers that have infection-related causes, such as human papillomavirus (HPV) infection and cervical cancer, there are no known viral or bacterial infections linked to the development of breast cancer.
Signs and symptoms
Generally, symptoms of breast cancer include:
- a breast lump or thickening;
- alteration in size, shape or appearance of a breast;
- dimpling, redness, pitting or other alteration in the skin;
- change in nipple appearance or alteration in the skin surrounding the nipple (areola); and/or
- abnormal nipple discharge [2].
Let’s talk about our dataset
Get the dataset here
The dataset poses a classic binary classification problem: figuring out whether a tumor is benign or malignant. It comprises 33 columns, most of which describe the dimensions and texture of each person's tumor, namely:
➢ radius (mean of distances from center to points on the perimeter)
➢ texture (standard deviation of gray-scale values)
➢ perimeter
➢ area
➢ smoothness (local variation in radius lengths)
➢ compactness (perimeter² / area - 1.0)
➢ concavity (severity of concave portions of the contour)
➢ concave points (number of concave portions of the contour)
➢ symmetry
➢ fractal dimension (“coastline approximation” - 1)
➢ Diagnosis: either M (Malignant) or B (Benign)
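To make the compactness formula above concrete, here is a minimal sketch that evaluates perimeter² / area - 1.0 for two simple shapes. The circle and square, and the names `compactness`, `r` and `s`, are illustrative assumptions and are not taken from the dataset.

```julia
# Compactness as defined in the feature list: perimeter² / area - 1.0
compactness(perimeter, area) = perimeter^2 / area - 1.0

# Hypothetical shapes, used only to illustrate the formula
r = 2.0                                   # circle radius
circle = compactness(2π * r, π * r^2)     # equals 4π - 1 for any radius

s = 3.0                                   # square side length
square = compactness(4s, s^2)             # equals 16 - 1 = 15 for any side

println("circle: ", round(circle, digits = 3))
println("square: ", round(square, digits = 3))
```

A circle gives the smallest possible value, so a larger compactness means an outline that is further from round.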
So many columns, but are they all important?
We can figure that out in the following analysis!
Importing important modules
Julia supports a lot of Python modules, including TensorFlow!👀
using ScikitLearn
using Pandas
using Seaborn
ENV["COLUMNS"] = 1000;
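The dataset had a null column, which had to be removed. After that, we compute the correlation of every column with the target column diagnosis_M:
# df is assumed to be the dataset, already loaded into a Pandas DataFrame
df = drop(df, columns = "Unnamed: 32");
# df1 is assumed to be df with the diagnosis column one-hot encoded (which gives diagnosis_M)
corr_mat = corr(df1)["diagnosis_M"];
print(corr_mat)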
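The value that `corr(df1)["diagnosis_M"]` reports for a column is the Pearson correlation between that column and the encoded target. Below is a minimal sketch of the same idea in plain Julia, using only the Statistics standard library. The eight data points are made up for illustration and are not rows from the dataset.

```julia
using Statistics  # standard library: cor

# Hypothetical toy data: diagnosis labels and one numeric feature
diagnosis   = ["M", "B", "B", "M", "B", "M", "B", "B"]
radius_mean = [17.9, 11.4, 12.1, 20.6, 10.9, 19.7, 13.0, 12.5]

# Encode the target: 1.0 for malignant, 0.0 for benign.
# This plays the role of the diagnosis_M column.
diagnosis_M = Float64.(diagnosis .== "M")

# Pearson correlation between the feature and the encoded target
r = cor(radius_mean, diagnosis_M)
println("corr(radius_mean, diagnosis_M) = ", round(r, digits = 3))
```

A value close to 1 means that larger values of the feature go hand in hand with a malignant diagnosis.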
We divide the dataset into two parts, one with all the columns as features and one with only the columns most highly correlated with our target, to uncover interesting insights.
Important/highly related columns that could be considered as features:
➢ radius_mean 0.730029
➢ perimeter_mean 0.742636
➢ area_mean 0.708984
➢ concavity_mean 0.696360
➢ concave points_mean 0.776614
➢ radius_worst 0.776454
➢ perimeter_worst 0.782914
➢ area_worst 0.733825
➢ concavity_worst 0.659610
➢ concave points_worst 0.793566
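Picking columns by correlation can be sketched as a simple filter-and-sort. The correlation values below are copied from the list above. The cutoff of 0.75 is a hypothetical value chosen only for illustration; it is stricter than whatever was used to produce the list, so it keeps fewer columns.

```julia
# Correlations with diagnosis_M, copied from the list above
correlations = Dict(
    "radius_mean"          => 0.730029,
    "perimeter_mean"       => 0.742636,
    "area_mean"            => 0.708984,
    "concavity_mean"       => 0.696360,
    "concave points_mean"  => 0.776614,
    "radius_worst"         => 0.776454,
    "perimeter_worst"      => 0.782914,
    "area_worst"           => 0.733825,
    "concavity_worst"      => 0.659610,
    "concave points_worst" => 0.793566,
)

# Hypothetical cutoff, chosen only for illustration
threshold = 0.75

# Keep the columns whose correlation magnitude exceeds the cutoff, strongest first
selected = sort([name for (name, r) in correlations if abs(r) > threshold];
                by = name -> correlations[name], rev = true)

println(selected)
```

Using `abs(r)` means strongly negative correlations would be kept as well, since they are just as informative for a classifier.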
Heatmap
A heat map (or heatmap) is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions.
The heatmap shows how strongly each of these parameters relates to our target values; the higher, the better.
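The numbers behind such a heatmap form a correlation matrix with one row and one column per feature. Here is a minimal sketch of that matrix in plain Julia. The toy values and the three column meanings (radius, perimeter, texture) are assumptions for illustration, not data from the dataset.

```julia
using Statistics  # standard library: cor

# Hypothetical toy matrix: rows are samples, columns are radius, perimeter, texture
X = [10.0  63.0  20.1;
     12.0  76.0  14.3;
     14.0  88.5  18.9;
     17.0 107.0  15.2;
     20.0 126.5  21.7]

# cor on a matrix returns the column-by-column Pearson correlation matrix,
# which is the grid of values a heatmap colors
C = cor(X)
display(round.(C, digits = 2))
```

The matrix is symmetric with ones on the diagonal, which is why a correlation heatmap is mirrored across its diagonal.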
Pairplot
Visualizing some features against each other with respect to Diagnosis:
Results of the Machine Learning Models
Conclusion
- With every classifier except Decision Tree, the model that uses all the fields as features performs better than the one that uses only the highly correlated fields.
- All classifiers perform well on the selective model, while only a few of them perform well on the complete model.
- AdaBoost, SVM and Random Forest seem to be the most reliable algorithms for training the model, consistently delivering accuracies of 95% to 99% in both models.
GitHub repo: https://github.com/HarshitSati/Breast_cancer_Julia
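The kind of comparison summarized above (one model trained on all columns, one on a selected subset, both scored on held-out rows) can be sketched in plain Julia. This is only an illustration under stated assumptions: the nearest-centroid classifier is a simple stand-in and not one of the classifiers used in this project, the data is synthetic, and the names `fit_centroids`, `predict_centroids` and `accuracy` are hypothetical helpers.

```julia
using Random, Statistics  # standard library

# Stand-in classifier (nearest centroid), NOT one of the classifiers from the article
function fit_centroids(X, y)
    classes = unique(y)
    # One mean vector per class
    centroids = [vec(mean(X[y .== c, :], dims = 1)) for c in classes]
    return classes, centroids
end

function predict_centroids(model, X)
    classes, centroids = model
    # Assign each row to the class whose centroid is closest (squared Euclidean distance)
    return [classes[argmin([sum(abs2, row .- c) for c in centroids])] for row in eachrow(X)]
end

# Fraction of predictions that match the true labels
accuracy(y_true, y_pred) = mean(y_true .== y_pred)

# Hypothetical synthetic data: 100 samples per class, 4 features;
# only the first two features separate the classes
rng = MersenneTwister(42)
n = 100
X = vcat(randn(rng, n, 4), randn(rng, n, 4) .+ [5.0 5.0 0.0 0.0])
y = vcat(fill("B", n), fill("M", n))

# Shuffle the row indices, then hold out 30% of the rows for testing
idx = shuffle(rng, 1:2n)
cut = round(Int, 0.7 * 2n)
train_idx, test_idx = idx[1:cut], idx[cut+1:end]

# Compare a model using every column with one using a selected subset
for (name, cols) in [("all features", 1:4), ("selected features", 1:2)]
    model = fit_centroids(X[train_idx, cols], y[train_idx])
    acc = accuracy(y[test_idx], predict_centroids(model, X[test_idx, cols]))
    println(name, ": ", round(100 * acc, digits = 1), "% accuracy")
end
```

Scoring both models on the same held-out rows keeps the comparison fair, because neither model has seen those rows during training.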
References and Further reading
- Source: Division of Cancer Prevention and Control, Centers for Disease Control and Prevention — https://www.cdc.gov/cancer/breast/basic_info/index.html.
- Source: WHO
- Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian (1995). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29]. University of Wisconsin.