Simple Breast Cancer Classification using different classifiers in Julia with Python integration for beginners

May 6, 2021

With Julia gaining popularity within the data science and machine learning community, I decided to make a simple project to show that switching from Python to Julia is not that hard.

What is Cancer?

Cancer is a disease in which cells in the body grow out of control. Except for skin cancer, breast cancer is the most common cancer in women in the United States. Deaths from breast cancer have declined over time, but breast cancer remains the second leading cause of cancer death among women overall and the leading cause of cancer death among Hispanic women [1].

Tumors can be of two types: malignant (cancerous) or benign (non-cancerous). In this report we train several machine learning algorithms on the well-known breast cancer dataset and compare how they perform.

Who is at more risk?

Breast cancer is not a transmissible or infectious disease. Unlike some cancers that have infection-related causes, such as human papillomavirus (HPV) infection and cervical cancer, there are no known viral or bacterial infections linked to the development of breast cancer.

Signs and symptoms

Generally, symptoms of breast cancer include:

a breast lump or thickening;
alteration in size, shape or appearance of a breast;
dimpling, redness, pitting or other alteration in the skin;
change in nipple appearance or alteration in the skin surrounding the nipple (areola); and/or
abnormal nipple discharge [2].

Let’s talk about our dataset

Get the dataset here

Source: KPMG virtual internship on Forage

The dataset poses a classic binary classification problem: deciding whether a tumor is benign or malignant. It comprises 33 columns, most of which describe the dimensions and texture of each person's tumor, namely:

➢ radius (mean of distances from center to points on the perimeter)
➢ texture (standard deviation of gray-scale values)
➢ perimeter
➢ area
➢ smoothness (local variation in radius lengths)
➢ compactness (perimeter² / area − 1.0)
➢ concavity (severity of concave portions of the contour)
➢ concave points (number of concave portions of the contour)
➢ symmetry
➢ fractal dimension (“coastline approximation” − 1)
➢ Diagnosis: either M (Malignant) or B (Benign)
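To make the compactness formula from the list concrete, here is a minimal Julia sketch that evaluates it for two made-up shapes. The function name `compactness` and both shapes are assumptions chosen for illustration; they are not part of the dataset or the original notebook.

```julia
# Compactness as defined in the feature list: perimeter^2 / area - 1.0
compactness(perimeter, area) = perimeter^2 / area - 1.0

# Hypothetical shape 1: a perfect circle of radius 2
r = 2.0
circle = compactness(2π * r, π * r^2)   # equals 4π - 1 for any radius

# Hypothetical shape 2: an elongated 4 x 1 rectangle
rect = compactness(2 * (4.0 + 1.0), 4.0 * 1.0)   # 100 / 4 - 1 = 24.0

println("circle:    ", circle)
println("rectangle: ", rect)
```

A circle gives the smallest value this formula can produce, so larger values point to a less round, more irregular outline.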

So many columns, but are they all important?
We can figure that out in the following analysis!

Importing important modules

Julia supports a lot of Python modules, including TensorFlow! 👀

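using ScikitLearn  # scikit-learn style machine learning API for Julia
using Pandas       # Julia wrapper around Python's pandas
using Seaborn      # Julia wrapper around Python's seaborn
ENV["COLUMNS"] = 1000;  # allow wide output so columns are not truncated when printed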

The dataset had a null column, which had to be removed:

df = drop(df, columns="Unnamed: 32");

Class distribution: 357 benign, 212 malignant

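# df1 is assumed to be df with diagnosis one-hot encoded (e.g. via get_dummies), which yields a diagnosis_M column
corr_mat = corr(df1)["diagnosis_M"];  # correlation of every column with the malignant indicator
print(corr_mat)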

Correlation between various fields and Malignant Diagnosis

We build two feature sets from the dataset: one with all the columns as features and one with only the columns most strongly correlated with our target. Comparing the two shows how much the weaker columns actually contribute.
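The idea of ranking columns by their correlation with the malignant indicator can be sketched in plain Julia with the `Statistics` standard library. Everything here is an assumption for illustration: the six toy rows, the two column names picked, and the 0.65 threshold are made up, and the original analysis used `corr` from Pandas rather than `cor`.

```julia
using Statistics

# Toy data (hypothetical values, not taken from the real dataset)
diagnosis = ["M", "B", "M", "B", "B", "M"]
features = Dict(
    "radius_mean"   => [18.0, 11.5, 20.1, 12.3, 13.0, 17.2],
    "symmetry_mean" => [0.18, 0.19, 0.17, 0.18, 0.20, 0.19],
)

# Encode the target as 1.0 for malignant and 0.0 for benign,
# playing the role of the diagnosis_M column
is_malignant = Float64.(diagnosis .== "M")

# Pearson correlation of each feature with the indicator
correlations = Dict(name => cor(col, is_malignant) for (name, col) in features)

# Keep only the features whose correlation exceeds an illustrative threshold
threshold = 0.65
selected = [name for (name, r) in correlations if r > threshold]

println(correlations)
println("selected: ", selected)
```

On this toy data only `radius_mean` clears the threshold, which mirrors how the selective feature set keeps the strongly correlated columns and drops the rest.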

Important/highly related columns that could be considered as features:

➢ radius_mean 0.730029
➢ perimeter_mean 0.742636
➢ area_mean 0.708984
➢ concavity_mean 0.696360
➢ concave points_mean 0.776614
➢ radius_worst 0.776454
➢ perimeter_worst 0.782914
➢ area_worst 0.733825
➢ concavity_worst 0.659610
➢ concave points_worst 0.793566

Heatmap

A heat map (or heatmap) is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions.

Visualize the relations between the highest correlated columns.

The heatmap shows how strongly each of these parameters correlates with our target and with the other parameters: the higher the value, the stronger the relationship.
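The numbers behind such a heatmap form a correlation matrix. As a rough sketch, the matrix can be computed with `cor` from Julia's `Statistics` standard library; the three feature names and the four rows below are made-up toy values, and the original plot was drawn with Seaborn rather than printed as text.

```julia
using Statistics

# Hypothetical toy matrix: rows are tumors, columns are features
feature_names = ["radius", "perimeter", "smoothness"]
X = [10.0  63.0 0.11;
     12.0  75.5 0.09;
     15.0  94.0 0.10;
     20.0 126.0 0.12]

# cor on a matrix returns the pairwise Pearson correlations between its columns
C = cor(X)

# Print the matrix row by row; a heatmap colors exactly these cells
for (i, name) in enumerate(feature_names)
    println(rpad(name, 12), join(round.(C[i, :], digits=2), "  "))
end
```

The matrix is symmetric with ones on the diagonal, and closely tied features such as radius and perimeter show up as cells close to 1.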

Pairplot

Seaborn is a powerful package even in Julia.

Visualizing some features against each other with respect to Diagnosis:

With these plots we confirm that the fields are highly related to the diagnosis.

Results of the Machine Learning Models

**Random Forest (Selective features) vs Random Forest (Complete features)**

**Decision Tree (Selective features) vs Decision Tree (Complete features)**

**SVM (Selective features) vs SVM (Complete features)**

**AdaBoost (Selective features) vs AdaBoost (Complete features)**

**Logistic Regression (Selective features) vs Logistic Regression (Complete features)**

Conclusion

**Model with selected features (Selective) vs model with all fields/columns as features (Complete)**

With every classifier except Decision Tree, the model trained on all the fields performs better than the one trained on only the highly correlated fields.
All the models perform well on the selective feature set, while only a few of them perform well on the complete one.
AdaBoost, SVM and Random Forest appear to be the most reliable algorithms, consistently delivering accuracies of 95% to 99% on both feature sets.
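For readers new to the metric, accuracy is the share of test samples whose predicted label matches the true one. The sketch below computes it in plain Julia, along with the two kinds of mistakes a tumor classifier can make. The ten labels are hypothetical and are not results from the models above; the helper name `accuracy` is also an assumption.

```julia
# Accuracy = share of predictions that match the true labels
accuracy(y_true, y_pred) = count(y_true .== y_pred) / length(y_true)

# Hypothetical labels for ten test tumors (not the article's results)
y_true = ["M", "B", "B", "M", "B", "B", "M", "B", "B", "B"]
y_pred = ["M", "B", "B", "B", "B", "B", "M", "B", "M", "B"]

acc = accuracy(y_true, y_pred)   # 8 of 10 correct

# Malignant tumors predicted as benign (the costlier mistake in this setting)
false_negatives = count((y_true .== "M") .& (y_pred .== "B"))
# Benign tumors predicted as malignant
false_positives = count((y_true .== "B") .& (y_pred .== "M"))

println("accuracy: ", acc)
println("false negatives: ", false_negatives, ", false positives: ", false_positives)
```

Accuracy alone hides which kind of mistake was made, so it is worth checking the false negatives as well when comparing the selective and complete feature sets.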

Github repo: https://github.com/HarshitSati/Breast_cancer_Julia

References and Further reading

[1] Division of Cancer Prevention and Control, Centers for Disease Control and Prevention. https://www.cdc.gov/cancer/breast/basic_info/index.html
[2] WHO
[3] William H. Wolberg, W. Nick Street, Olvi L. Mangasarian (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository, University of Wisconsin. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29