Predicting Breast Cancer with Machine Learning

Sophia Yang
Oct 12, 2023

Updated: Apr 10

Breast cancer is a highly prevalent and potentially life-threatening disease that affects a significant number of individuals globally. As indicated by the World Health Organization (WHO) in 2020, breast cancer stands out as the most common cancer among women, with an alarming estimated 685,000 deaths in that year alone (WHO, 2020). The disease is notably complex and heterogeneous, presenting diverse tumor characteristics, which poses significant challenges for accurate diagnosis and appropriate treatment selection.

For many decades, the conventional approach to tumor classification has heavily relied on histopathological analysis. In this process, pathologists meticulously examine biopsy samples under a microscope to determine crucial tumor characteristics. Although this method has been considered the gold standard, it inherently suffers from subjectivity and can be influenced by the pathologist's experience and expertise, resulting in potential inter-observer variability in diagnosis (Snead, 2016). However, in recent years, there has been a burgeoning interest in exploring the application of machine learning algorithms to aid in the analysis of tumor characteristics in breast cancer cases. The integration of machine learning holds tremendous promise in this context, with the primary objective of enhancing tumor classification accuracy, mitigating diagnostic errors, and providing more personalized treatment strategies for breast cancer patients. The central focus of this research is to delve into the practical implementation of machine learning algorithms in analyzing tumor characteristics in breast cancer cases, with the ultimate goal of enabling precise classification between benign and malignant tumors.

Introduction

Breast cancer continues to pose a formidable challenge on a global scale, affecting the lives of millions and demanding concerted efforts in research and healthcare to advance diagnosis and treatment outcomes. Precise tumor classification, which differentiates between benign and malignant tumors, is a cornerstone of successful breast cancer management. Benign tumors are non-cancerous growths that do not spread to other parts of the body and are generally considered less harmful. Malignant (cancerous) tumors, on the other hand, can invade surrounding tissues and metastasize to distant organs, which makes early and accurate identification of malignancy critical for appropriate treatment and improved patient outcomes.

Machine learning, a subset of artificial intelligence (AI), enables computers to learn from data, discern patterns, and make data-driven predictions or decisions. Its versatility has led to widespread adoption across many fields, including medical research and healthcare. Machine learning algorithms are well suited to medical applications, especially those involving large and intricate datasets, such as patient records describing tumor characteristics, that require careful analysis for accurate diagnosis and treatment planning. These algorithms excel at identifying hidden patterns, relationships, and trends that human experts may struggle to discern because of the complexity and interdependencies within the data. In breast cancer research, machine learning algorithms have considerable potential to improve tumor classification, a pivotal aspect of diagnosis and treatment. By leveraging diverse datasets, these algorithms can extract relevant features and construct predictive models that differentiate between benign and malignant breast tumors.

Methodology

The model was trained on a breast cancer dataset comprising 570 samples, each representing a breast tumor described by multiple tumor characteristics. These characteristics include radius mean, texture mean, perimeter mean, area mean, smoothness mean, compactness mean, concavity mean, concave points mean, symmetry mean, and fractal dimension mean. The dataset also includes standard error measurements for these characteristics, along with the "worst" or largest mean values for each feature. The primary research question is whether machine learning algorithms, with a specific emphasis on neural networks, can produce predictive models that accurately classify tumors as either benign or malignant. The implementation is coded in Python and uses TensorFlow to create the neural network model.
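To make the shape of the feature set concrete, here is a minimal sketch that builds one column name for each combination of characteristic and statistic. The naming scheme (for example "radius_mean") is an assumption for illustration only; the headers in the actual CSV file may differ. If the ten characteristics listed above are each reported three ways, this gives 30 input features per tumor.

```python
# Hypothetical column naming; the real CSV headers may be different.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
STATISTICS = ["mean", "se", "worst"]  # mean, standard error, "worst" value


def build_feature_names(base=BASE_FEATURES, stats=STATISTICS):
    """Return one column name per (statistic, characteristic) pair."""
    return [f"{name}_{stat}" for stat in stats for name in base]


feature_names = build_feature_names()
print(len(feature_names))  # 30
print(feature_names[:3])
```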

The dataset is preprocessed by separating the input features (x) from the target labels (y), where malignant tumors are denoted as 1 and benign tumors as 0. To assess the model's performance, the data is split into training (80%) and testing (20%) sets. During model compilation, the optimizer is specified as "adam" for efficient convergence during training, the loss function as "binary_crossentropy", which is suited to binary classification problems, and the evaluation metric as "accuracy". The model is trained for 1000 epochs; in each epoch it adjusts its internal weights to minimize the loss function, with the aim of improving its accuracy in classifying tumors as benign or malignant on the training data.

To focus solely on the tumor characteristics relevant to breast cancer classification, any data unrelated to tumor features is removed. This results in a refined dataset containing only the tumor features essential for our analysis. Furthermore, to facilitate effective model training and prevent bias towards specific features, we apply min-max normalization to the dataset, scaling all feature values to lie within a common range between 0 and 1. This ensures that the model considers all tumor characteristics on equal footing, so that no feature dominates the classification process solely because of its scale.
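The split and the rescaling described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code used in the study. The helper names and the toy data are made up, and fitting the minimum and maximum on the training rows only is a design choice of this sketch, made so that no information from the test set leaks into preprocessing.

```python
import random


def train_test_split_rows(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then split rows and labels into two parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def fit_min_max(rows):
    """Record the per-column minimum and maximum of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Map each value to (v - min) / (max - min); constant columns become 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Made-up toy data: two features on very different scales.
rows = [[10.0, 200.0], [12.0, 400.0], [14.0, 600.0], [20.0, 1000.0], [16.0, 800.0]]
labels = [0, 0, 1, 1, 1]

x_train, x_test, y_train, y_test = train_test_split_rows(rows, labels)
mins, maxs = fit_min_max(x_train)            # statistics from training data only
x_train_scaled = apply_min_max(x_train, mins, maxs)
x_test_scaled = apply_min_max(x_test, mins, maxs)
```

After scaling, every training column spans exactly 0 to 1. Test values can fall slightly outside that range, because the minimum and maximum come from the training rows. In practice a library scaler such as scikit-learn's MinMaxScaler performs the same calculation.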

Appendix

# Import necessary libraries

import pandas as pd

from sklearn.model_selection import train_test_split

import tensorflow as tf

# Load the breast cancer dataset from a CSV file

dataset = pd.read_csv('breast cancer.csv')

# Extract the input features (x) and target labels (y) from the dataset. The target labels represent the diagnosis: 1 for malignant and 0 for benign

x = dataset.drop(columns=["diagnosis(1=m, 0=b)"])

y = dataset["diagnosis(1=m, 0=b)"]

# Split the data into training and testing sets using the train_test_split function

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Create a Sequential model using the Keras API from TensorFlow. The Sequential model is a linear stack of layers, where data flows sequentially through each layer

model = tf.keras.models.Sequential()

# Add layers to the model. The first Dense layer has 256 units/neurons, and the input shape is determined by the number of features in x_train. The activation function used is 'sigmoid', which squashes each unit's output to a value between 0 and 1

model.add(tf.keras.layers.Dense(256, input_shape=x_train.shape[1:], activation='sigmoid'))

# Another Dense layer with 256 units and 'sigmoid' activation function

model.add(tf.keras.layers.Dense(256, activation='sigmoid'))

# Final Dense layer with 1 unit and 'sigmoid' activation function, representing the output layer for binary classification

model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# Compile the model by specifying the optimizer, loss function, and evaluation metrics. The 'adam' optimizer is an adaptive learning rate optimization algorithm. 'binary_crossentropy' is the loss function used for binary classification problems. The model will also track the accuracy metric during training

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model on the training data for 1000 epochs (iterations over the entire training data)

# During training, the model will learn to minimize the loss function and improve classification accuracy

model.fit(x_train, y_train, epochs=1000)

# Evaluate the model's performance on the test data using the evaluate method

# The evaluation_results will contain the loss value and accuracy achieved on the test data

evaluation_results = model.evaluate(x_test, y_test)
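To show what the stacked Dense layers in the listing above compute, here is a minimal plain-Python sketch of a forward pass through sigmoid layers. It is not the TensorFlow implementation: the weights are made up, the network is a tiny 2 -> 3 -> 1 stand-in for the article's two 256-unit layers and single output unit, and the 0.5 cutoff used to turn the output into a label is the conventional default, assumed here.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


def dense(inputs, weights, biases):
    """One fully connected layer: sigmoid(w . x + b) for each unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Made-up weights for illustration; a trained model learns these values.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[1.0, -1.5, 0.7]], [0.2]),                                  # output layer
]

probability = forward([0.6, 0.9], layers)[0]      # value between 0 and 1
predicted_label = 1 if probability >= 0.5 else 0  # 1 = malignant, 0 = benign
print(probability, predicted_label)
```

The single output value can be read as the model's estimated probability that a tumor is malignant, which is why the final layer has one sigmoid unit.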

Results

Accuracy is a fundamental evaluation metric for assessing the performance of machine learning models. It quantifies the proportion of correctly classified samples, indicating how well the model predicts both benign and malignant tumors among the total number of samples evaluated. In our study, the model achieved an accuracy of 92.11%, demonstrating its proficiency in classifying breast tumors as either benign or malignant. The loss value obtained during evaluation was 0.29. In machine learning, loss quantifies the discrepancy between the model's predicted values and the actual ground truth values in the dataset. A lower loss value indicates a better-performing model, as it suggests that the model's predictions align more closely with the true labels.
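The two metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the TensorFlow code path: the sample labels and probabilities are made up, and the 0.5 decision threshold and the small clipping constant eps are assumptions of this example.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(int(pred == true) for pred, true in zip(predictions, y_true))
    return correct / len(y_true)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for true, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(true * math.log(p) + (1 - true) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up labels and predicted probabilities of malignancy.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]

print(accuracy(y_true, y_prob))  # 0.75
print(binary_cross_entropy(y_true, y_prob))
```

The third sample is a malignant tumor given a probability of only 0.4, so it counts as one error for accuracy and also contributes the largest share of the loss. Accuracy only registers whether a prediction lands on the right side of the threshold, while the loss also reflects how confident each prediction was.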

Conclusion

In summary, this research study has focused on the application of machine learning algorithms for breast cancer tumor classification. Using a carefully curated dataset containing essential tumor characteristics, the study evaluated a neural network model's performance in distinguishing between benign and malignant tumors. The results demonstrated the effectiveness of the machine learning approach, with an accuracy score of 92.11% during the evaluation. This level of accuracy highlights the model's proficiency in classifying breast tumors and indicates its potential clinical relevance in assisting with breast cancer diagnosis. This study underscores the significant potential of machine learning algorithms as valuable tools in medical research and healthcare, particularly in the context of breast cancer diagnosis. By providing accurate and timely tumor classification, these technologies can have a tangible impact on the lives of breast cancer patients and the broader medical community. Moving forward, continued research and collaboration between machine learning experts, medical professionals, and researchers are essential to further refine and implement these technologies in clinical practice. Such efforts hold the promise of enabling more precise and effective approaches to breast cancer diagnosis and treatment, ultimately leading to improved patient outcomes and advancing the fight against breast cancer.

Author: Preesha Juyal is a seventeen-year-old living in the United States with a love for her dog, Biscuit, shopping, and, obviously, technology.

References

Snead, David R J, et al. “Validation of Digital Pathology Imaging for Primary Histopathological Diagnosis.” Histopathology, vol. 68, no. 7, 6 Dec. 2015, pp. 1063–1072.
WHO. “Breast Cancer.” Www.who.int, World Health Organization, 26 Mar. 2023.
