Cervical Cancer Risk Factors: Exploratory Analysis Using HPCC Systems
Cervical cancer is a leading cause of cancer-related death among women, with about half a million new cases worldwide in 2018 (WHO, 2018). Of all cervical cancer deaths, 90% occur in low-resource settings. This mortality could be reduced through effective prevention, screening, and treatment programs. Human papillomavirus (HPV) vaccination reduces cervical cancer risk, but not all populations have access to it. During HPCC Systems Tech Talk 24, Itauma Itauma, Ph.D., presented an exploratory analysis of a cervical cancer dataset using HPCC Systems and Data Visualizations. To access the Tech Talk, please use the following link: Cervical Cancer Risk Factors: Exploratory Analysis Using HPCC Systems. The findings from this analysis could be beneficial in resource-scarce settings with limited access to cervical cancer screenings and HPV vaccinations.
In this blog, we will give a high-level overview of Exploratory Data Analysis (EDA) and Data Visualization, and examine how these tools were used to synthesize and present data for a cervical cancer study.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to data analysis that reveals the underlying structure of a dataset before formal modeling begins. Patterns, trends, and outliers are examined to determine next steps and areas of research. EDA is an important first step in data analysis.
Data Visualizations are an important means of conveying information from a large amount of data. Representing the information with pie charts, line graphs, maps, and other visual formats simplifies complex data and helps produce actionable analysis.
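As a rough illustration of what a first EDA pass can look like, the Python sketch below summarizes a single numeric column and flags possible outliers with the common 1.5 x IQR rule. This is not the code used in the Tech Talk. The function name summarize and the sample ages are made up for illustration, and the outlier rule is only one of several reasonable choices.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics and IQR-based outliers for a numeric list."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "median": median,
        "outliers": [v for v in ordered if v < low or v > high],
    }


# Made-up ages for illustration only; not taken from the study dataset.
ages = [18, 21, 23, 25, 27, 29, 31, 34, 36, 84]
summary = summarize(ages)
print(summary)
```

A summary like this is usually the starting point for deciding which variables deserve a chart or a closer look.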
HPCC Systems offers an open-source data visualizer add-on to the HPCC platform that allows creation of Data Visualizations from the results of queries written in ECL (Enterprise Control Language). To find out more about the HPCC Systems Visualizer Bundle, please use the following link: https://github.com/hpcc-systems/Visualizer.
I also recommend reading the blog post, The HPCC Systems Visualizer, by Lorraine Chapman. This blog provides in-depth information about the HPCC Systems Visualizer Bundle.
Below is an example of Data Visualizations using the HPCC Systems Visualizer:
For more information on Exploratory Data Analysis (EDA) and Data Visualization, see Itauma Itauma’s presentation from Tech Talk 12 using the following link: Conducting Exploratory Data Analysis in Educational Research Using HPCC Systems.
Now, let’s take a look at how Exploratory Data Analysis (EDA) and Data Visualizations were applied during a cervical cancer study.
Cervical Cancer Study Objective
The objective for this study was to explore risk and protective factors for cervical cancer using HPCC Systems. Factors such as age, HPV, cigarette smoking, and IUD use were examined.
For this study, Exploratory Data Analysis (EDA) and Data Visualization techniques were used to understand the associations between variables related to cervical cancer. The source data was a dataset containing the de-identified medical records of 858 randomly selected females who were patients at the Hospital Universitario de Caracas, Venezuela, between 2012 and 2013. This dataset is publicly available on the Machine Learning Repository website of the University of California Irvine: Dataset – Risk Factors for Cervical Cancer – UCI Irvine.
The source dataset includes the following information:
- Patient’s age
- Number of pregnancies
- Smoking history
- Contraceptive history
- STD history, including HPV
- Cervical Intraepithelial Neoplasia (CIN) diagnosis
- Cervical Cancer diagnosis
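Before exploring fields like these, it helps to check how complete each one is. The Python sketch below reads a small CSV excerpt and counts missing cells per column. The column names, the "?" missing-value marker, and the three rows are assumptions made up for illustration, not an excerpt of the actual file, and the study itself loaded the data onto an HPCC Systems cluster rather than using Python.

```python
import csv
import io

# Hypothetical excerpt: the column names and the "?" missing-value marker are
# assumptions about the file layout, and the rows are made up.
SAMPLE_CSV = """Age,Num of pregnancies,Smokes (packs/year),IUD (years),Dx:HPV,Dx:Cancer
18,1,0,0,0,0
34,?,1.2,?,1,1
42,3,0,7,0,0
"""


def to_number(text):
    """Convert a cell to float, mapping blank or '?' cells to None."""
    text = text.strip()
    if text in ("", "?"):
        return None
    return float(text)


def load_records(handle):
    """Read CSV rows into dictionaries of column name -> float or None."""
    reader = csv.DictReader(handle)
    return [{name: to_number(cell) for name, cell in row.items()} for row in reader]


def missing_counts(records):
    """Count missing cells per column."""
    counts = {}
    for record in records:
        for name, value in record.items():
            # A boolean adds as 0 or 1, so this increments only for missing cells.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


records = load_records(io.StringIO(SAMPLE_CSV))
print(missing_counts(records))
```

Columns with many missing values may need to be excluded or handled separately before any association is interpreted.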
The dataset was sprayed (loaded) to the HPCC Systems cluster for analysis, and a web connection was made to Tableau for Data Visualization. The HPCC Systems website has a blog post about the HPCC Systems/Tableau Web Data Connector: HPCC Systems/Tableau Web Data Connector v0.2 Tech Preview.
Cervical Cancer Study Results – Tableau
The results of the cervical cancer study are represented in a series of graphs and charts, using Tableau.
The charts below show the age range of the patients and the correlation between cervical cancer and cigarette smoking. The Data Visualizations show the following:
- The ages of the patients in the dataset range from 13 to 84 years.
- Cervical cancer cases occurred in patients from 19 to 52 years old. This is consistent with published findings that show that cervical cancer is rare in patients younger than age 15.
- Regarding cigarette smoking, the incidence of cervical cancer was higher among patients with a higher average pack/year of smoking.
- HPV infection was found in nearly all cases of cervical cancer with greater than one pack/year of smoking.
The following charts and graphs illustrate the impact of IUDs on HPV infection and cervical cancer. The Data Visualizations show the following:
- According to the histogram, longer duration of IUD use appears to protect against cervical cancer and HPV infection.
- Orange represents cases with HPV. The larger circles depict cervical cancer cases.
- HPV infections were noted in cervical cancer cases with a shorter duration of IUD use.
- The scatter plot shows that the longer the use of IUDs, the lower the number of STDs noted.
- No cervical cancer cases were seen with greater than 10 years of IUD use.
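A histogram view like the one described above comes down to grouping patients into bins by years of IUD use and counting cases in each bin. The Python sketch below shows that grouping under stated assumptions: the records are made up for illustration, the five-year bin width is arbitrary, and cases_by_bin is a hypothetical helper, not part of HPCC Systems or Tableau.

```python
from collections import defaultdict

# Made-up records for illustration only; not the study data.
# Each tuple is (years of IUD use, cancer diagnosis flag: 1 = yes, 0 = no).
records = [(0, 1), (0, 0), (2, 1), (3, 0), (6, 0), (8, 1), (11, 0), (12, 0)]


def cases_by_bin(rows, width=5):
    """Group rows into half-open bins of `width` years.

    Returns a dict mapping a bin label to (patients, cancer cases, case share),
    ordered by the start of each bin.
    """
    totals = defaultdict(lambda: [0, 0])  # bin start -> [patients, cases]
    for years, has_cancer in rows:
        start = int(years // width) * width
        totals[start][0] += 1
        totals[start][1] += has_cancer
    return {
        f"{start} to <{start + width}": (n, cases, cases / n)
        for start, (n, cases) in sorted(totals.items())
    }


result = cases_by_bin(records)
for label, (patients, cases, share) in result.items():
    print(f"{label} years: {cases} of {patients} patients ({share:.0%})")
```

With few patients in the longer-duration bins, a count of zero cases is suggestive rather than conclusive, which is why the post describes the protective effect in hedged terms.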
This study on cervical cancer demonstrated how Exploratory Data Analysis (EDA) and Data Visualizations allow important information to be extracted from large amounts of data. Health analytics can be used to improve quality of life, especially for those who do not have access to adequate healthcare. Using Exploratory Data Analysis (EDA) and Data Visualization to identify correlations between variables in health-related data helps identify risk factors and other vital information essential to the development of prevention, screening, and treatment programs.
In this study, the findings from the Data Visualizations were consistent with published literature on cervical cancer. According to this study:
- More cases of cervical cancer were seen with a higher average pack/year of smoking and HPV infection.
- For contraception, IUD use may be beneficial, and possibly provides additional protection against cervical cancer.
- HPV infection has the strongest effect on the risk of cervical cancer.
About Itauma Itauma
Itauma Itauma has a Ph.D. in Instructional Design and Technology from Keiser University and is a student of the Harvard Business Analytics Program. His interests lie in education analytics, health analytics, and promoting diversity in STEM. He has an undergraduate degree in Electrical Engineering from the University of Ilorin, a Master of Science in Computer Engineering from Istanbul Technical University, majoring in human-robot interaction, and a Master of Science in Computer Science from Wayne State University where his thesis was based on leveraging HPCC Systems for Big Data analytics.
Summary of Link References
Cervical Cancer Risk Factors: Exploratory Analysis Using HPCC Systems – Link to Tech Talk 24
https://github.com/hpcc-systems/Visualizer – HPCC Systems Visualizer Bundle
The HPCC Systems Visualizer – Blog post on the HPCC Systems Visualizer
Conducting Exploratory Data Analysis in Educational Research Using HPCC Systems – Tech Talk 12 Presentation
Dataset – Risk Factors for Cervical Cancer – UCI Irvine – Dataset available on the Machine Learning Repository website of the University of California Irvine
HPCC Systems/Tableau Web Data Connector v0.2 Tech Preview – Blog post on the HPCC Systems/Tableau Web Data Connector v0.2 Tech Preview