Using Machine Learning to Detect Polycystic Ovary Syndrome

Leveraging artificial intelligence to detect the leading cause of infertility, a condition which millions of women don’t know they have.

Polycystic Ovary Syndrome.

Three words that have impacted the lives and futures of millions of women and girls worldwide.

Polycystic Ovary Syndrome, or PCOS, is a condition that not only affects a woman’s ovaries, but the outcome of her life.

PCOS doesn’t just affect a select group of individuals. An estimated 6 to 12% of women of reproductive age have PCOS, which amounts to as many as 5 million women in the United States alone.

PCOS is the most common endocrine disorder among women of childbearing age and is the leading cause of infertility around the world.

But to understand what PCOS is, we need to take a step back and look at the female reproductive system.


The female reproductive system itself undergoes several processes in order to prepare and protect a woman’s body for reproduction. The reproductive system includes the vagina, cervix, uterus, and two fallopian tubes, each of which extends to one of the ovaries.

Within the ovaries are the follicles, which secrete hormones that influence the various stages of the menstrual cycle.

Depending on the levels of four hormones, namely the follicle-stimulating hormone (FSH) and the luteinizing hormone (LH) produced by the pituitary gland, and the estrogen and progesterone produced by the ovaries, an individual will be in a different stage of their menstrual cycle.

Menstruation: The 3–7 days of your period, in which the top layers of the endometrium (lining of the uterus) are shed. During menstruation, all four hormones are at their lowest levels.

Follicular Phase: The 7 to 10 days between the start of the menstrual cycle (with the onset of menstruation) and ovulation. The production of the Follicle Stimulating Hormone is slightly increased, which stimulates the growth of between 3 and 30 follicles. As the FSH declines later in the phase, only one follicle (the dominant follicle) continues to grow and soon starts to produce estrogen, which begins to prepare the uterus and initiate the surge of LH.

Ovulation: The release of an egg from the ovaries, traveling through the fallopian tubes. Takes place 3–4 days following the follicular phase. The LH surges and stimulates the dominant follicle to rupture, releasing the egg. Egg release occurs randomly between both ovaries.

Luteal Phase: The 10–14 days between ovulation and the start of menstruation, assuming that the egg has not been fertilized. The ruptured follicle closes after releasing the egg and forms a structure called a corpus luteum, which produces increasing quantities of progesterone. The progesterone causes the thickening of the uterine lining. The estrogen also rises during this stage and also helps thicken the uterine lining.


This is the essence of what happens in a woman’s body nearly every single month during her reproductive years.

However, PCOS disrupts this essential biological process.

The ovaries don’t just produce estrogen and progesterone; they also produce small amounts of androgens, or male sex hormones. These include testosterone, dehydroepiandrosterone sulfate (DHEAS), dehydroepiandrosterone (DHEA), androstenedione, and androstenediol, although women often convert a majority of these hormones, especially testosterone and androstenediol, into estrogen.

PCOS occurs when the ovaries produce an abnormal amount of these androgens, causing the development of numerous small cysts, or fluid-filled sacs, within the ovaries. While some women with the condition do not develop cysts, many do.


In some cases, women also do not make enough of the hormones needed for ovulation. This can also be the cause for the development of cysts within the ovaries. The cysts produce androgens. The presence of these androgens can cause more problems with a woman’s menstrual cycle, in addition to other side effects including infertility, weight gain, and acne.

Yet, PCOS is not limited to the ovaries, or even to the matter of fertility.

Women with PCOS are at a higher risk for insulin resistance, high blood sugar, obesity, high cholesterol, high blood pressure, Type 2 diabetes, and cardiovascular disease.

One of the biggest issues with PCOS is that no one woman with PCOS looks or even experiences the same things as another.

A common notion is that PCOS only affects women who are overweight or who struggle with a related disease or one of the symptoms of the condition.

But there is no template for what a woman with PCOS looks like or experiences.

PCOS affects women of all ethnicities, backgrounds, and sizes. It affects everyone from models and athletes to oppressed populations, from 16-year-old girls to 45-year-old women. It affects women with cysts who are infertile and experience physical side effects like weight gain or acne, as well as women without cysts who don’t experience any physical symptoms other than irregular menstrual cycles.

Despite the range of issues caused by PCOS and the prevalence of the condition, PCOS can be difficult to diagnose because some of its symptoms have a variety of potential causes. Heavy menstrual bleeding and other irregularities could be caused by a range of conditions, such as uterine fibroids, polyps, bleeding disorders, certain medications, or pelvic inflammatory disease, in addition to PCOS.

Even though it has been over 80 years since the condition was discovered, there is still no definitive test to diagnose it, no cure, and no total agreement on what causes it.

Woman after woman shares her story of being misunderstood, misdiagnosed, and turned away when seeking help. More than 50% of women with PCOS remain officially undiagnosed and unable to receive the help they need.

This is where machine learning comes in.

By leveraging machine learning, a subset of artificial intelligence, we can create a model that can take in a patient’s information and detect if they have PCOS.

It sounds too good to be true, right?

Well, it might not be so far in our future.

Over the past few days, I built a machine learning model that can take in various metrics of the patient, from their Body Mass Index (BMI), to their blood pressure, to concentrations of certain hormones. Using this information, the model can detect PCOS in the patient with a simple “yes” or “no,” output as a “1” or a “0.”

Here’s how I did it…

1. Importing the Libraries and Datasets

The first step to building our model is to import our libraries and datasets into our Google Colab notebook.

pandas: The most popular Python library for data manipulation and analysis. In this project, it is primarily useful for dataframe manipulation.

NumPy: A Python library that provides support for large, multi-dimensional arrays and matrices, and has high-level mathematical functions to help operate on and manipulate these arrays.

matplotlib.pyplot and seaborn: Used for data visualization.

We can start the project by making sure we have installed the latest version of seaborn, which will be used for data visualization.

Then, we can go ahead and import our libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Once we have imported our libraries, we can import our dataset.

However, before importing our data, we need to download it and remove any unnecessary columns, making it much easier to use once it is downloaded.

The dataset I used for this project contains data from 541 patients across 10 hospitals in Kerala, India. The data contains over 40 physical and clinical parameters for each patient, which will be used as our inputs.

Once we have downloaded the data as a CSV file, we can open the data that shows us the parameters of patients without infertility, as that is the dataset we will be using.

Once we have opened the dataset, we can see the range of features, or columns, we have, from the patient’s age to the concentration of FSH within their bloodstream, obtained from a blood test.

We can also get rid of some of these columns, making the data easier to use once we have imported it into our notebook. As we can see, the columns titled “Sl. No” and “Patient File No.” will not serve any purpose, as they are only identifiers and don’t need to be included within the input, so we can delete both of these columns.

The first column on the dataset should be “PCOS (Y/N),” which will be the output of our model. The remaining columns will be the inputs.

We can also see that the last column on the dataset is completely empty, and we can remove that column as well.

Once we have done so, we can save this new file as a CSV file and import this data into our notebook, and run the code.

from google.colab import files
uploaded = files.upload()
pcos = pd.read_csv('PCOS_no_infertility.csv')
pcos.head(15)

The last line of this block will print the first 15 rows of data, Row 0 to Row 14, allowing us to see the features of the data within the dataframe, often shortened to df.
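The manual spreadsheet cleanup described above could also be done in pandas. Here is a minimal sketch on a tiny made-up dataframe. The values are invented, and the "Age (yrs)" and "Unnamed: 44" column names are assumptions for illustration, so the real file may look different.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the raw spreadsheet; all values are made up
raw = pd.DataFrame({
    "Sl. No": [1, 2, 3],
    "Patient File No.": [10001, 10002, 10003],
    "PCOS (Y/N)": [0, 1, 0],
    "Age (yrs)": [28, 36, 33],                  # hypothetical feature column
    "Unnamed: 44": [np.nan, np.nan, np.nan],    # hypothetical empty trailing column
})

# Drop the two identifier columns, then any column that is entirely empty
cleaned = raw.drop(columns=["Sl. No", "Patient File No."])
cleaned = cleaned.dropna(axis=1, how="all")

print(cleaned.columns.tolist())
```

Doing the cleanup in code rather than by hand means it can be rerun unchanged if the dataset is downloaded again.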

2. Exploratory Data Analysis

Once we have imported the necessary libraries and our dataset, we can conduct an exploratory data analysis to further clean up our data by correcting errors, reformatting, and removing missing values.

To begin, we want to set df = pcos, which will allow us to work with df instead of pcos when exploring our dataframe.

We also want to use the labels within Row 0 as our column titles, instead of the “Unnamed: #” titles we have now.

new_header = df.iloc[0]
df = df[1:]
df.columns = new_header

Running this code allows us to replace the column titles with the labels in the first row of the dataset. We can now run df to see our new dataset.
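To see what this header swap does in isolation, here is a small sketch on a made-up dataframe whose real labels sit in Row 0. The column names and values are invented for illustration, and the final reset_index call is an optional extra that is not part of the steps above.

```python
import pandas as pd

# Made-up frame whose real labels sit in row 0
df_demo = pd.DataFrame(
    [["PCOS (Y/N)", "Age (yrs)"], ["0", "28"], ["1", "36"]],
    columns=["Unnamed: 0", "Unnamed: 1"],
)

new_header = df_demo.iloc[0]               # first row holds the labels
df_demo = df_demo[1:]                      # keep everything below it
df_demo.columns = new_header               # promote the labels to column titles
df_demo = df_demo.reset_index(drop=True)   # optional: restart the index at 0

print(df_demo)
```

Note that the values are still text at this point, because each column originally mixed a label with its numbers.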

Here, we can also see the number of rows and columns we have. This shows us that we have 541 rows, one for each of the patients, and 42 columns, 41 of which are our input features.

Then, we want to go ahead and start looking for null values within our data. In order to check for null values, we need to run df.isnull().sum(), which lists the number of null values in each column.

Here we can see that we have a couple of null values scattered throughout the dataset. We can eliminate them by running df = df.dropna() in the next code block, which will drop every row that contains a null value.
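As a quick illustration of these two calls, here is a sketch on a small made-up dataframe (the values are invented and only stand in for the real data):

```python
import numpy as np
import pandas as pd

# Made-up data with two missing entries
demo = pd.DataFrame({
    "PCOS (Y/N)": [0, 1, 0, 1],
    "BMI": [19.3, np.nan, 25.3, 29.7],
    "Weight (Kg)": [44.6, 65.0, 68.8, np.nan],
})

null_counts = demo.isnull().sum()   # number of missing values per column
print(null_counts)

demo_clean = demo.dropna()          # drop every row with at least one null
print(len(demo), "rows before,", len(demo_clean), "rows after")
```

Because dropna removes whole rows, it is worth comparing the row count before and after to see how many patients were lost.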

Once we have dropped the null values, we want to see the datatypes of the columns within our dataset (for example, by inspecting df.dtypes), which shows us that all of our data is an object datatype. However, we need to convert our data into numerical values for our model to be able to process it.

In order to convert our data into a numerical datatype, we can run a loop to convert all of the values into numerical values, either int64 or float64.

for column in df.columns:
    # Entries that cannot be parsed as numbers become NaN instead of raising an error
    df[column] = pd.to_numeric(df[column], errors='coerce')

Then, when we check the datatypes again, we can see that all of our data has turned into numerical values.
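One thing to watch for: errors='coerce' turns any entry that cannot be parsed into a null value, so new nulls can appear after this step. The sketch below shows this on made-up data. The "AMH(ng/mL)" column name and the stray "a" are assumptions for illustration, and the second dropna is a suggested extra step rather than part of the original notebook.

```python
import pandas as pd

# Made-up text data; "a" cannot be parsed as a number
demo = pd.DataFrame({
    "PCOS (Y/N)": ["0", "1", "1"],
    "AMH(ng/mL)": ["2.07", "a", "6.63"],   # hypothetical column name
})

for column in demo.columns:
    demo[column] = pd.to_numeric(demo[column], errors="coerce")

print(demo.dtypes)

new_nulls = int(demo.isnull().sum().sum())   # nulls created by the coercion
demo = demo.dropna()                          # drop them before modelling
print(new_nulls, "value(s) could not be converted")
```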

Now that we have cleaned up our data, we can explore it through a series of plots.

Here, we can see the relationships between various features. For example, the plot on the bottom left shows us the relationship between the patient’s age and their BMI.

In this block of code, we have chosen to see the relationships between columns 1 and 5. However, we can increase or decrease these numbers to see visuals of other features within our data.

We can also visualize the distribution of data within certain features through a histogram.

3. Data Visualization

Once we have finished exploring our dataset and visualizing our dataframe, we can look at the correlations within our data.

First, we can find the correlations between the columns, or features, of our data by running df.corr().

In this grid, we can see the correlation between features. A number that is close to or equal to 1 means that there is almost a perfect correlation, while numbers equal to or close to 0 mean that there is nearly no correlation. Any negative values indicate an inverse correlation.

For example, we can see that the BMI and the weight of the individual are quite heavily correlated, with a value of 0.901719. However, the height of the individual and the length of their menstrual cycle have nearly no correlation with a value of 0.007512.
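To build some intuition for these numbers, here is a small sketch on synthetic data. The values are randomly generated, not taken from the dataset: BMI is computed from weight and height, so it correlates strongly with weight, while the cycle length is generated independently of height, so their correlation stays near zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height_m = rng.normal(1.60, 0.06, size=200)
weight_kg = rng.normal(60, 11, size=200)
cycle_days = rng.integers(2, 8, size=200)   # generated independently of height

demo = pd.DataFrame({
    "Weight (Kg)": weight_kg,
    "Height (m)": height_m,
    "BMI": weight_kg / height_m**2,          # BMI = weight / height squared
    "Cycle length (days)": cycle_days,
})

corr = demo.corr()   # pairwise Pearson correlation between columns
print(corr.round(2))
```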

We can also visualize these correlations by creating a heatmap.

corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot = True, fmt = ".2f")
plt.title("Correlation Between Features")

If a figsize is set (for example, through plt.figure), it establishes the size of the heatmap; annot = True shows us the numerical values inside of the heatmap; and fmt = ".2f" shows us the numerical correlation rounded to the hundredths place.
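As a sketch of how the figure size can be controlled, the example below creates the figure first and hands its axes to seaborn. The data is random, and the size of 6 by 5 inches is an arbitrary choice; with 40 or so features, a much larger figure would be needed for the annotations to stay readable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data with four placeholder features
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(30, 4)), columns=["A", "B", "C", "D"])

fig, ax = plt.subplots(figsize=(6, 5))   # width and height in inches
sns.heatmap(demo.corr(), annot=True, fmt=".2f", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Between Features")
```

Fixing vmin and vmax to -1 and 1 keeps the color scale comparable between heatmaps.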

4. Preparing Data Before Model Training

Now that we have cleaned up our data and visualized the correlations within the dataframe, we need to prepare the data before training the model.

First, we need to split the data into independent X and Y datasets. The first column is the output of the model, the diagnosis of the patient, while the remaining 41 columns are the features or inputs of the model.

X = df.iloc[:, 1:].values   # every column after the first is an input feature
Y = df.iloc[:, 0].values    # the first column, PCOS (Y/N), is the output

Once we have established the X and Y datasets, we can split 70% of the data into training data, and 30% of it into testing data.

The training data is simply used to train the model. We just feed the data to the model, so that it can learn the relationships between the inputs and the outputs.

The testing data is used after the model is trained. After we have finished the training and iterations, we feed the testing data to the model. The model will never have seen the testing data during training.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

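Now, we need to scale our data before feeding it to the algorithm. Scaling the data, or feature scaling, simply means putting all of our features on a comparable scale, so that features measured in large units do not dominate those measured in small ones. The StandardScaler does this by standardizing each feature to a mean of 0 and a standard deviation of 1.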

To scale our data, we can import the StandardScaler from sklearn and pass along our training and testing data.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training data only, then reuse it on the testing data
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
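To see what the scaler actually does, here is a small sketch on made-up numbers. It follows the common convention of learning the mean and standard deviation from the training data only and reusing them for the testing data, so that no information from the testing data leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales
X_train_demo = np.array([[20.0, 150.0], [30.0, 160.0], [40.0, 170.0]])
X_test_demo = np.array([[30.0, 180.0]])

sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)   # learns mean and std from training data
X_test_scaled = sc_demo.transform(X_test_demo)         # reuses the training statistics

print(X_train_scaled.mean(axis=0))   # approximately 0 for each feature
print(X_train_scaled.std(axis=0))    # approximately 1 for each feature
print(X_test_scaled)
```

The scaled testing values are not guaranteed to have a mean of 0, and that is expected, since they are measured against the training statistics.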

5. Training and Evaluating the Model

Within this project, I compared two different classification algorithms, a logistic regression model and a random forest classifier, on their accuracy in detecting PCOS within an individual.

Logistic Regression: An algorithm that is used when the value of the target variable is categorical in nature. It is used when the data belongs to one class or another; in this case, it is classifying a patient as either having PCOS or not.

Random Forest Classification: An algorithm that operates by constructing a multitude of decision trees at training time. Each of the decision trees is built from a random sample of the data.

Both models follow a similar structure. We can import each of the two algorithms from scikit-learn, and set the random state to be zero. Within the Random Forest Classifier, we need to set criterion = 'entropy', which uses information gain to choose the splits at the nodes of the decision trees.

Furthermore, random_state = 0 should be set in both of the algorithms. Setting the random_state to a fixed value will guarantee that the same sequence of random numbers is generated each time the code is run. This makes the output of the model reproducible.

Then, we can run model = models(X_train, Y_train) to see the accuracy of both models on the training data.
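Here is a sketch of what a models function along these lines could look like. It is an illustration rather than the exact code from my notebook: the n_estimators value of 10 is an assumption, and the data is a synthetic stand-in generated with make_classification so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def models(X_train, Y_train):
    """Fit the classifiers and report their accuracy on the training data."""
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)

    # n_estimators=10 is an assumed value for illustration
    forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
    forest.fit(X_train, Y_train)

    print("Logistic Regression training accuracy:", log.score(X_train, Y_train))
    print("Random Forest training accuracy:", forest.score(X_train, Y_train))
    return log, forest


# Synthetic stand-in data; swap in the real X_train and Y_train
X_demo, Y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
model = models(X_demo, Y_demo)
```

Returning the fitted models lets us reuse them later, for example as model[0] and model[1], when evaluating on the testing data.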

As we can see, the Random Forest Classifier performed with 100% accuracy, and the Logistic Regression Model performed with about 92% accuracy. While these percentages are quite high right now, they will likely drop on the testing data, as the models have not seen the testing data and therefore could not learn from it.

In order to evaluate the models on the testing data, we need to import a confusion matrix. We use each model to predict on the testing data, and the confusion matrix then tells us the number of true positives, true negatives, false positives, and false negatives among those predictions.
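The evaluation step can be sketched as follows, again on synthetic stand-in data rather than the PCOS dataset. The variable names are placeholders, and only one model is shown to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 70/30 train/test split
X_demo, Y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.3, random_state=0)

clf = LogisticRegression(random_state=0).fit(X_tr, Y_tr)
Y_pred = clf.predict(X_te)           # the model makes the predictions

cm = confusion_matrix(Y_te, Y_pred)  # the matrix only summarizes them
tn, fp, fn, tp = cm.ravel()          # scikit-learn's layout for binary labels 0 and 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm)
print("Testing accuracy:", accuracy)
```

For a diagnostic task like this one, the false negatives are worth a close look on their own, since each one is a patient with PCOS whom the model missed.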

The Logistic Regression Model reached an accuracy of about 89.4% on the testing data, and the Random Forest Classifier reached an accuracy of about 87.6%.

By changing the number of estimators, experimenting with feature selection, and optimizing hyperparameters, we can potentially increase the accuracy rate of both models.
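One possible way to try different settings is a cross-validated grid search with scikit-learn's GridSearchCV. The sketch below is a starting point under stated assumptions, not something I ran for this project: the grid values are arbitrary, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; swap in the real training data
X_demo, Y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Arbitrary example values to search over
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_demo, Y_demo)

print(search.best_params_)
print(search.best_score_)
```

Because the search uses cross-validation within the training data, the testing data stays untouched until the final evaluation.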

By improving the performance of this model, while reducing the number of features necessary for the model to make an accurate prediction, we can hopefully utilize machine learning to help women with PCOS receive a diagnosis and treatment.

Check out my code for the project here and the video I made explaining the code in detail.

16 y/o working in healthtech & women’s health | a collection of my thoughts |