Using Machine Learning to Detect Polycystic Ovary Syndrome

Leveraging artificial intelligence to detect the leading cause of infertility, a condition which millions of women don’t know they have.

PCOS is the most common endocrine disorder among women of childbearing age and is the leading cause of infertility around the world.

But to understand what PCOS is, we need to take a step back and look at the female reproductive system.


Even though it has been over 80 years since the condition was discovered, there is still no definitive test to diagnose it, no cure, and no consensus on what causes it.

Woman after woman shares her story of being misunderstood, misdiagnosed, and turned away when seeking help. More than 50% of women with PCOS remain undiagnosed and unable to receive the help they need.

1. Importing the Libraries and Datasets

The first step to building our model is to import our libraries and datasets into our Google Colab notebook.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from google.colab import files
uploaded = files.upload()  # opens a file picker in Colab; choose PCOS_no_infertility.csv
df = pd.read_csv('PCOS_no_infertility.csv')  # load the dataset into a dataframe

2. Exploratory Data Analysis

Once we have imported the necessary libraries and our dataset, we can conduct an exploratory data analysis to further clean up our data by correcting errors, reformatting, and removing missing values.

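# Promote the first data row to the header, then drop that row
new_header = df.iloc[0]
df = df[1:].copy()
df.columns = new_header
# Convert every column to numeric; values that cannot be parsed become NaN
for column in df.columns:
    df[column] = pd.to_numeric(df[column], errors='coerce')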
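
Because errors='coerce' turns every unparseable cell into NaN, it helps to count the missing values in each column before deciding whether to drop or fill them. The sketch below shows one way to do that with pandas. The dataframe, its column names, and its values are made up for illustration and are not taken from the PCOS dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": ["28", "31", "x", "25"],
    "bmi": ["19.3", "24.9", "25.3", "a"],
    "cycle_length": ["5", "5", "6", "4"],
})

# Coerce every column to numeric; unparseable cells become NaN
toy = toy.apply(pd.to_numeric, errors="coerce")

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Option 1: drop any row that contains a missing value
dropped = toy.dropna()

# Option 2: fill missing values with the column median instead
filled = toy.fillna(toy.median())
```

Dropping rows is the simpler choice but shrinks the dataset, while filling keeps every row at the cost of introducing estimated values. Which one suits the PCOS data depends on how many cells are missing.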

3. Data Visualization

Once we have finished exploring our dataset and visualizing our dataframe, we can look at the correlations within our data.

corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f")
plt.title("Correlation Between Features");
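
With around 40 features, an annotated heatmap can be hard to read. One complementary view is to rank the features by the absolute value of their correlation with the label column. The sketch below does this on synthetic data. The column names ("label", "strong_feature", and so on) are hypothetical placeholders, not columns from the PCOS dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the label closely, one weakly, one not at all
rng = np.random.default_rng(0)
n = 200
label = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "label": label,
    "strong_feature": label + rng.normal(0, 0.3, size=n),
    "weak_feature": label + rng.normal(0, 3.0, size=n),
    "noise_feature": rng.normal(0, 1.0, size=n),
})

# Take the label's column of the correlation matrix, drop the label itself,
# and sort the remaining features by absolute correlation
corr_with_label = (
    toy.corr()["label"]
    .drop("label")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_label)
```

Correlation only captures linear relationships, so a low score here does not necessarily mean a feature is useless to a model.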

4. Preparing Data Before Model Training

Now that we have cleaned up our data and visualized the correlations within the dataframe, we need to prepare the data before training the model.

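# Features are columns 1 to 40; the label is column 0
X = df.iloc[:, 1:41].values
Y = df.iloc[:, 0].values
# Hold out 30% of the rows for testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# Standardize features: fit the scaler on the training set only, then reuse it on the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)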
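
One detail worth keeping in mind when standardizing: the scaler's mean and standard deviation are usually learned from the training set alone and then reused on the test set, so that the test data does not influence preprocessing. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns the mean and std of the training data
X_test_scaled = sc.transform(X_test)        # reuses those training statistics

print(sc.mean_)                # mean learned from X_train
print(X_test_scaled.ravel())   # test values expressed relative to the training mean and std
```

A test value equal to the training mean (2.0 here) maps to 0 after scaling, which would not happen if the scaler were refit on the test set.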

5. Training and Evaluating the Model

In this project, I trained two classification algorithms, a logistic regression model and a random forest classifier, and compared how accurately each one detects PCOS.
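
As a rough sketch of what such a comparison can look like in scikit-learn, the example below trains both classifiers on the same split and reports test accuracy for each. It runs on synthetic data from make_classification rather than the PCOS dataset, and the hyperparameters (max_iter=1000, n_estimators=100) are assumptions chosen for illustration, so the printed numbers say nothing about performance on real patient data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the PCOS data: 500 rows, 40 features, binary label
X, Y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=0)

# Same 70/30 split and scaling approach as in the preparation step
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Hyperparameters here are illustrative assumptions
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training set and score it on the held-out test set
accuracies = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    accuracies[name] = accuracy_score(Y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.3f}")
```

Accuracy alone can be misleading when one class is much rarer than the other, so for a medical screening task it may be worth also looking at a confusion matrix or at precision and recall.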
