27. Exploratory/Inferential Analysis on the Heart Disease Dataset#

pip install ucimlrepo
Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2.2.2)
Requirement already satisfied: certifi>=2020.12.5 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2024.8.30)
Requirement already satisfied: numpy>=1.22.4 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.0.0->ucimlrepo) (1.16.0)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7

We load the data from the UCI ML repository.

from ucimlrepo import fetch_ucirepo

# fetch dataset
heart_disease = fetch_ucirepo(id=45)

# data (as pandas dataframes)
X = heart_disease.data.features
y = heart_disease.data.targets

data = X.join(y)
data
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
0 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
2 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
3 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
298 45 1 1 110 264 0 0 132 0 1.2 2 0.0 7.0 1
299 68 1 4 144 193 1 0 141 0 3.4 2 2.0 7.0 2
300 57 1 4 130 131 0 0 115 1 1.2 2 1.0 7.0 3
301 57 0 2 130 236 0 2 174 0 0.0 2 1.0 3.0 1
302 38 1 3 138 175 0 0 173 0 0.0 1 NaN 3.0 0

303 rows × 14 columns
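Before plotting, a quick numerical summary of the columns helps spot implausible values; a minimal sketch using the data frame built above:

# per-column summary statistics (count, mean, std, min/max, quartiles)
data.describe().T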

import pandas as pd  # pd is used below for pd.crosstab
# alternative: pd.read_csv('heart.csv') to load a local copy
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  num       303 non-null    int64  
dtypes: float64(3), int64(11)
memory usage: 33.3 KB
for column in data.columns:
  x = data[column].isna().sum()
  print("Number of NaN in column", column, "is", x)
Number of NaN in column age is 0
Number of NaN in column sex is 0
Number of NaN in column cp is 0
Number of NaN in column trestbps is 0
Number of NaN in column chol is 0
Number of NaN in column fbs is 0
Number of NaN in column restecg is 0
Number of NaN in column thalach is 0
Number of NaN in column exang is 0
Number of NaN in column oldpeak is 0
Number of NaN in column slope is 0
Number of NaN in column ca is 4
Number of NaN in column thal is 2
Number of NaN in column num is 0
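The same per-column counts can be obtained in one call, and the few rows with missing ca or thal values can be dropped if a complete-case table is needed; a sketch using the data frame above:

# vectorized equivalent of the loop above
data.isna().sum()

# optional: keep only rows with no missing values in the two affected columns
data_complete = data.dropna(subset=['ca', 'thal'])
data_complete.shape  # 303 minus the rows that contain a NaN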
import seaborn as sns

sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x7b5cf0507370>
[Figure: pairwise scatter plots of all variables in the dataset]
data['num'].unique()
array([0, 2, 1, 3, 4])
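In the UCI documentation, num = 0 indicates absence of heart disease and 1-4 indicate increasing severity; the class frequencies can be checked with value_counts (a quick sketch):

# how many patients fall in each diagnosis category
data['num'].value_counts().sort_index()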

27.1. Does age affect maximum heart rate?#

from matplotlib import pyplot as plt
plt.scatter(data['age'], data['thalach'])
plt.show()
[Figure: scatter plot of age vs. thalach (maximum heart rate)]
from scipy.stats import pearsonr

pearsonr(data['age'], data['thalach'])
PearsonRResult(statistic=-0.3938058059868209, pvalue=1.1092756352878365e-12)
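With SciPy 1.9 or later, the same result object can also report a confidence interval for the correlation coefficient; a sketch:

res = pearsonr(data['age'], data['thalach'])
res.confidence_interval(confidence_level=0.95)  # 95% CI for r (Fisher transformation)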
sns.regplot(data, x='age', y='thalach')
plt.grid()
[Figure: regression plot of thalach on age with confidence band]
from statsmodels.formula.api import ols

ols("thalach ~ age", data).fit().summary()
OLS Regression Results
Dep. Variable: thalach R-squared: 0.155
Model: OLS Adj. R-squared: 0.152
Method: Least Squares F-statistic: 55.25
Date: Thu, 21 Nov 2024 Prob (F-statistic): 1.11e-12
Time: 16:54:49 Log-Likelihood: -1352.3
No. Observations: 303 AIC: 2709.
Df Residuals: 301 BIC: 2716.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 203.8634 7.399 27.553 0.000 189.303 218.424
age -0.9966 0.134 -7.433 0.000 -1.261 -0.733
Omnibus: 19.198 Durbin-Watson: 1.744
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.705
Skew: -0.654 Prob(JB): 1.94e-05
Kurtosis: 2.922 Cond. No. 338.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
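The fitted line can also be used for point predictions; a sketch, reusing the same formula, that evaluates the expected maximum heart rate at ages 40 and 60 (with the coefficients above, roughly 164 and 144 bpm):

model = ols("thalach ~ age", data).fit()
model.params                                    # Intercept ≈ 203.9, slope ≈ -1.0 bpm per year of age
model.predict(pd.DataFrame({'age': [40, 60]}))  # predicted thalach at age 40 and 60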
import seaborn as sns

sns.pairplot(data.drop(['cp','fbs','restecg','exang','slope','ca','thal'], axis=1), hue='sex')
<seaborn.axisgrid.PairGrid at 0x7b5cd56825c0>
[Figure: pairplot of the remaining variables, colored by sex]
data['sex'].value_counts().plot.bar()
<Axes: xlabel='sex'>
[Figure: bar chart of patient counts by sex]
data['sex'].value_counts(normalize=True).plot.bar()
<Axes: xlabel='sex'>
[Figure: bar chart of relative frequencies by sex]
pd.crosstab(data['sex'], data['num'],normalize='index').plot.bar(stacked=True)
<Axes: xlabel='sex'>
[Figure: stacked bar chart of num proportions within each sex]
pd.crosstab(data['sex'], data['num']).plot.bar(stacked=True)
<Axes: xlabel='sex'>
[Figure: stacked bar chart of num counts within each sex]
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Perform the Chi-Square Test for Independence
contingency_table = pd.crosstab(data['sex'], data['num'])
chi2, p, _, exp_freq = chi2_contingency(contingency_table)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("Chi-Square Statistic:", chi2)
print("p-value:", p)

print("Cramer's V:", association(contingency_table))

# Interpret the results
if p < alpha:
    print("\nThere is a significant association between 'sex' and 'num' (heart disease status).")
else:
    print("\nThere is no significant association between 'sex' and 'num' (heart disease status).")
Chi-Square Statistic: 23.425407307562654
p-value: 0.00010410590365461076
Cramer's V: 0.27804959139060786

There is a significant association between 'sex' and 'num' (heart disease status).
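The expected frequencies returned by chi2_contingency (captured above as exp_freq but not displayed) can be checked to confirm the usual rule of thumb that every expected count is at least 5; a sketch:

# expected cell counts under independence, labeled like the observed crosstab
pd.DataFrame(exp_freq, index=contingency_table.index, columns=contingency_table.columns)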
data_binary = data.copy()
data_binary['num'] = data_binary['num'].apply(lambda x: 1 if x >= 1 else 0)
crosstab_binary = pd.crosstab(data_binary['sex'], data_binary['num'])
crosstab_binary
num 0 1
sex
0 72 25
1 92 114
from scipy.stats.contingency import relative_risk
# signature: relative_risk(exposed_cases, exposed_total, control_cases, control_total)
# exposure = sex (1 = male), outcome = num (1 = disease): risk of disease for males vs. females
relative_risk(crosstab_binary.loc[1, 1], crosstab_binary.loc[1].sum(),
              crosstab_binary.loc[0, 1], crosstab_binary.loc[0].sum())
RelativeRiskResult(relative_risk=2.1471844660194175, exposed_cases=114, exposed_total=206, control_cases=25, control_total=97)
# reorder rows and columns so the (exposed, diseased) cell sits in the top-left of the 2x2 table

crosstab_binary_inverted = crosstab_binary.iloc[::-1]
crosstab_binary_inverted = crosstab_binary_inverted[[1,0]]
crosstab_binary_inverted
num 1 0
sex
1 114 92
0 25 72
from scipy.stats.contingency import odds_ratio


odds_ratio(crosstab_binary_inverted)
OddsRatioResult(statistic=3.55354424429681)
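The odds-ratio object also exposes a confidence interval, which is worth checking before interpreting the point estimate; a sketch:

res = odds_ratio(crosstab_binary_inverted)
res.confidence_interval(confidence_level=0.95)  # 95% CI for the odds ratio

The point estimate (≈3.55) is close to the exponentiated sex coefficient of the logistic regression in the next section (exp(1.2722) ≈ 3.57).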

27.2. Logistic regression analysis on binarized data#

from statsmodels.formula.api import logit
logit("num ~ sex", data_binary).fit().summary()
Optimization terminated successfully.
         Current function value: 0.650054
         Iterations 5
Logit Regression Results
Dep. Variable: num No. Observations: 303
Model: Logit Df Residuals: 301
Method: MLE Df Model: 1
Date: Thu, 21 Nov 2024 Pseudo R-squ.: 0.05754
Time: 17:36:26 Log-Likelihood: -196.97
converged: True LL-Null: -208.99
Covariance Type: nonrobust LLR p-value: 9.390e-07
coef std err z P>|z| [0.025 0.975]
Intercept -1.0578 0.232 -4.557 0.000 -1.513 -0.603
sex 1.2722 0.271 4.692 0.000 0.741 1.804
import numpy as np

# females (sex=0): the intercept is the log-odds of disease; convert it to a probability
odds = np.exp(-1.0578)
p = odds/(1+odds)
p
0.2577301019775979
# exponentiated sex coefficient: the odds ratio of disease for males vs. females
np.exp(1.2722)
3.568695061578707
# males (sex=1): multiply the female odds by the odds ratio, then convert to a probability
odds_m = odds*np.exp(1.2722)
odds_m/(1+odds_m)
0.5533956185614641
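The same probabilities can be obtained directly from the fitted model, avoiding the hand-rounded coefficients; a sketch that should reproduce the ≈0.26 (females) and ≈0.55 (males) computed above:

model = logit("num ~ sex", data_binary).fit()
model.predict(pd.DataFrame({'sex': [0, 1]}))  # P(num=1) for sex=0 and sex=1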

27.3. Multinomial Logistic Regression Analysis#

from statsmodels.formula.api import mnlogit
mnlogit("num ~ sex", data).fit().summary()
Optimization terminated successfully.
         Current function value: 1.239318
         Iterations 7
MNLogit Regression Results
Dep. Variable: num No. Observations: 303
Model: MNLogit Df Residuals: 295
Method: MLE Df Model: 4
Date: Thu, 21 Nov 2024 Pseudo R-squ.: 0.03141
Time: 17:46:39 Log-Likelihood: -375.51
converged: True LL-Null: -387.69
Covariance Type: nonrobust LLR p-value: 6.775e-05
num=1 coef std err z P>|z| [0.025 0.975]
Intercept -2.0794 0.354 -5.882 0.000 -2.772 -1.386
sex 1.3863 0.397 3.492 0.000 0.608 2.164
num=2 coef std err z P>|z| [0.025 0.975]
Intercept -2.3308 0.396 -5.887 0.000 -3.107 -1.555
sex 1.1763 0.450 2.617 0.009 0.295 2.057
num=3 coef std err z P>|z| [0.025 0.975]
Intercept -2.3308 0.396 -5.887 0.000 -3.107 -1.555
sex 1.1412 0.451 2.531 0.011 0.257 2.025
num=4 coef std err z P>|z| [0.025 0.975]
Intercept -3.5835 0.717 -4.999 0.000 -4.989 -2.178
sex 1.4596 0.785 1.860 0.063 -0.078 2.998
# odds of num=1 relative to the reference category num=0, for females (sex=0) and males (sex=1)
odds1 = np.exp(-2.0794)
odds2 = np.exp(-2.0794)*np.exp(1.3863)

# probabilities of num=1 given that num is either 0 or 1 (not marginal probabilities over all five categories)
p1 = odds1/(1+odds1)
p2 = odds2/(1+odds2)

print(p1, p2)
0.11111521405935315 0.3333438179846534
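For the multinomial model, predict returns the full probability distribution over all five num categories, which makes explicit that p1 and p2 above are only relative to the baseline category num=0; a sketch:

model = mnlogit("num ~ sex", data).fit()
model.predict(pd.DataFrame({'sex': [0, 1]}))  # one row per sex; each row sums to 1 across num=0..4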