33. Linear and Logistic Regression Laboratory#

33.1. Linear Regression#

We will use the datatset available at this URL:

https://www.kaggle.com/datasets/yasserh/advertising-sales-dataset

Let us load the dataset:

import pandas as pd
advertising=pd.read_csv('Advertising Budget and Sales.csv')
advertising=advertising.rename(columns={
    'Unnamed: 0':'Index',
    'TV Ad Budget ($)': 'tv',
    'Radio Ad Budget ($)': 'radio',
    'Newspaper Ad Budget ($)': 'newspaper',
    'Sales ($)': 'sales'
    })
advertising.set_index('Index')
tv radio newspaper sales
Index
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
... ... ... ... ...
196 38.2 3.7 13.8 7.6
197 94.2 4.9 8.1 9.7
198 177.0 9.3 6.4 12.8
199 283.6 42.0 66.2 25.5
200 232.1 8.6 8.7 13.4

200 rows × 4 columns

advertising.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Index      200 non-null    int64  
 1   tv         200 non-null    float64
 2   radio      200 non-null    float64
 3   newspaper  200 non-null    float64
 4   sales      200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB
import seaborn as sns
sns.pairplot(advertising.drop('Index',axis=1))
<seaborn.axisgrid.PairGrid at 0x7b57ba39b460>
../_images/980f2b129fbff1b6cbae22e4c90892b69c290d5f33f88d7ce4b3146547fd6903.png
sns.heatmap(advertising.corr(), annot=True)
<Axes: >
../_images/20c6958139b10f4a6e7ff8d4b53f0ff8ef16f8ec965ff73dbd2178b0761f024f.png
from scipy.stats import pearsonr
pearsonr(advertising['tv'],advertising['radio'])
PearsonRResult(statistic=0.05480866446583009, pvalue=0.440806063788431)

The correlation between tv and radio is not statistically significant.

pearsonr(advertising['newspaper'],advertising['radio'])
PearsonRResult(statistic=0.35410375076117534, pvalue=2.688835078719109e-07)
from statsmodels.formula.api import ols
ols("newspaper ~ radio", advertising).fit().summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 18.4700 2.689 6.870 0.000 13.168 23.772
radio 0.5194 0.097 5.328 0.000 0.327 0.712
ols("radio ~ newspaper", advertising).fit().summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 15.8883 1.699 9.354 0.000 12.539 19.238
newspaper 0.2414 0.045 5.328 0.000 0.152 0.331
sns.regplot(x='radio', y='newspaper', data=advertising)
<Axes: xlabel='radio', ylabel='newspaper'>
../_images/a7e28d74b207684d5cc8d900f00fbb617e9520ae39dcfda48f7349b1872b43bf.png
ols("sales ~ tv", advertising).fit().summary()
OLS Regression Results
Dep. Variable: sales R-squared: 0.612
Model: OLS Adj. R-squared: 0.610
Method: Least Squares F-statistic: 312.1
Date: Thu, 30 Nov 2023 Prob (F-statistic): 1.47e-42
Time: 19:48:09 Log-Likelihood: -519.05
No. Observations: 200 AIC: 1042.
Df Residuals: 198 BIC: 1049.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.0326 0.458 15.360 0.000 6.130 7.935
tv 0.0475 0.003 17.668 0.000 0.042 0.053
Omnibus: 0.531 Durbin-Watson: 1.935
Prob(Omnibus): 0.767 Jarque-Bera (JB): 0.669
Skew: -0.089 Prob(JB): 0.716
Kurtosis: 2.779 Cond. No. 338.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
sns.regplot(x='tv', y='sales', data=advertising)
<Axes: xlabel='tv', ylabel='sales'>
../_images/8f34a70be102736c36d55fbe07ab370fd479f365d2b562973a978a685df6f1ec.png
import statsmodels.api as sm
from matplotlib import pyplot as plt
model=ols("sales ~ tv", advertising).fit()

#otteniamo i valori predetti dal modello:
fitted = model.fittedvalues.fillna(0) #rimpiazzo eventuali NaN con zero

plt.figure(figsize=(20,22))
sns.residplot(x=fitted, y='sales', data=advertising.dropna(),lowess=True,line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8}, ax=plt.subplot(421))
sm.qqplot(fitted-advertising.dropna()['sales'], line='45',fit=True, ax=plt.subplot(422))
plt.show()
../_images/58d525f55548a58a65e88b26e2631c29b0abe98db849974e2dc5fbc76aeac828.png
ols("sales ~ tv + radio", advertising).fit().summary()
OLS Regression Results
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 859.6
Date: Thu, 30 Nov 2023 Prob (F-statistic): 4.83e-98
Time: 19:48:42 Log-Likelihood: -386.20
No. Observations: 200 AIC: 778.4
Df Residuals: 197 BIC: 788.3
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.9211 0.294 9.919 0.000 2.340 3.502
tv 0.0458 0.001 32.909 0.000 0.043 0.048
radio 0.1880 0.008 23.382 0.000 0.172 0.204
Omnibus: 60.022 Durbin-Watson: 2.081
Prob(Omnibus): 0.000 Jarque-Bera (JB): 148.679
Skew: -1.323 Prob(JB): 5.19e-33
Kurtosis: 6.292 Cond. No. 425.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
import statsmodels.api as sm
from matplotlib import pyplot as plt
model=ols("sales ~ tv + radio", advertising).fit()

#otteniamo i valori predetti dal modello:
fitted = model.fittedvalues.fillna(0) #rimpiazzo eventuali NaN con zero

plt.figure(figsize=(20,22))
sns.residplot(x=fitted, y='sales', data=advertising.dropna(),lowess=True,line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8}, ax=plt.subplot(421))
sm.qqplot(fitted-advertising.dropna()['sales'], line='45',fit=True, ax=plt.subplot(422))
plt.show()
../_images/a0a71e9f0c8838637734188a3698ff325569adb187a91b423c1fea45bd04c279.png

Let us add an interaction term:

ols("sales ~ tv   + radio + radio*tv", advertising).fit().summary()
OLS Regression Results
Dep. Variable: sales R-squared: 0.968
Model: OLS Adj. R-squared: 0.967
Method: Least Squares F-statistic: 1963.
Date: Thu, 30 Nov 2023 Prob (F-statistic): 6.68e-146
Time: 19:49:14 Log-Likelihood: -270.14
No. Observations: 200 AIC: 548.3
Df Residuals: 196 BIC: 561.5
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.7502 0.248 27.233 0.000 6.261 7.239
tv 0.0191 0.002 12.699 0.000 0.016 0.022
radio 0.0289 0.009 3.241 0.001 0.011 0.046
radio:tv 0.0011 5.24e-05 20.727 0.000 0.001 0.001
Omnibus: 128.132 Durbin-Watson: 2.224
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1183.719
Skew: -2.323 Prob(JB): 9.09e-258
Kurtosis: 13.975 Cond. No. 1.80e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The regressor with interaction terms can be interpreted as follows:

\[sales = \beta_0 + \beta_1 tv + \beta_2 radio + \beta_3 radio \times tv\]
\[sales = \beta_0 + tv(\beta_1 + \beta_3 radio) + \beta_2 radio\]
\[sales = \beta_0 + \beta_1 tv + radio(\beta_2 + tv \beta_3)\]

33.2. Linear Regression vs Mean Values#

import numpy as np
x = np.random.normal(0,2,100)>0
y = np.random.normal(10,2,100) + 2*x
d= pd.DataFrame({
    'y': y,
    'x': x
})
d['x']=d['x'].astype(int)
d
y x
0 10.091945 0
1 8.620043 1
2 14.530024 0
3 16.894454 1
4 10.864606 1
... ... ...
95 8.241021 0
96 7.561387 0
97 6.688701 0
98 12.406282 1
99 8.094310 0

100 rows × 2 columns

d.describe()
y x
count 100.000000 100.000000
mean 11.124978 0.490000
std 2.509474 0.502418
min 5.701657 0.000000
25% 9.310256 0.000000
50% 10.770151 0.000000
75% 12.717417 1.000000
max 17.218798 1.000000
d[d['x']==0]['y'].mean()
10.18727989115969
d[d['x']==1]['y'].mean()
12.100950474439491
d[d['x']==0]['y'].mean() + 2.2391
12.42637989115969
ols('y ~ x', d).fit().summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.147
Model: OLS Adj. R-squared: 0.138
Method: Least Squares F-statistic: 16.86
Date: Thu, 30 Nov 2023 Prob (F-statistic): 8.34e-05
Time: 19:50:08 Log-Likelihood: -225.46
No. Observations: 100 AIC: 454.9
Df Residuals: 98 BIC: 460.1
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 10.1873 0.326 31.227 0.000 9.540 10.835
x 1.9137 0.466 4.106 0.000 0.989 2.839
Omnibus: 3.658 Durbin-Watson: 2.063
Prob(Omnibus): 0.161 Jarque-Bera (JB): 3.568
Skew: 0.458 Prob(JB): 0.168
Kurtosis: 2.863 Cond. No. 2.60


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
sns.scatterplot(x='x', y='y', data=d)
<Axes: xlabel='x', ylabel='y'>
../_images/fb5fd231d8924b4f76a6b4a65b20b406e211157c25b6dd827cfaef0551c69ac4.png

33.3. Logistic regression#

We will consider this dataset:

https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data

data=pd.read_csv('heart_disease_uci.csv')
data.head()
id age sex dataset cp trestbps chol fbs restecg thalch exang oldpeak slope ca thal num
0 1 63 Male Cleveland typical angina 145.0 233.0 True lv hypertrophy 150.0 False 2.3 downsloping 0.0 fixed defect 0
1 2 67 Male Cleveland asymptomatic 160.0 286.0 False lv hypertrophy 108.0 True 1.5 flat 3.0 normal 2
2 3 67 Male Cleveland asymptomatic 120.0 229.0 False lv hypertrophy 129.0 True 2.6 flat 2.0 reversable defect 1
3 4 37 Male Cleveland non-anginal 130.0 250.0 False normal 187.0 False 3.5 downsloping 0.0 normal 0
4 5 41 Female Cleveland atypical angina 130.0 204.0 False lv hypertrophy 172.0 False 1.4 upsloping 0.0 normal 0
data['attack'] = (data['num']>0).astype(int)
data.describe()
id age trestbps chol thalch oldpeak ca num attack
count 920.000000 920.000000 861.000000 890.000000 865.000000 858.000000 309.000000 920.000000 920.000000
mean 460.500000 53.510870 132.132404 199.130337 137.545665 0.878788 0.676375 0.995652 0.553261
std 265.725422 9.424685 19.066070 110.780810 25.926276 1.091226 0.935653 1.142693 0.497426
min 1.000000 28.000000 0.000000 0.000000 60.000000 -2.600000 0.000000 0.000000 0.000000
25% 230.750000 47.000000 120.000000 175.000000 120.000000 0.000000 0.000000 0.000000 0.000000
50% 460.500000 54.000000 130.000000 223.000000 140.000000 0.500000 0.000000 1.000000 1.000000
75% 690.250000 60.000000 140.000000 268.000000 157.000000 1.500000 1.000000 2.000000 1.000000
max 920.000000 77.000000 200.000000 603.000000 202.000000 6.200000 3.000000 4.000000 1.000000
from statsmodels.formula.api import logit
logit("attack ~ sex", data).fit().summary()
Optimization terminated successfully.
         Current function value: 0.639394
         Iterations 5
Logit Regression Results
Dep. Variable: attack No. Observations: 920
Model: Logit Df Residuals: 918
Method: MLE Df Model: 1
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.06992
Time: 19:51:22 Log-Likelihood: -588.24
converged: True LL-Null: -632.47
Covariance Type: nonrobust LLR p-value: 5.220e-21
coef std err z P>|z| [0.025 0.975]
Intercept -1.0578 0.164 -6.444 0.000 -1.380 -0.736
sex[T.Male] 1.5996 0.181 8.823 0.000 1.244 1.955

The coefficient of sex[T.Male] is telling us that males have a risk of heart attack \(e^{1.5996} \approx 4.95\) times higher than females.

To compute a multiple logistic regressor, let us first extend the dataset with dummy variables:

data2 = pd.get_dummies(data.dropna(), columns=["sex", "dataset", "cp", "fbs", "restecg", "exang", "slope", "thal"], drop_first=True)

We will use the non-forumla API:

from statsmodels.api import Logit
data2['Intercept']=1 # we need to add the intercept manually when using this API
Logit(data2['attack'], data2.drop(['dataset','attack','num','id'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.321016
         Iterations 8
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 280
Method: MLE Df Model: 18
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5352
Time: 20:04:07 Log-Likelihood: -95.984
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 5.920e-37
coef std err z P>|z| [0.025 0.975]
age -0.0137 0.025 -0.554 0.579 -0.062 0.035
trestbps 0.0244 0.011 2.170 0.030 0.002 0.046
chol 0.0042 0.004 1.064 0.287 -0.004 0.012
thalch -0.0180 0.011 -1.625 0.104 -0.040 0.004
oldpeak 0.3610 0.231 1.566 0.117 -0.091 0.813
ca 1.3099 0.280 4.682 0.000 0.762 1.858
sex_Male 1.5500 0.531 2.920 0.003 0.510 2.590
cp_atypical angina -0.8454 0.561 -1.508 0.132 -1.944 0.253
cp_non-anginal -1.8501 0.501 -3.690 0.000 -2.833 -0.867
cp_typical angina -2.1010 0.667 -3.151 0.002 -3.408 -0.794
fbs_True -0.5952 0.609 -0.977 0.329 -1.789 0.599
restecg_normal -0.4695 0.384 -1.222 0.222 -1.222 0.283
restecg_st-t abnormality 0.3128 2.439 0.128 0.898 -4.467 5.092
exang_True 0.7190 0.440 1.634 0.102 -0.143 1.581
slope_flat 0.6474 0.851 0.761 0.447 -1.020 2.315
slope_upsloping -0.5167 0.922 -0.560 0.575 -2.325 1.291
thal_normal 0.0313 0.790 0.040 0.968 -1.517 1.579
thal_reversable defect 1.4329 0.775 1.850 0.064 -0.085 2.951
Intercept -2.8907 2.847 -1.015 0.310 -8.472 2.690

Let us proceed removing the variables with largest p-values wih backward elimination. We’ll drop thal_normal:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','thal_normal'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.321018
         Iterations 8
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 281
Method: MLE Df Model: 17
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5352
Time: 20:05:21 Log-Likelihood: -95.984
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 1.611e-37
coef std err z P>|z| [0.025 0.975]
age -0.0137 0.025 -0.555 0.579 -0.062 0.035
trestbps 0.0244 0.011 2.171 0.030 0.002 0.046
chol 0.0042 0.004 1.067 0.286 -0.004 0.012
thalch -0.0180 0.011 -1.636 0.102 -0.040 0.004
oldpeak 0.3604 0.230 1.567 0.117 -0.090 0.811
ca 1.3098 0.280 4.682 0.000 0.761 1.858
sex_Male 1.5445 0.512 3.016 0.003 0.541 2.548
cp_atypical angina -0.8443 0.560 -1.508 0.132 -1.942 0.253
cp_non-anginal -1.8479 0.498 -3.709 0.000 -2.824 -0.871
cp_typical angina -2.0968 0.658 -3.186 0.001 -3.387 -0.807
fbs_True -0.5979 0.605 -0.988 0.323 -1.784 0.589
restecg_normal -0.4707 0.383 -1.229 0.219 -1.221 0.280
restecg_st-t abnormality 0.3147 2.439 0.129 0.897 -4.465 5.094
exang_True 0.7186 0.440 1.634 0.102 -0.143 1.581
slope_flat 0.6484 0.850 0.763 0.446 -1.018 2.314
slope_upsloping -0.5133 0.919 -0.559 0.576 -2.314 1.287
thal_reversable defect 1.4068 0.405 3.471 0.001 0.612 2.201
Intercept -2.8676 2.786 -1.029 0.303 -8.329 2.593

Let us remove restecg_st-t abnormality:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.321046
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 282
Method: MLE Df Model: 16
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5352
Time: 20:05:33 Log-Likelihood: -95.993
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 4.281e-38
coef std err z P>|z| [0.025 0.975]
age -0.0137 0.025 -0.553 0.581 -0.062 0.035
trestbps 0.0245 0.011 2.187 0.029 0.003 0.046
chol 0.0042 0.004 1.061 0.289 -0.004 0.012
thalch -0.0181 0.011 -1.650 0.099 -0.040 0.003
oldpeak 0.3633 0.229 1.588 0.112 -0.085 0.812
ca 1.3091 0.280 4.680 0.000 0.761 1.857
sex_Male 1.5371 0.508 3.023 0.003 0.541 2.534
cp_atypical angina -0.8440 0.560 -1.507 0.132 -1.942 0.254
cp_non-anginal -1.8463 0.498 -3.705 0.000 -2.823 -0.870
cp_typical angina -2.0996 0.658 -3.192 0.001 -3.389 -0.810
fbs_True -0.6002 0.605 -0.991 0.321 -1.787 0.586
restecg_normal -0.4752 0.381 -1.246 0.213 -1.223 0.272
exang_True 0.7188 0.440 1.634 0.102 -0.143 1.581
slope_flat 0.6525 0.850 0.768 0.443 -1.014 2.319
slope_upsloping -0.5079 0.918 -0.553 0.580 -2.308 1.292
thal_reversable defect 1.4075 0.405 3.474 0.001 0.614 2.202
Intercept -2.8552 2.784 -1.025 0.305 -8.312 2.602

We now remove age:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal','age'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.321559
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 283
Method: MLE Df Model: 15
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5344
Time: 20:05:46 Log-Likelihood: -96.146
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 1.261e-38
coef std err z P>|z| [0.025 0.975]
trestbps 0.0228 0.011 2.126 0.034 0.002 0.044
chol 0.0039 0.004 1.002 0.316 -0.004 0.011
thalch -0.0159 0.010 -1.561 0.119 -0.036 0.004
oldpeak 0.3710 0.228 1.626 0.104 -0.076 0.818
ca 1.2705 0.269 4.718 0.000 0.743 1.798
sex_Male 1.5667 0.504 3.106 0.002 0.578 2.555
cp_atypical angina -0.8593 0.560 -1.536 0.125 -1.956 0.237
cp_non-anginal -1.8614 0.498 -3.738 0.000 -2.837 -0.885
cp_typical angina -2.1148 0.656 -3.224 0.001 -3.400 -0.829
fbs_True -0.6053 0.602 -1.006 0.315 -1.785 0.574
restecg_normal -0.4670 0.381 -1.227 0.220 -1.213 0.279
exang_True 0.7299 0.438 1.666 0.096 -0.129 1.589
slope_flat 0.6455 0.851 0.758 0.448 -1.023 2.314
slope_upsloping -0.5038 0.920 -0.548 0.584 -2.307 1.299
thal_reversable defect 1.3990 0.404 3.466 0.001 0.608 2.190
Intercept -3.6527 2.381 -1.534 0.125 -8.319 1.013

It’s now the turn of slope_upsloping:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal','age','slope_upsloping'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.322049
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 284
Method: MLE Df Model: 14
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5337
Time: 20:05:54 Log-Likelihood: -96.293
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 3.566e-39
coef std err z P>|z| [0.025 0.975]
trestbps 0.0227 0.011 2.116 0.034 0.002 0.044
chol 0.0037 0.004 0.970 0.332 -0.004 0.011
thalch -0.0164 0.010 -1.619 0.105 -0.036 0.003
oldpeak 0.4256 0.205 2.078 0.038 0.024 0.827
ca 1.2458 0.264 4.715 0.000 0.728 1.764
sex_Male 1.5501 0.502 3.087 0.002 0.566 2.534
cp_atypical angina -0.8492 0.558 -1.521 0.128 -1.943 0.245
cp_non-anginal -1.8661 0.498 -3.744 0.000 -2.843 -0.889
cp_typical angina -2.1161 0.654 -3.235 0.001 -3.398 -0.834
fbs_True -0.5632 0.594 -0.947 0.343 -1.728 0.602
restecg_normal -0.4814 0.379 -1.269 0.205 -1.225 0.262
exang_True 0.7304 0.438 1.669 0.095 -0.127 1.588
slope_flat 1.0485 0.429 2.442 0.015 0.207 1.890
thal_reversable defect 1.4014 0.404 3.472 0.001 0.610 2.193
Intercept -3.9750 2.302 -1.727 0.084 -8.487 0.537

Let’s remove fbs_True:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal','age','slope_upsloping','fbs_True'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.323582
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 285
Method: MLE Df Model: 13
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5315
Time: 20:06:06 Log-Likelihood: -96.751
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 1.307e-39
coef std err z P>|z| [0.025 0.975]
trestbps 0.0210 0.011 1.998 0.046 0.000 0.042
chol 0.0036 0.004 0.936 0.349 -0.004 0.011
thalch -0.0164 0.010 -1.628 0.103 -0.036 0.003
oldpeak 0.4380 0.205 2.135 0.033 0.036 0.840
ca 1.1942 0.256 4.664 0.000 0.692 1.696
sex_Male 1.4928 0.498 3.000 0.003 0.518 2.468
cp_atypical angina -0.8841 0.554 -1.596 0.110 -1.970 0.201
cp_non-anginal -1.9746 0.489 -4.039 0.000 -2.933 -1.016
cp_typical angina -2.1802 0.653 -3.340 0.001 -3.460 -0.901
restecg_normal -0.4735 0.379 -1.250 0.211 -1.216 0.269
exang_True 0.6861 0.435 1.578 0.115 -0.166 1.539
slope_flat 1.0228 0.428 2.391 0.017 0.184 1.861
thal_reversable defect 1.4331 0.402 3.564 0.000 0.645 2.221
Intercept -3.6750 2.257 -1.628 0.104 -8.099 0.749

Let’s remove chol:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal','age','slope_upsloping','fbs_True','chol'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.325012
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 286
Method: MLE Df Model: 12
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5294
Time: 20:06:19 Log-Likelihood: -97.179
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 4.481e-40
coef std err z P>|z| [0.025 0.975]
trestbps 0.0215 0.010 2.045 0.041 0.001 0.042
thalch -0.0152 0.010 -1.538 0.124 -0.035 0.004
oldpeak 0.4460 0.205 2.174 0.030 0.044 0.848
ca 1.1928 0.255 4.686 0.000 0.694 1.692
sex_Male 1.3485 0.467 2.889 0.004 0.434 2.263
cp_atypical angina -0.8627 0.551 -1.565 0.118 -1.943 0.218
cp_non-anginal -1.9666 0.488 -4.026 0.000 -2.924 -1.009
cp_typical angina -2.1905 0.651 -3.365 0.001 -3.466 -0.915
restecg_normal -0.5380 0.371 -1.448 0.147 -1.266 0.190
exang_True 0.6821 0.432 1.579 0.114 -0.164 1.529
slope_flat 1.0435 0.427 2.446 0.014 0.207 1.880
thal_reversable defect 1.4706 0.402 3.659 0.000 0.683 2.258
Intercept -2.9396 2.092 -1.405 0.160 -7.039 1.160

Let’s remove restecg_normal:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal','age','slope_upsloping','fbs_True','chol','restecg_normal'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.328559
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 287
Method: MLE Df Model: 11
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5243
Time: 20:06:47 Log-Likelihood: -98.239
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 2.704e-40
coef std err z P>|z| [0.025 0.975]
trestbps 0.0225 0.010 2.196 0.028 0.002 0.043
thalch -0.0152 0.010 -1.530 0.126 -0.035 0.004
oldpeak 0.4357 0.200 2.183 0.029 0.045 0.827
ca 1.2152 0.253 4.806 0.000 0.720 1.711
sex_Male 1.3796 0.466 2.961 0.003 0.466 2.293
cp_atypical angina -0.8622 0.550 -1.567 0.117 -1.940 0.216
cp_non-anginal -1.9589 0.485 -4.038 0.000 -2.910 -1.008
cp_typical angina -2.1394 0.646 -3.313 0.001 -3.405 -0.874
exang_True 0.6737 0.432 1.560 0.119 -0.173 1.520
slope_flat 1.1014 0.422 2.608 0.009 0.274 1.929
thal_reversable defect 1.4078 0.395 3.567 0.000 0.634 2.181
Intercept -3.3894 2.056 -1.648 0.099 -7.420 0.641

Let us now remove exang_True:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal','age','slope_upsloping','fbs_True','chol','restecg_normal','exang_True'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.332571
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 288
Method: MLE Df Model: 10
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5185
Time: 20:07:11 Log-Likelihood: -99.439
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 1.791e-40
coef std err z P>|z| [0.025 0.975]
trestbps 0.0233 0.010 2.278 0.023 0.003 0.043
thalch -0.0180 0.010 -1.845 0.065 -0.037 0.001
oldpeak 0.4647 0.197 2.354 0.019 0.078 0.852
ca 1.2033 0.252 4.781 0.000 0.710 1.697
sex_Male 1.3780 0.458 3.006 0.003 0.480 2.276
cp_atypical angina -1.0592 0.537 -1.973 0.048 -2.111 -0.007
cp_non-anginal -2.1451 0.472 -4.549 0.000 -3.069 -1.221
cp_typical angina -2.3297 0.642 -3.626 0.000 -3.589 -1.071
slope_flat 1.1436 0.420 2.724 0.006 0.321 1.966
thal_reversable defect 1.4692 0.391 3.758 0.000 0.703 2.235
Intercept -2.8221 2.011 -1.404 0.160 -6.763 1.119

Let us now remove thalch:

Logit(data2['attack'], data2.drop(['dataset','attack','num','id','restecg_st-t abnormality','thal_normal','age','slope_upsloping','fbs_True','chol','restecg_normal','exang_True','thalch'],axis=1)).fit().summary()
Optimization terminated successfully.
         Current function value: 0.338536
         Iterations 7
Logit Regression Results
Dep. Variable: attack No. Observations: 299
Model: Logit Df Residuals: 289
Method: MLE Df Model: 9
Date: Thu, 30 Nov 2023 Pseudo R-squ.: 0.5099
Time: 20:07:42 Log-Likelihood: -101.22
converged: True LL-Null: -206.51
Covariance Type: nonrobust LLR p-value: 1.996e-40
coef std err z P>|z| [0.025 0.975]
trestbps 0.0220 0.010 2.145 0.032 0.002 0.042
oldpeak 0.5145 0.194 2.659 0.008 0.135 0.894
ca 1.2539 0.252 4.985 0.000 0.761 1.747
sex_Male 1.3382 0.455 2.942 0.003 0.447 2.230
cp_atypical angina -1.2521 0.521 -2.403 0.016 -2.274 -0.231
cp_non-anginal -2.3023 0.464 -4.966 0.000 -3.211 -1.394
cp_typical angina -2.5791 0.644 -4.004 0.000 -3.842 -1.317
slope_flat 1.3792 0.397 3.470 0.001 0.600 2.158
thal_reversable defect 1.5143 0.388 3.907 0.000 0.755 2.274
Intercept -5.4495 1.485 -3.669 0.000 -8.361 -2.538

Al p-values are now below the significance level \(\alpha=0.05\). We can now proceed interpreting the result.