Insurance Premium Prediction Analysis¶

Problem description¶

Perform exploratory data analysis (EDA) to identify the factors that most affect the premium price.
Predict the premium price from user-supplied inputs.

Importing Python Libraries¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm
from scipy import stats
from scipy.stats import shapiro, boxcox, mannwhitneyu, chi2_contingency
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

Loading the Dataset¶

df = pd.read_csv("insurance.csv")

Preliminary Analysis¶

df.shape

(986, 11)

df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Age                      986 non-null    int64
 1   Diabetes                 986 non-null    int64
 2   BloodPressureProblems    986 non-null    int64
 3   AnyTransplants           986 non-null    int64
 4   AnyChronicDiseases       986 non-null    int64
 5   Height                   986 non-null    int64
 6   Weight                   986 non-null    int64
 7   KnownAllergies           986 non-null    int64
 8   HistoryOfCancerInFamily  986 non-null    int64
 9   NumberOfMajorSurgeries   986 non-null    int64
 10  PremiumPrice             986 non-null    int64
dtypes: int64(11)
memory usage: 84.9 KB

df.describe()

df.isnull().sum()

Age                        0
Diabetes                   0
BloodPressureProblems      0
AnyTransplants             0
AnyChronicDiseases         0
Height                     0
Weight                     0
KnownAllergies             0
HistoryOfCancerInFamily    0
NumberOfMajorSurgeries     0
PremiumPrice               0
dtype: int64

It is clear that there are no missing values in the dataset.

Outlier Detection¶

# Names of the numeric columns to scan with the 1.5 * IQR rule
columns = df.select_dtypes(include='number').columns
for col in columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    count = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {count} outliers")

Age: 0 outliers
Diabetes: 0 outliers
BloodPressureProblems: 0 outliers
AnyTransplants: 55 outliers
AnyChronicDiseases: 178 outliers
Height: 0 outliers
Weight: 16 outliers
KnownAllergies: 212 outliers
HistoryOfCancerInFamily: 116 outliers
NumberOfMajorSurgeries: 16 outliers
PremiumPrice: 6 outliers
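
When reading these counts, it helps to see how the 1.5 * IQR rule behaves on a 0/1 column. The counts reported for AnyTransplants, AnyChronicDiseases, KnownAllergies and HistoryOfCancerInFamily (55, 178, 212, 116) equal the number of rows where the flag is 1. The sketch below uses a synthetic flag (an assumption for illustration, not the insurance data) to show why: when more than 75% of the values are 0, both quartiles are 0, the fences collapse to [0, 0], and every 1 is flagged.

```python
import pandas as pd

# Synthetic 0/1 flag: 90 zeros and 10 ones (illustrative, not the insurance data).
flag = pd.Series([0] * 90 + [1] * 10)

# Same 1.5 * IQR rule as in the loop above.
q1, q3 = flag.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = int(((flag < lower) | (flag > upper)).sum())

print(f"q1={q1}, q3={q3}, iqr={iqr}, fences=[{lower}, {upper}]")
print(f"flagged={flagged}, ones={int((flag == 1).sum())}")
```

For a binary column like this, the flagged count is simply the size of the minority class, so it may be more informative to inspect the class balance than the IQR result.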

for col in columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = df[(df[col] < lower) | (df[col] > upper)]
    print(f"Outliers in {col}:\n{outliers}\n{'-'*100}")

Outliers in Age:
Empty DataFrame
Columns: [Age, Diabetes, BloodPressureProblems, AnyTransplants, AnyChronicDiseases, Height, Weight, KnownAllergies, HistoryOfCancerInFamily, NumberOfMajorSurgeries, PremiumPrice]
Index: []
----------------------------------------------------------------------------------------------------
Outliers in Diabetes:
Empty DataFrame
Columns: [Age, Diabetes, BloodPressureProblems, AnyTransplants, AnyChronicDiseases, Height, Weight, KnownAllergies, HistoryOfCancerInFamily, NumberOfMajorSurgeries, PremiumPrice]
Index: []
----------------------------------------------------------------------------------------------------
Outliers in BloodPressureProblems:
Empty DataFrame
Columns: [Age, Diabetes, BloodPressureProblems, AnyTransplants, AnyChronicDiseases, Height, Weight, KnownAllergies, HistoryOfCancerInFamily, NumberOfMajorSurgeries, PremiumPrice]
Index: []
----------------------------------------------------------------------------------------------------
Outliers in AnyTransplants:
     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
14    18         0                      0               1                   0   
26    22         0                      0               1                   0   
39    24         1                      1               1                   0   
73    25         1                      1               1                   1   
79    30         0                      0               1                   1   
86    46         0                      1               1                   0   
101   45         1                      0               1                   0   
125   31         1                      0               1                   0   
139   48         0                      1               1                   1   
147   59         0                      1               1                   0   
169   42         1                      0               1                   0   
173   47         1                      0               1                   0   
187   25         1                      1               1                   0   
218   62         0                      1               1                   0   
289   64         0                      1               1                   0   
294   40         0                      0               1                   0   
314   59         1                      1               1                   0   
318   55         1                      0               1                   0   
369   36         0                      1               1                   1   
395   28         1                      1               1                   1   
396   33         0                      1               1                   1   
438   24         1                      0               1                   0   
439   65         0                      1               1                   0   
453   58         0                      1               1                   0   
459   58         0                      0               1                   0   
464   24         1                      1               1                   0   
504   52         1                      0               1                   0   
528   51         0                      0               1                   0   
571   43         0                      1               1                   1   
621   32         0                      1               1                   0   
624   34         0                      0               1                   0   
666   50         0                      1               1                   0   
670   36         0                      0               1                   1   
683   59         0                      0               1                   1   
708   56         1                      0               1                   0   
715   32         0                      0               1                   1   
740   27         0                      0               1                   1   
746   24         0                      0               1                   0   
788   30         0                      0               1                   0   
801   24         1                      0               1                   0   
804   44         0                      1               1                   0   
817   65         1                      0               1                   0   
841   56         1                      0               1                   0   
863   25         0                      0               1                   0   
865   66         0                      1               1                   0   
906   55         1                      0               1                   0   
915   54         0                      1               1                   1   
922   33         0                      0               1                   0   
927   31         0                      0               1                   0   
932   42         0                      0               1                   0   
938   31         0                      0               1                   1   
951   25         0                      0               1                   0   
952   44         1                      0               1                   0   
978   40         0                      1               1                   0   
980   40         0                      1               1                   0   

     Height  Weight  KnownAllergies  HistoryOfCancerInFamily  \
14      150      76               0                        0   
26      151      97               0                        0   
39      168      91               1                        0   
73      179      68               0                        0   
79      166      87               0                        0   
86      152      94               0                        1   
101     152      91               0                        0   
125     187      95               1                        0   
139     176      63               0                        0   
147     149      68               0                        0   
169     170      80               0                        0   
173     168      83               0                        0   
187     157     109               0                        0   
218     164     121               1                        0   
289     176      71               0                        1   
294     164      87               0                        0   
314     158      78               0                        0   
318     159      85               0                        0   
369     164      58               0                        0   
395     156      58               0                        0   
396     159      68               0                        0   
438     173      57               0                        0   
439     179      58               0                        0   
453     163      51               1                        0   
459     179      62               0                        0   
464     166      74               1                        0   
504     183      86               0                        0   
528     181      94               1                        0   
571     177      89               0                        0   
621     181      84               0                        0   
624     184      89               1                        0   
666     176      54               1                        0   
670     166      79               0                        0   
683     179      62               0                        0   
708     185     100               0                        0   
715     171      57               0                        1   
740     168      83               0                        0   
746     176      83               0                        0   
788     158      81               0                        1   
801     162      78               0                        0   
804     151      85               1                        0   
817     171      94               0                        0   
841     149      65               0                        0   
863     175      74               0                        0   
865     154      66               1                        0   
906     170      70               0                        0   
915     156      73               0                        0   
922     162      70               0                        0   
927     154      73               0                        1   
932     156      75               1                        0   
938     174      66               0                        0   
951     161      69               1                        0   
952     174      66               0                        0   
978     168      70               0                        0   
980     171      74               0                        0   

     NumberOfMajorSurgeries  PremiumPrice  
14                        1         15000  
26                        0         15000  
39                        0         15000  
73                        0         38000  
79                        0         38000  
86                        1         38000  
101                       0         38000  
125                       0         38000  
139                       0         38000  
147                       0         38000  
169                       1         38000  
173                       1         38000  
187                       0         15000  
218                       1         38000  
289                       1         38000  
294                       0         38000  
314                       1         38000  
318                       0         38000  
369                       0         38000  
395                       1         21000  
396                       1         38000  
438                       0         15000  
439                       1         35000  
453                       2         28000  
459                       0         38000  
464                       0         15000  
504                       0         38000  
528                       1         38000  
571                       1         38000  
621                       0         38000  
624                       1         38000  
666                       2         30000  
670                       0         38000  
683                       0         38000  
708                       0         38000  
715                       1         38000  
740                       0         38000  
746                       0         15000  
788                       1         27000  
801                       1         15000  
804                       1         38000  
817                       3         28000  
841                       2         28000  
863                       0         15000  
865                       1         35000  
906                       2         28000  
915                       2         28000  
922                       0         38000  
927                       1         38000  
932                       1         38000  
938                       1         38000  
951                       1         15000  
952                       1         38000  
978                       0         17000  
980                       0         38000  
----------------------------------------------------------------------------------------------------
Outliers in AnyChronicDiseases:
     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
3     52         1                      1               0                   1   
4     38         0                      0               0                   1   
16    42         0                      0               0                   1   
28    30         0                      0               0                   1   
29    33         1                      1               0                   1   
..   ...       ...                    ...             ...                 ...   
938   31         0                      0               1                   1   
940   40         1                      1               0                   1   
942   25         1                      1               0                   1   
966   66         0                      1               0                   1   
977   45         0                      1               0                   1   

     Height  Weight  KnownAllergies  HistoryOfCancerInFamily  \
3       183      93               0                        0   
4       166      88               0                        0   
16      149      67               0                        0   
28      162      73               1                        0   
29      153      58               0                        0   
..      ...     ...             ...                      ...   
938     174      66               0                        0   
940     155      66               0                        0   
942     151      70               0                        0   
966     176      71               1                        0   
977     157      67               0                        0   

     NumberOfMajorSurgeries  PremiumPrice  
3                         2         28000  
4                         1         23000  
16                        0         30000  
28                        0         23000  
29                        0         21000  
..                      ...           ...  
938                       1         38000  
940                       1         30000  
942                       0         19000  
966                       1         35000  
977                       1         25000  

[178 rows x 11 columns]
----------------------------------------------------------------------------------------------------
Outliers in Height:
Empty DataFrame
Columns: [Age, Diabetes, BloodPressureProblems, AnyTransplants, AnyChronicDiseases, Height, Weight, KnownAllergies, HistoryOfCancerInFamily, NumberOfMajorSurgeries, PremiumPrice]
Index: []
----------------------------------------------------------------------------------------------------
Outliers in Weight:
     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
154   21         0                      0               0                   0   
158   43         0                      0               0                   0   
183   36         0                      0               0                   0   
186   19         0                      0               0                   0   
192   50         1                      0               0                   0   
195   19         0                      0               0                   0   
196   39         0                      1               0                   0   
203   24         0                      1               0                   0   
204   27         0                      1               0                   0   
207   18         0                      0               0                   0   
217   65         0                      1               0                   0   
218   62         0                      1               1                   0   
221   61         0                      1               0                   1   
225   57         1                      0               0                   0   
227   44         0                      0               0                   0   
228   22         1                      1               0                   0   

     Height  Weight  KnownAllergies  HistoryOfCancerInFamily  \
154     157     118               1                        0   
158     158     121               0                        0   
183     156     119               0                        0   
186     173     129               1                        0   
192     163     127               0                        0   
195     164     132               0                        0   
196     174     120               0                        0   
203     173     128               0                        0   
204     159     120               1                        1   
207     172     123               0                        1   
217     177     126               0                        0   
218     164     121               1                        0   
221     174     118               0                        0   
225     160     128               0                        0   
227     182     124               0                        1   
228     166     122               0                        0   

     NumberOfMajorSurgeries  PremiumPrice  
154                       1         15000  
158                       0         23000  
183                       0         23000  
186                       0         15000  
192                       2         28000  
195                       0         15000  
196                       0         23000  
203                       1         26000  
204                       1         39000  
207                       1         15000  
217                       2         24000  
218                       1         38000  
221                       1         35000  
225                       0         35000  
227                       1         31000  
228                       0         15000  
----------------------------------------------------------------------------------------------------
Outliers in KnownAllergies:
     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
5     30         0                      0               0                   0   
7     23         0                      0               0                   0   
8     48         1                      0               0                   0   
12    24         0                      0               0                   0   
15    38         0                      0               0                   0   
..   ...       ...                    ...             ...                 ...   
966   66         0                      1               0                   1   
967   42         1                      1               0                   0   
972   31         0                      1               0                   0   
984   47         1                      1               0                   0   
985   21         0                      0               0                   0   

     Height  Weight  KnownAllergies  HistoryOfCancerInFamily  \
5       160      69               1                        0   
7       181      79               1                        0   
8       169      74               1                        0   
12      178      57               1                        0   
15      160      68               1                        0   
..      ...     ...             ...                      ...   
966     176      71               1                        0   
967     152      67               1                        0   
972     152      75               1                        0   
984     158      73               1                        0   
985     158      75               1                        0   

     NumberOfMajorSurgeries  PremiumPrice  
5                         1         23000  
7                         0         15000  
8                         0         23000  
12                        1         15000  
15                        1         23000  
..                      ...           ...  
966                       1         35000  
967                       1         23000  
972                       1         23000  
984                       1         39000  
985                       1         15000  

[212 rows x 11 columns]
----------------------------------------------------------------------------------------------------
Outliers in HistoryOfCancerInFamily:
     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
24    53         0                      1               0                   0   
38    43         0                      0               0                   1   
42    25         0                      1               0                   0   
67    19         0                      0               0                   0   
85    31         0                      0               0                   1   
..   ...       ...                    ...             ...                 ...   
933   54         0                      1               0                   0   
934   54         1                      1               0                   0   
935   38         0                      1               0                   1   
943   53         0                      0               0                   0   
961   59         1                      1               0                   0   

     Height  Weight  KnownAllergies  HistoryOfCancerInFamily  \
24      151      97               0                        1   
38      173      81               0                        1   
42      184      55               0                        1   
67      148      60               0                        1   
85      150      81               1                        1   
..      ...     ...             ...                      ...   
933     161      69               0                        1   
934     160      70               0                        1   
935     170      71               0                        1   
943     173      66               0                        1   
961     154      66               0                        1   

     NumberOfMajorSurgeries  PremiumPrice  
24                        1         35000  
38                        1         30000  
42                        1         15000  
67                        1         15000  
85                        1         25000  
..                      ...           ...  
933                       1         25000  
934                       1         31000  
935                       1         31000  
943                       1         25000  
961                       1         25000  

[116 rows x 11 columns]
----------------------------------------------------------------------------------------------------
Outliers in NumberOfMajorSurgeries:
     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
34    64         1                      0               0                   0   
91    63         1                      1               0                   0   
109   64         1                      0               0                   0   
274   64         1                      0               0                   0   
282   64         1                      0               0                   0   
309   64         1                      0               0                   0   
401   65         1                      1               0                   0   
458   61         1                      0               0                   0   
463   65         1                      0               0                   0   
605   62         0                      0               0                   0   
817   65         1                      0               1                   0   
818   65         1                      0               0                   0   
860   61         1                      0               0                   0   
953   62         1                      1               0                   0   
954   63         1                      1               0                   0   
982   64         1                      1               0                   0   

     Height  Weight  KnownAllergies  HistoryOfCancerInFamily  \
34      172      85               0                        0   
91      175      95               0                        0   
109     187      53               0                        0   
274     163      88               0                        0   
282     175      81               0                        0   
309     177      94               0                        0   
401     167      52               0                        0   
458     178      71               0                        0   
463     184      78               0                        0   
605     177      97               0                        0   
817     171      94               0                        0   
818     165      99               0                        0   
860     161      66               0                        0   
953     157      66               0                        0   
954     158      73               0                        0   
982     153      70               0                        0   

     NumberOfMajorSurgeries  PremiumPrice  
34                        3         28000  
91                        3         28000  
109                       3         28000  
274                       3         28000  
282                       3         28000  
309                       3         28000  
401                       3         28000  
458                       3         28000  
463                       3         28000  
605                       3         28000  
817                       3         28000  
818                       3         28000  
860                       3         28000  
953                       3         28000  
954                       3         28000  
982                       3         28000  
----------------------------------------------------------------------------------------------------
Outliers in PremiumPrice:
     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
204   27         0                      1               0                   0   
295   64         1                      1               0                   1   
926   24         0                      1               0                   0   
928   19         0                      0               0                   0   
976   21         0                      1               0                   0   
984   47         1                      1               0                   0   

     Height  Weight  KnownAllergies  HistoryOfCancerInFamily  \
204     159     120               1                        1   
295     163      91               0                        0   
926     159      67               0                        0   
928     171      67               0                        0   
976     155      74               0                        0   
984     158      73               1                        0   

     NumberOfMajorSurgeries  PremiumPrice  
204                       1         39000  
295                       2         40000  
926                       0         39000  
928                       1         39000  
976                       0         39000  
984                       1         39000  
----------------------------------------------------------------------------------------------------

out_cols = ['AnyTransplants', 'AnyChronicDiseases', 'Weight', 'KnownAllergies', 'HistoryOfCancerInFamily', 'NumberOfMajorSurgeries', 'PremiumPrice']
for cols in out_cols:
    print('_'*100)
    print(df[cols].value_counts())

____________________________________________________________________________________________________
AnyTransplants
0    931
1     55
Name: count, dtype: int64
____________________________________________________________________________________________________
AnyChronicDiseases
0    808
1    178
Name: count, dtype: int64
____________________________________________________________________________________________________
Weight
73     43
75     41
74     38
70     34
67     31
       ..
126     1
124     1
105     1
102     1
122     1
Name: count, Length: 74, dtype: int64
____________________________________________________________________________________________________
KnownAllergies
0    774
1    212
Name: count, dtype: int64
____________________________________________________________________________________________________
HistoryOfCancerInFamily
0    870
1    116
Name: count, dtype: int64
____________________________________________________________________________________________________
NumberOfMajorSurgeries
0    479
1    372
2    119
3     16
Name: count, dtype: int64
____________________________________________________________________________________________________
PremiumPrice
23000    249
15000    202
28000    132
25000    103
29000     72
30000     47
35000     41
38000     34
31000     31
21000     26
19000     15
26000      7
39000      5
32000      4
24000      4
16000      3
36000      2
18000      2
34000      2
22000      1
20000      1
40000      1
27000      1
17000      1
Name: count, dtype: int64

From the above analysis, these outliers should not be removed: all 986 rows are used in the rest of the analysis.
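The six rows listed above all have a PremiumPrice of 39000 or 40000, which is what a 1.5×IQR rule would flag given the quartiles in the summary statistics (Q1 = 21000, Q3 = 28000). Whether the notebook used exactly this rule is an assumption; the helper name `iqr_bounds` and the small `premiums` series below are illustrative only.

```python
import pandas as pd

def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k*IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of PremiumPrice taken from the df.describe() output
lower, upper = iqr_bounds(21000, 28000)
print(lower, upper)  # 10500.0 38500.0

# Illustrative values only: flag anything outside the fences
premiums = pd.Series([15000, 23000, 28000, 39000, 40000])
flagged = premiums[~premiums.between(lower, upper)]
print(flagged.tolist())  # [39000, 40000]
```

Flagging is kept separate from filtering here, so the decision to keep or drop the flagged rows stays explicit.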

Visual Analysis¶

Univariate Analysis¶

# Creating new feature BMI for the better analysis
df['BMI'] = round(df['Weight'] / ((df['Height'])/100)**2, 2)

# Creating BMI categories bin based on standard classifications
df['BMI_Category'] = pd.cut(df['BMI'], 
                           bins=[0, 18.5, 25, 30, float('inf')], 
                           labels=['Underweight', 'Normal', 'Overweight', 'Obese'])
                           
df['BMI_Category'].value_counts()

BMI_Category
Overweight     326
Normal         319
Obese          302
Underweight     39
Name: count, dtype: int64
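As a quick check of the BMI formula and the bins used above, here is a minimal sketch for a single person. The helper names `bmi` and `bmi_category` are illustrative and not part of the notebook; the thresholds mirror the right-inclusive bins of the `pd.cut` call.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by the square of height in metres."""
    return round(weight_kg / (height_cm / 100) ** 2, 2)

def bmi_category(value):
    """Same right-inclusive bins as pd.cut: (0, 18.5], (18.5, 25], (25, 30], (30, inf)."""
    if value <= 18.5:
        return "Underweight"
    if value <= 25:
        return "Normal"
    if value <= 30:
        return "Overweight"
    return "Obese"

example = bmi(57, 155)  # 57 kg at 155 cm
print(example, bmi_category(example))  # 23.73 Normal
```

Because the bins are right-inclusive, a BMI of exactly 25 is labelled Normal and exactly 30 is labelled Overweight; passing `right=False` to `pd.cut` would move those edge values into the next category.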

clmns = ['Age', 'PremiumPrice', 'BMI']

for cols in clmns:
    plt.figure(figsize=(12, 10))
    sns.histplot(data=df, x=cols, kde=True, bins=30, color='blue')
    plt.title(f"{cols} Distribution")
    plt.show()

From the above graphs, Age and Premium Price are not normally distributed.
BMI looks roughly normal but slightly right-skewed.
We will use a statistical test to check and confirm normality.
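The visual impression of skew can be backed by a number. The sketch below uses `scipy.stats.skew` on synthetic samples (not the insurance data) to show what a symmetric and a right-skewed distribution look like numerically; in the notebook the same call could be applied to `df['BMI']`, which is an assumption about usage rather than something the original does.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=27, scale=5, size=1000)         # synthetic, bell-shaped
right_skewed = rng.gamma(shape=2.0, scale=3.0, size=1000)  # synthetic, long right tail

# Skewness near 0 means symmetric; positive values mean a longer right tail
print(f"symmetric sample skewness:    {skew(symmetric):.2f}")
print(f"right-skewed sample skewness: {skew(right_skewed):.2f}")
```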

Multi-variate Analysis¶

fig = px.scatter(df, x='Age', y='PremiumPrice',
                 title='Premium Price by Age and BMI Category',
                 color='BMI_Category')  # categorical color, so no continuous color scale is needed
fig.update_layout(title_x=0.5)
fig.show()

💡 Older people who are obese or overweight tend to get higher premium prices.

💡 Younger people who are of normal weight or underweight tend to get lower premium prices.

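cat_cols = ['Diabetes', 'BloodPressureProblems', 'AnyTransplants', 'AnyChronicDiseases', 'KnownAllergies', 'HistoryOfCancerInFamily']
for cols in cat_cols:
    # Cast to str so Plotly treats the 0/1 flags as discrete categories (this changes the column dtype in df)
    df[cols] = df[cols].astype(str)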
    fig = px.scatter(df, x='Age', y='PremiumPrice', color=cols, title=f'Premium Price by Age and {cols}', 
                        color_discrete_map={"0": 'red', "1": 'green'})
    fig.update_layout(title_x=0.5)
    fig.show()

People who have had a transplant get higher premium prices.

fig = px.scatter(df, x='Age', y='PremiumPrice', size="NumberOfMajorSurgeries", color="NumberOfMajorSurgeries", 
                 title='Premium Price by Age and Number of Major Surgeries', color_discrete_map={0: 'red', 1: 'green', 2: "blue", 3: "yellow"})
fig.update_layout(title_x=0.5)
fig.show()

The people who have gone through 3 major surgeries are more than 60 years old, and their premium price is the same, 28K.
For people with 2 major surgeries, aged around 50 to 60, the premium price is also the same, 28K.

Statistical Analysis¶

Pearson = df[['Age', 'PremiumPrice', "BMI", "NumberOfMajorSurgeries"]].corr('pearson')
Spearman = df[['Age', 'PremiumPrice', "BMI", "NumberOfMajorSurgeries"]].corr('spearman')

fig = px.imshow(Pearson, 
            text_auto='.2f',  
            color_continuous_scale='RdYlBu_r', 
            title='Pearson Correlation Heatmap',
            aspect='auto')
fig.update_layout(title_x=0.5, width=800, height=600)
fig.show()  

fig = px.imshow(Spearman, 
                text_auto='.2f',  
                color_continuous_scale='RdYlBu_r', 
                title='Spearman Correlation Heatmap',
                aspect='auto')
fig.update_layout(title_x=0.5, width=800, height=600)
fig.show()

Age and premium price are highly correlated.
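Both coefficients are computed because they measure different things: Pearson captures linear association, while Spearman works on ranks and captures any monotonic relationship. The sketch below uses a small synthetic frame (the `demo` name and its values are illustrative, not taken from the dataset) where the relationship is perfectly monotonic but far from linear.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 21)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 3)})  # monotonic but strongly non-linear

pearson = demo["x"].corr(demo["y"], method="pearson")
spearman = demo["x"].corr(demo["y"], method="spearman")

# Spearman is 1 because the ranks agree perfectly; Pearson is lower because the trend is curved
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two heatmaps for a given pair of columns would therefore hint at a monotonic but non-linear relationship.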

num_cols = df.select_dtypes(include=['number']).columns.tolist()

n_cols = len(num_cols)
n_rows = (n_cols + 2) // 3 

fig, axes = plt.subplots(n_rows, 3, figsize=(15, 5*n_rows))
axes = axes.ravel()

for i, col in enumerate(num_cols):
    # line='s' fits the reference line to the sample mean and std; '45' only suits standardized data
    sm.qqplot(df[col], line='s', ax=axes[i])
    axes[i].set_title(f'Q-Q Plot of {col}')

for j in range(i+1, len(axes)):
    axes[j].set_visible(False)

plt.tight_layout()
plt.show()

for cols in num_cols:
    stat, p_value = shapiro(df[cols])
    print(f'Shapiro-Wilk Test for {cols}: Statistic={stat:.4f}, p-value={p_value:.4f}')

    if p_value > 0.05:
        print('Data appears to be normally distributed')
    else:
        print('Data does not appear to be normally distributed')
    print('_'*100)

Shapiro-Wilk Test for Age: Statistic=0.9589, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for Height: Statistic=0.9800, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for Weight: Statistic=0.9670, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for NumberOfMajorSurgeries: Statistic=0.7731, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for PremiumPrice: Statistic=0.9272, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for BMI: Statistic=0.9730, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________

From the Q-Q plots and the Shapiro-Wilk tests above, premium price, age, height, weight, and BMI are not normally distributed.

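# Log transform (strictly positive columns only; NumberOfMajorSurgeries contains zeros, and np.log(0) is -inf)
log_cols = ["PremiumPrice", "Age", "Height", "Weight", "BMI"]
for cols in log_cols:
    df[f'{cols}_log'] = np.log(df[cols])

num_cols_log = [f'{cols}_log' for cols in log_cols]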

for cols in num_cols_log:
    stat, p_value = shapiro(df[cols])
    print(f'Shapiro-Wilk Test for {cols}: Statistic={stat:.4f}, p-value={p_value:.4f}')

    if p_value > 0.05:
        print('Data appears to be normally distributed')
    else:
        print('Data does not appear to be normally distributed')
    print('_'*100)

Shapiro-Wilk Test for PremiumPrice_log: Statistic=0.8962, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for Age_log: Statistic=0.9451, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for Height_log: Statistic=0.9766, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for Weight_log: Statistic=0.9897, p-value=0.0000
Data does not appear to be normally distributed
____________________________________________________________________________________________________
Shapiro-Wilk Test for BMI_log: Statistic=0.9956, p-value=0.0065
Data does not appear to be normally distributed
____________________________________________________________________________________________________

After the log transform, the data are still not normally distributed, so we will use the original data without transformation.
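The log transform is a special case of the Box-Cox family, and `boxcox` is already imported at the top of the notebook although it is never called. The sketch below shows how it could be tried, using a synthetic strictly positive sample rather than the insurance data. It is not a claim that Box-Cox would normalise the notebook's columns; premiums that take only a few distinct values may stay non-normal under any monotonic transform.

```python
import numpy as np
from scipy.stats import boxcox, shapiro

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=3.0, sigma=0.5, size=500)  # synthetic, strictly positive

# boxcox picks lambda by maximum likelihood; lambda close to 0 behaves like a log transform
transformed, fitted_lambda = boxcox(sample)

_, p_before = shapiro(sample)
_, p_after = shapiro(transformed)
print(f"lambda = {fitted_lambda:.3f}")
print(f"Shapiro p-value before = {p_before:.4f}, after = {p_after:.4f}")
```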

# Since the data is not normally distributed, we will use non-parametric tests for the statistical Analysis

# Instead of t-test, we use Mann-Whitney U test

cols_grps = ['Diabetes', 'BloodPressureProblems', 'AnyTransplants', 'AnyChronicDiseases', 'KnownAllergies', 'HistoryOfCancerInFamily']

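for cols in cols_grps:
    # Cast to int first: if the flags are stored as strings, comparing them with 0/1 selects no rows and the p-value is nan
    flag = df[cols].astype(int)
    group1 = df.loc[flag == 0, 'PremiumPrice']
    group2 = df.loc[flag == 1, 'PremiumPrice']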

    statistic, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
    print(f'{cols} vs Premium price')
    print(f'  Mann-Whitney U Test: p-value = {p_value:.4f}')

    if p_value < 0.05:
        print(f'  Significant difference in Premium Price by {cols}')
    else:
        print(f'  No significant difference in Premium Price by {cols}')
    print('_' * 100)

Diabetes vs Premium price
  Mann-Whitney U Test: p-value = nan
  No significant difference in Premium Price by Diabetes
____________________________________________________________________________________________________
BloodPressureProblems vs Premium price
  Mann-Whitney U Test: p-value = nan
  No significant difference in Premium Price by BloodPressureProblems
____________________________________________________________________________________________________
AnyTransplants vs Premium price
  Mann-Whitney U Test: p-value = nan
  No significant difference in Premium Price by AnyTransplants
____________________________________________________________________________________________________
AnyChronicDiseases vs Premium price
  Mann-Whitney U Test: p-value = nan
  No significant difference in Premium Price by AnyChronicDiseases
____________________________________________________________________________________________________
KnownAllergies vs Premium price
  Mann-Whitney U Test: p-value = nan
  No significant difference in Premium Price by KnownAllergies
____________________________________________________________________________________________________
HistoryOfCancerInFamily vs Premium price
  Mann-Whitney U Test: p-value = nan
  No significant difference in Premium Price by HistoryOfCancerInFamily
____________________________________________________________________________________________________

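The nan p-values in the recorded output come from empty groups: the flag columns had been cast to strings in the plotting loop, so comparing them with the integers 0 and 1 selected no rows, and the printed "No significant difference" lines are not a valid result. With the flags compared as integers, the test is expected to show a significant difference in Premium Price for Diabetes, Blood Pressure Problems, Transplants, Chronic Diseases, and History of Cancer in Family, but not for Known Allergies, in line with the chi-square results further down.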
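The effect of column dtype on the group selection can be reproduced on a toy frame. The sketch below uses made-up values (the `demo` frame is illustrative only) to show that a flag column cast with `astype(str)`, as in the plotting loop, never equals the integer 0, and that casting back to int restores the two groups for the Mann-Whitney U test.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Made-up data: four people without the condition, four with it
demo = pd.DataFrame({
    "flag": [0, 0, 0, 0, 1, 1, 1, 1],
    "premium": [15000, 21000, 23000, 24000, 25000, 28000, 30000, 38000],
})

as_str = demo["flag"].astype(str)
n_matches = int((as_str == 0).sum())
print(n_matches)  # 0, because the string "0" is not equal to the integer 0

# Cast back to int before splitting into groups
flag = as_str.astype(int)
group0 = demo.loc[flag == 0, "premium"]
group1 = demo.loc[flag == 1, "premium"]

stat, p_value = mannwhitneyu(group0, group1, alternative="two-sided")
print(f"U = {stat}, p-value = {p_value:.4f}")
```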

df['AgeGroup'] = pd.cut(df['Age'], bins=[0,30,50,100], labels=['Young','Middle','Senior'])

from scipy.stats import kruskal

def grp(feature, df):
    # One array of premiums per category; observed=True skips empty categories
    groups = [group['PremiumPrice'].values for name, group in df.groupby(feature, observed=True)]
    stat, p = kruskal(*groups)
    print(f"Kruskal-Wallis Test for {feature}: H-stat =", stat, " p-value =", p)

    if p < 0.05:
        print(f'At least one {feature} group has a different median Premium Price')
    else:
        print('No difference')

grp('AgeGroup', df)
grp('BMI_Category', df)

Kruskal-Wallis Test for AgeGroup: H-stat = 531.5472968196424  p-value = 3.766786977570815e-116
At least one AgeGroup group has a different median Premium Price
Kruskal-Wallis Test for BMI_Category: H-stat = 10.842646641302649  p-value = 0.012607896820871429
At least one BMI_Category group has a different median Premium Price

From the above analysis, premium price differs significantly across both the BMI categories and the age groups.
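Kruskal-Wallis only says that at least one group differs, not which one. A common follow-up, which the notebook does not perform, is pairwise Mann-Whitney U tests with a Bonferroni-adjusted threshold. The sketch below runs on synthetic groups; the group names echo `AgeGroup`, but the numbers are made up.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(1)
groups = {  # synthetic premiums per group, for illustration only
    "Young": rng.normal(16000, 2000, 60),
    "Middle": rng.normal(24000, 3000, 60),
    "Senior": rng.normal(29000, 3000, 60),
}

h_stat, p_overall = kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_overall:.3g}")

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: divide the threshold by the number of comparisons

pairwise = {}
for a, b in pairs:
    _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    pairwise[(a, b)] = p
    verdict = "different" if p < alpha else "not distinguishable"
    print(f"{a} vs {b}: p = {p:.3g} ({verdict} at alpha = {alpha:.4f})")
```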

df['PremiumPrice_Category'] = pd.cut(df['PremiumPrice'], 
                                    bins=3, 
                                    labels=['Low', 'Medium', 'High'])

cat_cols = ['Diabetes', 'BloodPressureProblems', 'AnyTransplants', 'AnyChronicDiseases', 'KnownAllergies',
                'HistoryOfCancerInFamily', 'BMI_Category', 'AgeGroup']

for cols in cat_cols:
    contingency_table = pd.crosstab(df[cols], df['PremiumPrice_Category'])
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)

    print(f'Chi-Square Test: {cols} vs Premium Price Category')
    print(f'Chi2 statistic: {chi2:.4f}')
    print(f'p-value: {p_value:.4f}')
    print(f'Degrees of freedom: {dof}')

    if p_value < 0.05:
        print(f'There is a significant association between {cols} and Premium Price Category')
    else:
        print('No significant association found')
    
    print('_'*100)

Chi-Square Test: Diabetes vs Premium Price Category
Chi2 statistic: 20.4125
p-value: 0.0000
Degrees of freedom: 2
There is a significant association between Diabetes and Premium Price Category
____________________________________________________________________________________________________
Chi-Square Test: BloodPressureProblems vs Premium Price Category
Chi2 statistic: 48.4761
p-value: 0.0000
Degrees of freedom: 2
There is a significant association between BloodPressureProblems and Premium Price Category
____________________________________________________________________________________________________
Chi-Square Test: AnyTransplants vs Premium Price Category
Chi2 statistic: 226.0377
p-value: 0.0000
Degrees of freedom: 2
There is a significant association between AnyTransplants and Premium Price Category
____________________________________________________________________________________________________
Chi-Square Test: AnyChronicDiseases vs Premium Price Category
Chi2 statistic: 32.3678
p-value: 0.0000
Degrees of freedom: 2
There is a significant association between AnyChronicDiseases and Premium Price Category
____________________________________________________________________________________________________
Chi-Square Test: KnownAllergies vs Premium Price Category
Chi2 statistic: 1.2943
p-value: 0.5235
Degrees of freedom: 2
No significant association found
____________________________________________________________________________________________________
Chi-Square Test: HistoryOfCancerInFamily vs Premium Price Category
Chi2 statistic: 28.4032
p-value: 0.0000
Degrees of freedom: 2
There is a significant association between HistoryOfCancerInFamily and Premium Price Category
____________________________________________________________________________________________________
Chi-Square Test: BMI_Category vs Premium Price Category
Chi2 statistic: 40.4440
p-value: 0.0000
Degrees of freedom: 6
There is a significant association between BMI_Category and Premium Price Category
____________________________________________________________________________________________________
Chi-Square Test: AgeGroup vs Premium Price Category
Chi2 statistic: 514.0678
p-value: 0.0000
Degrees of freedom: 4
There is a significant association between AgeGroup and Premium Price Category
____________________________________________________________________________________________________

Every factor except Known Allergies has a significant association with the premium price category.
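With 986 rows, even weak associations can reach significance, so an effect size helps rank the factors. The sketch below computes Cramér's V from the chi-square statistics printed above, assuming each contingency table covers all 986 rows. The helper name `cramers_v` is illustrative and the notebook itself does not compute this measure.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

N = 986  # assumed number of rows in every contingency table

# (chi2 statistic, rows, cols) copied from the recorded chi-square output
reported = {
    "AnyTransplants": (226.0377, 2, 3),
    "KnownAllergies": (1.2943, 2, 3),
    "AgeGroup": (514.0678, 3, 3),
}

effect_sizes = {name: cramers_v(chi2, N, r, c) for name, (chi2, r, c) in reported.items()}
for name, v in effect_sizes.items():
    print(f"{name}: V = {v:.3f}")
```

Values near 0 indicate almost no association, which is consistent with the non-significant result for Known Allergies.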

# We can drop the unnecessary columns created during the analysis, before the ML model development.
df.drop(df.columns[12:], axis=1, inplace = True)

ML Model¶

ML preprocessing¶

X = df.drop(columns=['PremiumPrice'])
Y = df['PremiumPrice']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Training & Evaluation¶

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42)
}

results = {}

for name, model in models.items():
    # Linear models get the scaled features; the names must match the keys of `models` exactly
    if name in ["Linear Regression", "Ridge", "Lasso"]:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    r2 = round(r2_score(y_test, y_pred), 4)
    rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)), 4)
    mae = round(mean_absolute_error(y_test, y_pred), 4)
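    # Adjusted R2 penalises R2 for the number of predictors k. Note that n is the full dataset size here;
    # len(y_test) would be the stricter choice because r2 is computed on the test set only.
    n = len(Y)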
    k = X.shape[1]
    adj_r2 = round(1 - ((1 - r2) * (n - 1) / (n - k - 1)), 4)
    results[name] = {"R2": r2, "Adj_R2":adj_r2, "RMSE": rmse, "MAE": mae}

results_df = pd.DataFrame(results).T
print("\n📊 Model Performance:\n", results_df)

📊 Model Performance:
                        R2  Adj_R2       RMSE        MAE
Linear Regression  0.7136  0.7104  3494.4047  2586.1486
Ridge              0.7134  0.7102  3495.6605  2587.1945
Lasso              0.7136  0.7104  3494.4117  2586.1526
Random Forest      0.8948  0.8936  2118.2476  1002.6768
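The Adj_R2 column follows the usual formula 1 - (1 - R2)(n - 1)/(n - k - 1). The sketch below reproduces the Linear Regression and Random Forest rows from the table with n = 986 and k = 11, and shows how the value would change if n were the test-set size instead. The figure of 198 test rows assumes a 20% split of 986 rows, and the helper name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 for a model with n_features predictors scored on n_samples rows."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# n = 986 (full dataset) and k = 11 features, as in the training loop
print(round(adjusted_r2(0.7136, 986, 11), 4))  # 0.7104, Linear Regression row
print(round(adjusted_r2(0.8948, 986, 11), 4))  # 0.8936, Random Forest row

# Same R2 with n set to an assumed test-set size of 198 rows: the penalty is larger
print(round(adjusted_r2(0.7136, 198, 11), 4))
```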

# For Tableau: fit a linear surrogate that approximates the Random Forest's predictions
surrogate = LinearRegression()
surrogate.fit(X_train, models["Random Forest"].predict(X_train))

print("Coefficients:", surrogate.coef_)
print("Intercept:", surrogate.intercept_)
print(X_train.columns)

Coefficients: [ 319.73200523 -313.54682517  177.36472489 7185.68555128 2436.2806527
  -40.2033987   102.75248781  178.53092951 2075.44990058 -631.07811845
  -96.76452111]
Intercept: 11794.419215082264
Index(['Age', 'Diabetes', 'BloodPressureProblems', 'AnyTransplants',
       'AnyChronicDiseases', 'Height', 'Weight', 'KnownAllergies',
       'HistoryOfCancerInFamily', 'NumberOfMajorSurgeries', 'BMI'],
      dtype='object')
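A linear surrogate is convenient for Tableau because its prediction is just the intercept plus a weighted sum of the inputs, which can be written as a calculated field. The sketch below applies the printed coefficients to a hypothetical person (all values in `person` are made up for illustration) to show the arithmetic.

```python
import numpy as np

# Feature order and coefficients copied from the printed output above
features = ['Age', 'Diabetes', 'BloodPressureProblems', 'AnyTransplants',
            'AnyChronicDiseases', 'Height', 'Weight', 'KnownAllergies',
            'HistoryOfCancerInFamily', 'NumberOfMajorSurgeries', 'BMI']
coef = np.array([319.73200523, -313.54682517, 177.36472489, 7185.68555128, 2436.2806527,
                 -40.2033987, 102.75248781, 178.53092951, 2075.44990058, -631.07811845,
                 -96.76452111])
intercept = 11794.419215082264

# Hypothetical person: 40 years old, 170 cm, 70 kg, no conditions or surgeries
person = {name: 0 for name in features}
person.update({'Age': 40, 'Height': 170, 'Weight': 70})
person['BMI'] = round(person['Weight'] / (person['Height'] / 100) ** 2, 2)

x = np.array([person[name] for name in features], dtype=float)
estimate = intercept + float(coef @ x)  # intercept + sum of coefficient * value
print(f"Surrogate premium estimate: {estimate:.2f}")
```

The surrogate only approximates the Random Forest, so its estimates can differ from the model's own predictions.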

results_df.reset_index(inplace = True)

sns.lineplot(data=results_df, x="index", y="R2", marker='o', label="R2")
sns.lineplot(data=results_df, x="index", y="Adj_R2", marker='o', label="Adj_R2")

plt.xticks(rotation=45)
plt.title("Model vs R2 & Adjusted R2")
plt.ylabel("Score")
plt.tight_layout()
plt.show()

sns.lineplot(data=results_df, x="index", y="MAE", marker='o', label="MAE")
sns.lineplot(data=results_df, x="index", y="RMSE", marker='o', label="RMSE")

plt.xticks(rotation=45)
plt.title("Model vs MAE & RMSE")
plt.ylabel("Score")
plt.tight_layout()
plt.show()

From the above analysis, Random Forest gives the best scores on every metric (R2 0.8948, Adj_R2 0.8936, RMSE 2118.25, MAE 1002.68), so we will use this model for deployment.
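The comparison above rests on a single train/test split. `KFold` and `cross_val_score` are imported at the top of the notebook but never used; the sketch below shows how they could check that a ranking is stable across folds. It runs on synthetic regression data from `make_regression`, not the insurance data, so the scores say nothing about the real models. Wrapping the scaler in a pipeline is a design choice made here so that scaling is refit inside each fold.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be X and Y
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```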

import joblib

# Same hyperparameters as the Random Forest evaluated above
final_model = RandomForestRegressor(n_estimators=200, random_state=42)
final_model.fit(X_train, y_train)

joblib.dump(final_model, "premium_model.pkl")

['premium_model.pkl']
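To serve predictions from user input, the saved file has to be loaded and fed one row with the same columns, in the same order, as the training data. The sketch below shows those mechanics. So that it runs on its own, it trains a small stand-in model on random numbers and saves it to a temporary directory; the `user` values are hypothetical, and the printed prediction is meaningless.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Feature order taken from the X_train.columns output
FEATURES = ['Age', 'Diabetes', 'BloodPressureProblems', 'AnyTransplants',
            'AnyChronicDiseases', 'Height', 'Weight', 'KnownAllergies',
            'HistoryOfCancerInFamily', 'NumberOfMajorSurgeries', 'BMI']

# Stand-in model fitted on random numbers, only to make the example self-contained
rng = np.random.default_rng(0)
X_fake = pd.DataFrame(rng.random((60, len(FEATURES))), columns=FEATURES)
y_fake = 15000 + 25000 * X_fake['Age']
stand_in = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_fake, y_fake)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "premium_model.pkl")
    joblib.dump(stand_in, path)
    loaded = joblib.load(path)  # in deployment: joblib.load("premium_model.pkl")

# Hypothetical user input; BMI is derived the same way as in the notebook
user = {'Age': 45, 'Diabetes': 0, 'BloodPressureProblems': 0, 'AnyTransplants': 0,
        'AnyChronicDiseases': 0, 'Height': 155, 'Weight': 57, 'KnownAllergies': 0,
        'HistoryOfCancerInFamily': 0, 'NumberOfMajorSurgeries': 0}
user['BMI'] = round(user['Weight'] / (user['Height'] / 100) ** 2, 2)

row = pd.DataFrame([user])[FEATURES]  # enforce the training column order
prediction = loaded.predict(row)
print(prediction.shape)  # (1,)
```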

Overall Insights 💡¶

👉 Older people who are obese or overweight tend to get higher premium prices.

👉 Younger people who are of normal weight or underweight tend to get lower premium prices.

👉 People who have had a transplant get higher premium prices.

👉 The people who have gone through 3 major surgeries are more than 60 years old, and their premium price is the same, 28K.

👉 For people with 2 major surgeries, aged around 50 to 60, the premium price is also the same, 28K.

👉 Age and premium price are highly correlated.

👉 Per the statistical tests, every health flag except Known Allergies (Diabetes, Blood Pressure Problems, Transplants, Chronic Diseases, and History of Cancer in Family) is significantly associated with Premium Price, which suggests these are factors affecting the premium price.

👉 Premium price differs significantly across both the BMI categories and the age groups.

👉 In the chi-square tests, Known Allergies is the only factor without a significant association with the premium price category (p-value 0.5235).

Links¶

Output of df.describe():
	Age	Diabetes	BloodPressureProblems	AnyTransplants	AnyChronicDiseases	Height	Weight	KnownAllergies	HistoryOfCancerInFamily	NumberOfMajorSurgeries	PremiumPrice
count	986.000000	986.000000	986.000000	986.000000	986.000000	986.000000	986.000000	986.000000	986.000000	986.000000	986.000000
mean	41.745436	0.419878	0.468560	0.055781	0.180527	168.182556	76.950304	0.215010	0.117647	0.667343	24336.713996
std	13.963371	0.493789	0.499264	0.229615	0.384821	10.098155	14.265096	0.411038	0.322353	0.749205	6248.184382
min	18.000000	0.000000	0.000000	0.000000	0.000000	145.000000	51.000000	0.000000	0.000000	0.000000	15000.000000
25%	30.000000	0.000000	0.000000	0.000000	0.000000	161.000000	67.000000	0.000000	0.000000	0.000000	21000.000000
50%	42.000000	0.000000	0.000000	0.000000	0.000000	168.000000	75.000000	0.000000	0.000000	1.000000	23000.000000
75%	53.000000	1.000000	1.000000	0.000000	0.000000	176.000000	87.000000	0.000000	0.000000	1.000000	28000.000000
max	66.000000	1.000000	1.000000	1.000000	1.000000	188.000000	132.000000	1.000000	1.000000	3.000000	40000.000000

Output of df.head():
	Age	Diabetes	BloodPressureProblems	AnyChronicDiseases	Height	Weight	NumberOfMajorSurgeries	PremiumPrice
0	45	0	0	0	155	57	0	25000
1	60	1	0	0	180	73	0	29000
2	36	1	1	0	158	59	1	23000
3	52	1	1	1	183	93	2	28000
4	38	0	0	1	166	88	1	23000