Medical Insurance Cost Analysis

Predicting insurance charges using machine learning

Demo Link:
GitHub Repo URL:

The Challenge

Medical insurance costs vary significantly due to multiple factors like age, BMI, smoking habits, and region. Traditional estimation methods lack data-driven insights, making it difficult for individuals and insurers to predict charges accurately.

The Solution

Developed a machine learning model to predict insurance costs based on patient demographics and lifestyle factors. Conducted thorough data preprocessing, exploratory data analysis, and feature engineering. Achieved an R-squared score of 0.84 using ridge regression with polynomial features, significantly improving prediction accuracy.

Tech Mastery Showcase

Python

Used for data processing, modeling, and visualization.

Pandas & NumPy

Handled data cleaning, wrangling, and feature engineering.

Matplotlib & Seaborn

Created visualizations to analyze patterns in insurance charges.

Scikit-learn

Implemented regression models and performed hyperparameter tuning.

SciPy

Applied statistical techniques for feature selection and optimization.

Innovative Logic & Implementation

Data Cleaning & Preprocessing

Handled missing values, converted categorical variables, and engineered new features to enhance model performance.

def preprocess_data(df):
    df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})
    df = pd.get_dummies(df, columns=['region'], drop_first=True)
    df['BMI_Category'] = df['bmi'].apply(lambda x: 'High' if x > 30 else 'Normal')
    return df

Exploratory Data Analysis (EDA)

Conducted statistical analysis and visualized trends affecting insurance costs, such as age, BMI, and smoking status.

sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Impact of Smoking on Insurance Charges')

Regression Modeling & Optimization

Built multiple regression models, experimented with polynomial features, and applied ridge regularization for optimal accuracy.

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

Overcoming Challenges

Handling Non-Linear Relationships

Linear regression struggled with non-linear interactions among features like BMI and smoking status.

Solution:

Applied polynomial regression and feature transformations to capture complex patterns.
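A minimal sketch of how polynomial feature expansion can capture an interaction like BMI × smoking status. The data here is synthetic (the project's actual dataset isn't shown), and the pipeline composition is an assumption about the approach described above:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.linear_model import Ridge

    # Synthetic stand-in for [age, bmi, smoker] features
    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.integers(18, 65, 200),   # age
        rng.normal(28, 5, 200),      # bmi
        rng.integers(0, 2, 200),     # smoker (0/1)
    ]).astype(float)
    # Simulated charges with a non-linear bmi-smoker interaction
    y = (250 * X[:, 0] + 300 * X[:, 1]
         + 20000 * X[:, 2] * (X[:, 1] > 30)
         + rng.normal(0, 1000, 200))

    # Degree-2 expansion adds squared terms and pairwise interactions
    # (e.g. bmi * smoker), which a plain linear model cannot represent
    model = make_pipeline(StandardScaler(),
                          PolynomialFeatures(degree=2),
                          Ridge(alpha=1.0))
    model.fit(X, y)
    print(model.score(X, y))

Scaling before the polynomial expansion keeps the squared and interaction terms on comparable magnitudes, which matters once ridge penalizes coefficients uniformly.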

Preventing Overfitting

High-dimensional feature expansion led to overfitting on training data.

Solution:

Used ridge regression with cross-validation to balance model complexity and generalization.
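One way to combine ridge regularization with cross-validation is scikit-learn's RidgeCV, which scores each candidate alpha by cross-validation and keeps the best. The alpha grid below is illustrative, not the project's actual search space:

    import numpy as np
    from sklearn.linear_model import RidgeCV

    # Synthetic data: 5 features, 2 of them irrelevant
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(0, 0.5, 100)

    # 5-fold CV over a log-spaced alpha grid; stronger alpha shrinks
    # coefficients harder, trading variance for bias
    model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
    model.fit(X, y)
    print(model.alpha_)  # the alpha chosen by cross-validation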

Feature Selection & Engineering

Selecting the most relevant features for model training was crucial for performance.

Solution:

Performed correlation analysis and used Lasso regression to identify key predictive variables.
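Lasso-based selection works because the L1 penalty drives the coefficients of uninformative features to exactly zero. A small sketch with synthetic data (the alpha value and feature count are placeholders, not the project's settings):

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    # Synthetic data: only the first two columns drive the target
    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 6))
    y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.3, 200)

    # Standardize first so the L1 penalty treats features comparably
    lasso = Lasso(alpha=0.1)
    lasso.fit(StandardScaler().fit_transform(X), y)

    # Features with non-zero coefficients are the "selected" ones
    selected = [i for i, c in enumerate(lasso.coef_) if abs(c) > 1e-6]
    print(selected)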

Key Learnings & Growth

  • 🚀 Gained experience in full-cycle data preprocessing, EDA, and modeling.

  • 🚀 Enhanced feature engineering techniques to improve regression performance.

  • 🚀 Applied hyperparameter tuning to optimize model accuracy.

  • 🚀 Strengthened skills in handling non-linear relationships in structured data.