Medical Provider Fraud Prediction

Introduction

Healthcare fraud is a major issue that costs billions of dollars each year. This problem includes fraudulent billing practices by some medical providers, leading to increased healthcare expenses and tarnishing the reputation of honest providers. Recognizing the scale and impact of this issue is crucial for maintaining trust in the healthcare system.

Objective

The objective of this project is to predict which medical providers are potentially fraudulent based on their claims, using inpatient, outpatient, and beneficiary data, and to identify the features that are most indicative of fraud.

Data Overview

The Centers for Medicare and Medicaid Services (CMS) provides key information about health services. This includes data on hospital admissions (Inpatient Data), services received without a hospital stay (Outpatient Data), and personal details of Medicare and Medicaid beneficiaries, such as age and program eligibility (Beneficiary Details Data). This information helps us understand who uses these services, what kind of care they receive, and how these programs can be improved.

Methodology

Data Preparation

Inpatient (hospital admission) and outpatient visit records are merged on their matching fields. Beneficiary information is then joined to this data on beneficiary IDs. Finally, healthcare provider details are joined on provider IDs, producing a single comprehensive dataset for analysis and modeling.

Figure 1: Data Preparation
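The three merge steps described in this section can be sketched with pandas. This is a minimal illustration on toy data, not the project's actual code: the column names (`ClaimID`, `BeneID`, `Provider`, `InscClaimAmtReimbursed`, `AdmissionDt`, `DOB`, `PotentialFraud`) and the choice of an outer join on shared columns are assumptions.

```python
import pandas as pd

# Toy stand-ins for the four tables. All column names and values are
# hypothetical; the real CMS-derived files may differ.
inpatient = pd.DataFrame({
    "ClaimID": ["C1", "C2"],
    "BeneID": ["B1", "B2"],
    "Provider": ["P1", "P2"],
    "InscClaimAmtReimbursed": [12000, 8000],
    "AdmissionDt": ["2009-01-05", "2009-02-10"],
})
outpatient = pd.DataFrame({
    "ClaimID": ["C3"],
    "BeneID": ["B1"],
    "Provider": ["P2"],
    "InscClaimAmtReimbursed": [300],
})
beneficiary = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "DOB": ["1940-03-01", "1952-07-15"],
})
providers = pd.DataFrame({
    "Provider": ["P1", "P2"],
    "PotentialFraud": ["Yes", "No"],
})

# Step 1: combine inpatient and outpatient claims on the columns they share.
# An outer join keeps every claim from both tables (assumed join type).
shared = [c for c in inpatient.columns if c in outpatient.columns]
claims = inpatient.merge(outpatient, on=shared, how="outer")

# Step 2: attach beneficiary details by beneficiary ID.
claims = claims.merge(beneficiary, on="BeneID", how="left")

# Step 3: attach provider-level information by provider ID.
claims = claims.merge(providers, on="Provider", how="left")

print(claims)
```

Outpatient rows have no value for inpatient-only columns such as `AdmissionDt`, so those cells are left missing after the first merge.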

Data Preprocessing

During preprocessing, patient ages are calculated from birth dates, and claim durations and hospital stay lengths are derived so that trends can be analyzed. The data is then normalized, the most relevant features are selected to improve model accuracy, and categorical variables are encoded as numbers.
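A minimal sketch of these preprocessing steps in pandas follows. The column names, the use of the claim start date as the reference point for age, min-max scaling as the normalization, and integer category codes as the encoding are all assumptions; the text does not say which variants were used. Feature selection is not shown.

```python
import pandas as pd

# Toy claims table; column names and values are hypothetical.
claims = pd.DataFrame({
    "DOB": ["1940-03-01", "1952-07-15", "1935-11-30"],
    "ClaimStartDt": ["2009-01-05", "2009-02-10", "2009-03-01"],
    "ClaimEndDt": ["2009-01-09", "2009-02-10", "2009-03-15"],
    "Gender": ["F", "M", "F"],
    "InscClaimAmtReimbursed": [12000.0, 300.0, 8000.0],
})

# Parse date strings into datetime values.
for col in ["DOB", "ClaimStartDt", "ClaimEndDt"]:
    claims[col] = pd.to_datetime(claims[col])

# Approximate age in whole years at the start of the claim.
age_days = (claims["ClaimStartDt"] - claims["DOB"]).dt.days
claims["Age"] = (age_days // 365.25).astype(int)

# Claim duration in days (0 means the claim started and ended the same day).
claims["ClaimDays"] = (claims["ClaimEndDt"] - claims["ClaimStartDt"]).dt.days

# Min-max normalization of the reimbursed amount to the range [0, 1].
amount = claims["InscClaimAmtReimbursed"]
claims["ReimbursedScaled"] = (amount - amount.min()) / (amount.max() - amount.min())

# Encode a categorical column as integer codes.
claims["GenderCode"] = claims["Gender"].astype("category").cat.codes

print(claims[["Age", "ClaimDays", "ReimbursedScaled", "GenderCode"]])
```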

Exploratory Data Analysis

Figure 2: Top 10 Procedure and Diagnosis Codes Involved in Healthcare Fraud

Panels: Top 10 Medical Procedures; Top 10 Medical Diagnoses.

Bar charts showing the top 10 medical procedure and diagnosis codes with their counts and potential fraud indicators.

Figure 3: Impact of Physicians and Chronic Conditions on Healthcare Fraud

Panels: Top 20 Providers; Chronic Conditions Impact.

A bar chart and a pie chart show potential healthcare fraud by provider and the link between chronic conditions and fraud cases.

Figure 4: Analysis of Hospital Stay Duration and Claim Costs in Relation to Healthcare Fraud Indicators

Panels: Duration Impact on Fraud; Claim Cost Impact on Fraud.

The scatter plot shows that longer hospital stays correlate with longer claim periods, with points marked for potential fraud. The bar chart shows that most claims are low-cost, with some claims flagged as fraudulent across all cost ranges.

Figure 5: Top 20 Predictive Features for Healthcare Fraud Detection in Claims Data

Diagnosis codes and reimbursement amounts are key predictors of fraud in healthcare claims.
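One common way to produce a ranking like the one in Figure 5 is to read the impurity-based importances from a fitted random forest. The text does not state which method was used, so the sketch below is only an illustration on synthetic data with placeholder feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared claims features (not the project's data).
X_arr, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]  # placeholders
X = pd.DataFrame(X_arr, columns=feature_names)

# Fit a random forest and read its impurity-based feature importances.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
importances = pd.Series(forest.feature_importances_, index=feature_names)

# Rank features from most to least important and keep the top 5.
top_features = importances.sort_values(ascending=False).head(5)
print(top_features)
```

Impurity-based importances sum to 1 across all features, so each value can be read as a share of the model's total split quality.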

Model Building

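We implemented several models. The machine learning models are Logistic Regression, Decision Tree, Support Vector Classifier, Naive Bayes, and Random Forest. The deep learning models are LSTM, LSTM+AE, CNN, and RNN.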
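The classical models in this project can be trained and compared with scikit-learn along the following lines. This is a sketch on synthetic data: the dataset, the train/test split, and every hyperparameter are assumptions rather than the project's settings, so the printed scores will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared claims features (not the project's data).
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hyperparameters are illustrative defaults, not the values used in the project.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Support Vector Classifier": SVC(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in accuracy.items():
    print(f"{name}: {score:.3f}")
```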

Results

Machine Learning Models

Figure 6: Comparative Performance Analysis of Machine Learning Models for Fraud Detection

Panels: AUC for LR (Logistic Regression), SVC (Support Vector Classifier), DT (Decision Tree), and NB (Naive Bayes).

The Decision Tree (DT) model outperforms the other three models in Figure 6, with higher Area Under the Curve (AUC) scores in both training and testing.
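Training and testing AUC for a single model can be computed as sketched below. The data is synthetic and the tree depth is an arbitrary illustrative value, so this shows only the mechanics of the comparison, not the project's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared claims features (not the project's data).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# max_depth=4 is an assumed setting that limits overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class.
train_auc = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])

print(f"train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
```

A large gap between the two scores suggests overfitting, which is why both training and testing AUC are worth reporting.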

Deep Learning Models

Figure 7: Performance Comparison of Deep Learning Models on Training and Validation Sets

Panels: LSTM (Long Short-Term Memory), LSTM+AE (Long Short-Term Memory + Autoencoder), CNN (Convolutional Neural Network), and RNN (Recurrent Neural Network).

LSTM and LSTM with Autoencoder show similar stability in accuracy, while CNN and RNN models display consistent validation performance.

Results Comparison

ML Models Accuracy

| ML Models | Accuracy |
| --- | --- |
| Logistic Regression | 0.728 |
| Decision Tree | 0.726 |
| Support Vector Classifier | 0.729 |
| Naive Bayes | 0.406 |
| Random Forest | 0.742 |

DL Models Accuracy

| DL Models | Accuracy |
| --- | --- |
| LSTM | 0.601 |
| LSTM+AE | 0.619 |
| CNN | 0.570 |
| RNN | 0.409 |

Random Forest achieves the highest accuracy of all models (0.742), while LSTM+AE leads the deep learning category (0.619).

Conclusion and Future Work

Conclusion

1. Identified different types of fraud through data analysis.

2. Random Forest shows the best performance, with the highest accuracy (0.742).

3. DL models, led by LSTM+AE, show promise in complex fraud pattern recognition.

4. Reimbursement amounts are key indicators of potential fraud, and diagnosis codes correlate strongly with fraudulent activity.

Future Work

1. Collaborate with healthcare professionals to uncover new fraud typologies.

2. Experiment with various data encoding methods to boost model precision.

3. Enrich datasets with comprehensive profiles of healthcare providers.

4. Build instant fraud alert systems for healthcare transaction monitoring.

5. Train models to differentiate between fraud and legitimate billing anomalies.