Healthcare fraud is a major issue that costs billions of dollars each year. This problem includes fraudulent billing practices by some medical providers, leading to increased healthcare expenses and tarnishing the reputation of honest providers. Recognizing the scale and impact of this issue is crucial for maintaining trust in the healthcare system.
The Centers for Medicare and Medicaid Services (CMS) provides key data about health services. This includes records of hospital stays (Inpatient Data), services received without a hospital stay (Outpatient Data), and personal details about the people who receive Medicare and Medicaid, such as age and program eligibility (Beneficiary Details Data). This information helps us understand who uses these services, what kind of care they get, and how to make these programs better for everyone.
Hospital admission and outpatient visit records are merged on matching elements. Beneficiary information is then aligned with this data via individual IDs. In the final step, healthcare provider details are incorporated using provider IDs, creating a comprehensive dataset for insights.
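The three merge steps above can be sketched with pandas. This is a minimal illustration on toy data, not the project's actual pipeline: the table contents and every column name (`ClaimID`, `BeneID`, `Provider`, `AdmissionDt`, `InscClaimAmtReimbursed`, `DOB`, `PotentialFraud`) are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the CMS tables. All column names and values are
# assumptions for illustration, not taken from the project.
inpatient = pd.DataFrame({
    "ClaimID": ["C1", "C2"],
    "BeneID": ["B1", "B2"],
    "Provider": ["P1", "P2"],
    "AdmissionDt": ["2009-01-01", "2009-02-10"],
    "InscClaimAmtReimbursed": [5000, 12000],
})
outpatient = pd.DataFrame({
    "ClaimID": ["C3"],
    "BeneID": ["B1"],
    "Provider": ["P2"],
    "InscClaimAmtReimbursed": [200],
})
beneficiary = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "DOB": ["1940-05-01", "1955-11-23"],
})
providers = pd.DataFrame({
    "Provider": ["P1", "P2"],
    "PotentialFraud": ["No", "Yes"],
})

# Step 1: combine inpatient and outpatient claims on the columns they share.
shared = [c for c in inpatient.columns if c in outpatient.columns]
claims = inpatient.merge(outpatient, on=shared, how="outer")

# Step 2: attach beneficiary details by beneficiary ID.
claims = claims.merge(beneficiary, on="BeneID", how="left")

# Step 3: attach provider-level details by provider ID.
full = claims.merge(providers, on="Provider", how="left")

print(full)
```

An outer merge keeps every claim from both sources; columns that exist only for inpatient claims (here `AdmissionDt`) are left empty for outpatient rows.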
In data preprocessing, patient ages are calculated from birthdates, and claim durations and hospital stay lengths are analyzed for trends. The data is then normalized, relevant features are selected to improve model accuracy, and categorical variables are encoded as numbers.
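A minimal sketch of these preprocessing steps, assuming pandas and hypothetical column names (`DOB`, `AdmissionDt`, `DischargeDt`, `InscClaimAmtReimbursed`, `Gender`). Using the admission date as the reference point for age, min-max scaling as the normalization, and one-hot encoding for categories are all assumptions for illustration; the project may have used different choices.

```python
import pandas as pd

# Toy claims table; column names and values are illustrative assumptions.
claims = pd.DataFrame({
    "DOB": ["1940-05-01", "1955-11-23", "1932-03-15"],
    "AdmissionDt": ["2009-01-01", "2009-02-10", "2009-03-01"],
    "DischargeDt": ["2009-01-05", "2009-02-11", "2009-03-11"],
    "InscClaimAmtReimbursed": [5000.0, 12000.0, 800.0],
    "Gender": ["F", "M", "F"],
})
for col in ["DOB", "AdmissionDt", "DischargeDt"]:
    claims[col] = pd.to_datetime(claims[col])

# Age in completed years at admission: subtract one year if the
# birthday has not yet occurred in the admission year.
adm, dob = claims["AdmissionDt"], claims["DOB"]
before_birthday = (adm.dt.month < dob.dt.month) | (
    (adm.dt.month == dob.dt.month) & (adm.dt.day < dob.dt.day)
)
claims["Age"] = adm.dt.year - dob.dt.year - before_birthday.astype(int)

# Length of hospital stay in days.
claims["StayDays"] = (claims["DischargeDt"] - claims["AdmissionDt"]).dt.days

# Min-max normalization of the reimbursement amount to the range [0, 1].
amt = claims["InscClaimAmtReimbursed"]
claims["ReimbursedNorm"] = (amt - amt.min()) / (amt.max() - amt.min())

# Turn the categorical column into numeric indicator columns.
features = pd.get_dummies(claims, columns=["Gender"], dtype=int)

print(features[["Age", "StayDays", "ReimbursedNorm", "Gender_F", "Gender_M"]])
```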
Bar charts showing the top 10 medical procedure codes and diagnosis codes, with their counts and potential fraud indicators.
Bar chart and pie chart display potential healthcare fraud by provider and the link between chronic conditions and fraud cases.
Scatter plot correlates longer hospital stays with longer claim periods, with points marked for potential fraud. Bar chart shows most claims are low-cost, with some flagged as fraudulent across all cost ranges.
Diagnosis codes and reimbursement amounts are key predictors of fraud in healthcare claims.
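One common way to rank predictors like this is the impurity-based feature importance of a tree ensemble. The sketch below assumes scikit-learn and uses synthetic data with made-up feature names (`reimbursement`, `noise`); it only illustrates the technique and does not reproduce the project's analysis or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only: the label depends on the
# "reimbursement" feature and is unrelated to the "noise" feature.
rng = np.random.default_rng(0)
n = 500
reimbursement = rng.normal(1000, 300, n)
noise = rng.normal(0, 1, n)
X = np.column_stack([reimbursement, noise])
y = (reimbursement > 1100).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# feature_importances_ sums to 1; a larger value means the feature
# accounted for more of the impurity reduction across the trees.
importances = dict(zip(["reimbursement", "noise"], model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Impurity-based importances can overstate high-cardinality features such as raw codes, so a permutation-based check is a reasonable complement.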
We implemented several models, including Logistic Regression, Decision Tree, Random Forest, CNN, LSTM+AE, and RNN.
LR (Logistic Regression)
SVC (Support Vector Classifier)
DT (Decision Tree)
NB (Naive Bayes)
The Decision Tree (DT) model outperforms the other three, showing higher Area Under the Curve (AUC) scores in both training and testing.
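For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen fraudulent claim receives a higher score than a randomly chosen legitimate one. The following is a small pure-Python sketch of that definition; the function name `roc_auc` and the example labels and scores are hypothetical, not values from this project.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fraud) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Comparing this value on training and test data, as in the comparison above, helps show whether a model's ranking ability generalizes or is a sign of overfitting.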
LSTM (Long Short-Term Memory)
LSTM+AE (Long Short-Term Memory + Autoencoder)
CNN (Convolutional Neural Network)
RNN (Recurrent Neural Network)
LSTM and LSTM with Autoencoder show similar stability in accuracy, while CNN and RNN models display consistent validation performance.
ML Models | Accuracy |
---|---|
Logistic Regression | 0.728 |
Decision Tree | 0.726 |
Support Vector Classifier | 0.729 |
Naive Bayes | 0.406 |
Random Forest | 0.742 |
DL Models | Accuracy |
---|---|
LSTM | 0.601 |
LSTM+AE | 0.619 |
CNN | 0.570 |
RNN | 0.409 |
Random Forest outperforms all other models with an accuracy of 0.742, while LSTM+AE leads the deep learning category at 0.619.
1. Identified different types of fraud through data analysis.
2. Random Forest shows the best performance, with the highest accuracy (0.742).
3. DL models, particularly LSTM+AE, show promise in recognizing complex fraud patterns.
4. Reimbursement amounts are key indicators of potential fraud, and diagnosis codes strongly correlate with fraudulent activity.
1. Collaborate with healthcare professionals to uncover new fraud typologies.
2. Experiment with various data encoding methods to boost model precision.
3. Enrich datasets with comprehensive profiles of healthcare providers.
4. Build instant fraud alert systems for healthcare transaction monitoring.
5. Train models to differentiate between fraud and legitimate billing anomalies.