Surrogate modeling for LLM black box interpretability

Surrogate model for 10-year cardiovascular disease risk prediction

Inputs

Demographics and history: Age (years), Sex, Race/ethnicity, Smoking status, Diabetes, Treatment for hypertension, Dyslipidemia, History of atrial fibrillation, Chronic kidney disease, Family history of cardiovascular disease in first degree relatives

Measurements: Systolic BP (mmHg), Diastolic BP (mmHg), Height (cm), Weight (kg), Waist circumference (cm), Hip circumference (cm)

Laboratory values: Total cholesterol (mg/dL), HDL cholesterol (mg/dL), LDL cholesterol (mg/dL), Triglycerides (mg/dL), HbA1c (%), Creatinine (mg/dL), Uric acid (mg/dL), C-reactive protein (mg/dL)

Outputs (%)

LLM-derived prediction model

2013 ACC/AHA Pooled Cohort Equations [1]
Background
Our surrogate modeling framework quantifies LLM input-output pairs for a target hypothesis derived from domain knowledge by prompting the LLM across a wide range of simulated scenarios. It consists of three stages.
The first stage generates a comprehensive simulated dataset for the target hypothesis in four steps: outcome selection, input variable selection, probability distribution selection, and random sampling. The choice of input variables related to the selected outcome, and of their probability distributions, is guided by the target hypothesis.
The second stage obtains an LLM response for each of the simulated scenarios generated in the first stage.
The third stage applies statistical modeling (e.g., linear regression), also guided by the target hypothesis, to the pairs of simulated data and LLM responses. This quantifies and generalizes the LLM-encoded knowledge into a single statistical formula, making it explainable. For more detail, please refer to our publication.
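The three stages can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three input variables, their ranges and distributions, the function names, and above all `query_llm_stub`, which stands in for real prompting with a made-up linear rule plus noise. It is not the pipeline used in the publication; it only shows how simulated inputs and collected responses can be reduced to regression coefficients.

```python
import random


def simulate_scenarios(n, seed=0):
    """Stage 1: randomly sample simulated patients (hypothetical distributions)."""
    rng = random.Random(seed)
    return [
        {
            "age": rng.uniform(40, 79),         # years, assumed range
            "diabetes": rng.randint(0, 1),      # 1 = yes, 0 = no
            "systolic_bp": rng.gauss(130, 20),  # mmHg, assumed distribution
        }
        for _ in range(n)
    ]


def query_llm_stub(scenario, rng):
    """Stage 2 placeholder: a real pipeline would prompt an LLM here and
    parse a numeric risk estimate from its answer."""
    risk = (-40.0 + 0.8 * scenario["age"]
            + 15.0 * scenario["diabetes"]
            + 0.1 * scenario["systolic_bp"])
    return risk + rng.gauss(0, 1.0)  # noise mimics response variability


def fit_linear_surrogate(rows, targets):
    """Stage 3: ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, coefficient_1, ..., coefficient_k].
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]
```

Calling `simulate_scenarios`, collecting one `query_llm_stub` response per scenario, and passing the `[age, diabetes, systolic_bp]` rows with the responses to `fit_linear_surrogate` yields an intercept and one coefficient per input, which together form the surrogate formula. In this sketch the fitted coefficients should approximate the rule hard-coded in the stub.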
Equations

The equation for the developed surrogate model for 10-year CVD risk prediction is as follows:

Score = -65.243
+ Age (years) × 0.784
+ Sex** × 6.240
+ Diabetes* × 14.562
+ Treatment for hypertension* × 9.031
+ Dyslipidemia* × 7.294
+ History of atrial fibrillation* × 7.029
+ Chronic kidney disease* × 12.362
+ Family history of cardiovascular disease in first degree relatives* × 2.922
+ Ex-smoker* × 1.057
+ Current smoker* × 5.418
+ Systolic BP (mmHg) × 0.084
+ Diastolic BP (mmHg) × 0.017
+ BMI × 0.093
+ Waist-hip ratio × 0.426
+ HDL cholesterol (mg/dL) × (-0.010)
+ LDL cholesterol (mg/dL) × 0.040
+ Triglycerides (mg/dL) × 0.010
+ HbA1c (%) × 0.506
+ Creatinine (mg/dL) × 1.195
+ Uric acid (mg/dL) × 0.035
+ C-reactive protein (mg/dL) × 0.258

* Yes = 1, No = 0

** Male = 1, Female = 0
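As a worked illustration, the equation can be evaluated with a few lines of Python. The coefficients are copied from the equation above; the function name, the dictionary keys, and the smoking-status labels (`"never"`, `"ex"`, `"current"`) are hypothetical. The sketch also assumes that BMI is computed as weight in kg divided by squared height in m, and that the waist-hip ratio is waist circumference divided by hip circumference, both derived from the form inputs. It returns the raw score only; how the page maps the score to the displayed percentage is not stated here, so no conversion is attempted.

```python
INTERCEPT = -65.243

# Coefficients copied from the published surrogate equation.
COEFFICIENTS = {
    "age": 0.784,
    "male": 6.240,
    "diabetes": 14.562,
    "hypertension_treatment": 9.031,
    "dyslipidemia": 7.294,
    "atrial_fibrillation": 7.029,
    "chronic_kidney_disease": 12.362,
    "family_history_cvd": 2.922,
    "ex_smoker": 1.057,
    "current_smoker": 5.418,
    "systolic_bp": 0.084,
    "diastolic_bp": 0.017,
    "bmi": 0.093,
    "waist_hip_ratio": 0.426,
    "hdl": -0.010,
    "ldl": 0.040,
    "triglycerides": 0.010,
    "hba1c": 0.506,
    "creatinine": 1.195,
    "uric_acid": 0.035,
    "crp": 0.258,
}


def surrogate_score(patient):
    """Evaluate the surrogate equation for one patient (dict of inputs).

    Binary inputs use 1 = yes / male and 0 = no / female.
    """
    features = dict(patient)
    # Derived inputs (assumed definitions, see lead-in).
    height_m = features.pop("height_cm") / 100.0
    features["bmi"] = features.pop("weight_kg") / height_m ** 2
    features["waist_hip_ratio"] = features.pop("waist_cm") / features.pop("hip_cm")
    # Hypothetical smoking labels mapped to the two indicator terms.
    smoking = features.pop("smoking")
    features["ex_smoker"] = 1 if smoking == "ex" else 0
    features["current_smoker"] = 1 if smoking == "current" else 0
    return INTERCEPT + sum(c * features[name] for name, c in COEFFICIENTS.items())
```

Keeping the coefficients in a single dictionary makes it easy to check each term against the equation and to see the contribution of one input, for example by evaluating the same patient with `diabetes` set to 0 and then to 1.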

Publications

[Preprint] Han, C., Kim, D. W., Kim, S., Kim, J., Bae, S., & Yoon, D. A Novel GPT-Derived Scoring System for 10-Year Cardiovascular Disease Risk Estimation. Available at SSRN 4763170.

References

[1] Goff Jr, D. C., Lloyd-Jones, D. M., Bennett, G., Coady, S., D’Agostino, R. B., Gibbons, R., ... & Wilson, P. W. (2014). 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation, 129(25_suppl_2), S49-S73.

Note: This tool is intended for research purposes only. This page does not make any direct calls to the LLM; it only performs calculations using the surrogate model.