Research
Knowledge Graph as Guardrails: Achieving Conceptually Sound LLM Complaint Classification Without Fine-Tuning
Large Language Models (LLMs) have shown remarkable capabilities in various classification tasks, but their deployment in high-stakes environments such as banking introduces significant risks, particularly when complaints are expressed subtly. This tutorial paper addresses the critical intersection of two challenges: achieving conceptual soundness in LLM applications without fine-tuning, as required by banking regulations, and accurately identifying complaint indicators beneath subtle language. By implementing knowledge graphs as explicit decision frameworks, we present a methodology for making pre-trained LLMs conceptually sound and trustworthy for complaint classification. Our approach emphasizes structured decomposition of decisions, explicit rule implementation, component-level validation, and transparent decision tracing to ensure alignment with regulatory expectations, all without requiring domain-specific fine-tuning of the underlying models.
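To make the "knowledge graph as explicit decision framework" idea concrete, the following is a minimal sketch, not the paper's implementation: the node names, the toy rules, and the `call_llm` helper are hypothetical stand-ins. The point it illustrates is that the LLM only answers narrow yes/no questions, while the classification and its trace come from traversing explicit rules.

```python
# Illustrative sketch only: node names, rules, and `call_llm` are hypothetical,
# not the paper's actual schema or prompts.
from dataclasses import dataclass

@dataclass
class DecisionNode:
    """One node in the decision graph: a yes/no question the LLM must answer."""
    question: str
    yes_next: str | None = None   # node id to visit if the answer is "yes"
    no_next: str | None = None    # node id to visit if the answer is "no"
    label: str | None = None      # terminal classification, if this is a leaf

# A toy graph encoding explicit complaint-classification rules.
GRAPH = {
    "root": DecisionNode("Does the message express dissatisfaction with a product or service?",
                         yes_next="harm", no_next="not_complaint"),
    "harm": DecisionNode("Does the customer describe a concrete harm or unresolved issue?",
                         yes_next="complaint", no_next="subtle"),
    "subtle": DecisionNode("Is dissatisfaction implied through subtle language (sarcasm, resignation)?",
                           yes_next="complaint", no_next="not_complaint"),
    "complaint": DecisionNode("", label="complaint"),
    "not_complaint": DecisionNode("", label="not_complaint"),
}

def call_llm(question: str, text: str) -> bool:
    """Hypothetical wrapper around a pre-trained LLM that answers yes/no questions."""
    raise NotImplementedError  # e.g., prompt the model and parse a "yes"/"no" reply

def classify(text: str) -> tuple[str, list[str]]:
    """Traverse the graph, recording each question/answer as a transparent decision trace."""
    node_id, trace = "root", []
    while GRAPH[node_id].label is None:
        node = GRAPH[node_id]
        answer = call_llm(node.question, text)
        trace.append(f"{node.question} -> {'yes' if answer else 'no'}")
        node_id = node.yes_next if answer else node.no_next
    return GRAPH[node_id].label, trace
```

Because every intermediate answer is logged, each component question can be validated separately against labeled examples, which is what enables component-level validation without fine-tuning the model itself.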
Human-Calibrated Automated Testing and Validation of Generative Language Models: An Overview
This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk, and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction.
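The two-stage calibration step can be sketched as follows; this is a generic illustration under assumed inputs (raw automated scores and binary human judgments), not the HCAT implementation itself. Stage one fits a Platt-style calibrator from machine scores to human labels; stage two applies split-conformal prediction to derive an acceptance threshold with a target error rate.

```python
# Minimal sketch, assuming raw automated scores and binary human judgments are available;
# the data, score function, and alpha level are placeholders, not HCAT's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate(raw_scores, human_labels):
    """Stage 1: map raw machine scores to calibrated probabilities via logistic (Platt) calibration."""
    model = LogisticRegression()
    model.fit(np.asarray(raw_scores).reshape(-1, 1), human_labels)
    return lambda s: model.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Stage 2: split conformal prediction -- choose a nonconformity cutoff so that
    accepted evaluations disagree with human judgment at most ~alpha of the time."""
    cal_probs = np.asarray(cal_probs)
    cal_labels = np.asarray(cal_labels)
    # Nonconformity score: 1 minus the calibrated probability assigned to the true label.
    scores = np.where(cal_labels == 1, 1 - cal_probs, cal_probs)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level)

# Usage sketch: fit the calibrator on a human-labelled split, compute the conformal
# threshold on a held-out calibration split, then accept a new automated evaluation
# only when its nonconformity score falls below the threshold.
```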
In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
Less Discriminatory Alternative and Interpretable XGBoost Framework for Binary Classification
Fair lending practices and model interpretability are crucial concerns in the financial industry, especially given the increasing use of complex machine learning models. In response to the Consumer Financial Protection Bureau’s (CFPB) requirement to protect consumers against unlawful discrimination, we introduce LDA-XGB1, a novel less discriminatory alternative (LDA) machine learning model for fair and interpretable binary classification. LDA-XGB1 is developed through bi-objective optimization that balances accuracy and fairness, with both objectives formulated using binning and information value. It leverages the predictive power and computational efficiency of XGBoost while ensuring inherent model interpretability, including the enforcement of monotonic constraints. We evaluate LDA-XGB1 on two datasets: SimuCredit, a simulated credit approval dataset, and COMPAS, a real-world recidivism prediction dataset. Our results demonstrate that LDA-XGB1 achieves an effective balance between predictive accuracy, fairness, and interpretability, often outperforming traditional fair lending models. This approach equips financial institutions with a powerful tool to meet regulatory requirements for fair lending while maintaining the advantages of advanced machine learning techniques.
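A rough sketch of the bi-objective idea appears below. It is illustrative only: the binning scheme, the trade-off weight `lam`, the depth-1 architecture, and the column roles are assumptions for the example, not the paper's exact formulation. The sketch scores a candidate XGBoost model by the information value (IV) of its binned predictions against the target (predictive strength) minus a penalty for the IV of the same predictions against a protected attribute (disparity).

```python
# Illustrative sketch, not the LDA-XGB1 implementation; column names, lam, and the
# monotone-constraint string are hypothetical.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

def information_value(score, flag, n_bins=10):
    """IV of a continuous score with respect to a binary flag, via quantile binning."""
    bins = pd.qcut(score, n_bins, duplicates="drop")
    tab = pd.crosstab(bins, flag)
    good = tab[0] / tab[0].sum()
    bad = tab[1] / tab[1].sum()
    woe = np.log((bad + 1e-6) / (good + 1e-6))
    return float(((bad - good) * woe).sum())

def bi_objective(model, X, y, protected, lam=0.5):
    """Higher is better: predictive IV minus lam times IV against the protected attribute."""
    score = model.predict_proba(X)[:, 1]
    return information_value(score, y) - lam * information_value(score, protected)

def fit_candidate(X, y, monotone):
    """Interpretable candidate: shallow boosted trees with monotonic constraints enforced."""
    model = XGBClassifier(max_depth=1, n_estimators=200,
                          monotone_constraints=monotone)  # e.g. "(1,-1,0)"
    return model.fit(X, y)
```

Sweeping `lam` (or the candidate hyperparameters) and keeping the models on the resulting accuracy-fairness frontier is one simple way to surface less discriminatory alternatives for review.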
Empirical Loss Weight Optimization for PINN Modeling Laser Bio-effects on Human Skin for the 1D Heat Equation
The application of deep neural networks towards solving problems in science and engineering has demonstrated encouraging results with the recent formulation of physics-informed neural networks (PINNs). Through the development of refined machine learning techniques, the high computational cost of obtaining numerical solutions for partial differential equations governing complicated physical systems can be mitigated. However, solutions are not guaranteed to be unique, and are subject to uncertainty caused by the choice of network model parameters. For critical systems with significant consequences for errors, assessing and quantifying this model uncertainty is essential. In this paper, an application of a PINN to laser bio-effects with limited training data is provided for uncertainty quantification analysis. Additionally, an efficacy study is performed to investigate the impact of the relative weights of the loss components of the PINN and how the uncertainty in the predictions depends on these weights. Network ensembles are constructed to empirically investigate the diversity of solutions across an extensive sweep of hyperparameters to determine the model that consistently reproduces a high-fidelity numerical simulation.
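For readers unfamiliar with weighted PINN losses, the following condensed sketch shows the structure for the 1D heat equation u_t = alpha * u_xx; the weights (w_pde, w_ic, w_bc) are the quantities a sweep like the one described above would vary. The network size, diffusivity value, and data layout are illustrative assumptions, not the paper's configuration.

```python
# Minimal PINN sketch for the 1D heat equation with weighted loss components.
# Architecture, constants, and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn

alpha = 1.0e-4  # thermal diffusivity (placeholder value)

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))

def pde_residual(x, t):
    """Residual of u_t - alpha * u_xx at collocation points, via automatic differentiation."""
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - alpha * u_xx

def loss_fn(collocation, initial, boundary, w_pde=1.0, w_ic=1.0, w_bc=1.0):
    """Weighted composite loss; varying (w_pde, w_ic, w_bc) changes the solution the PINN converges to."""
    x_c, t_c = collocation
    x_i, t_i, u_i = initial      # known initial temperature profile
    x_b, t_b, u_b = boundary     # known boundary temperatures
    l_pde = pde_residual(x_c, t_c).pow(2).mean()
    l_ic = (net(torch.cat([x_i, t_i], dim=1)) - u_i).pow(2).mean()
    l_bc = (net(torch.cat([x_b, t_b], dim=1)) - u_b).pow(2).mean()
    return w_pde * l_pde + w_ic * l_ic + w_bc * l_bc
```

Training an ensemble of such networks over a grid of (w_pde, w_ic, w_bc) and comparing the spread of their predictions against a reference simulation is one straightforward way to study how the loss weighting drives predictive uncertainty.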