Author: Matteo Zecchin

Adaptive Learn-Then-Test

Motivation

Hyperparameter selection is a fundamental step in deploying machine learning models, aimed at identifying configurations that meet specified requirements in terms of performance, robustness, or safety. Recent approaches based on the Learn-Then-Test (LTT) [1] framework formulate this task as a multiple hypothesis testing procedure. For each candidate hyperparameter, LTT tests whether the corresponding model meets a target reliability level by evaluating it on multiple instances of the task (e.g., by deploying the model in real-world scenarios). Despite its theoretical guarantees, LTT supports only non-adaptive testing, in which all evaluation decisions and the length of the testing phase must be fixed in advance. This rigidity limits its practical utility in safety-critical environments, where minimizing the cost of testing is essential.
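
To make the baseline concrete, the following Python sketch illustrates one common instantiation of LTT for binary (0/1) losses: each candidate hyperparameter receives a binomial tail p-value for the null hypothesis that its risk exceeds a target level, and a Bonferroni correction controls the family-wise error rate. The function name and the binomial p-value are illustrative choices rather than the exact recipe of [1], and note that the entire evaluation budget must be fixed before testing begins.

import numpy as np
from scipy.stats import binom

def ltt_select(losses, alpha=0.1, delta=0.05):
    # losses[k] is a fixed array of 0/1 losses for hyperparameter k,
    # collected on a pre-specified number of test instances.
    n_candidates = len(losses)
    selected = []
    for k, ell in enumerate(losses):
        n = len(ell)
        # p-value for H0: the risk of hyperparameter k is at least alpha
        p_value = binom.cdf(int(np.sum(ell)), n, alpha)
        # Bonferroni correction over all candidates controls the FWER at level delta
        if p_value <= delta / n_candidates:
            selected.append(k)
    return selected

# Example with 4 candidates evaluated on 200 instances each
rng = np.random.default_rng(0)
losses = [rng.binomial(1, p, size=200) for p in (0.02, 0.05, 0.12, 0.30)]
print(ltt_select(losses, alpha=0.1, delta=0.05))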

E-process-based testing

To overcome this limitation, our recent work, accepted at ICML 2025, introduces adaptive Learn-Then-Test (aLTT), a statistically rigorous, sequential testing framework that enables efficient, data-driven hyperparameter selection with provable reliability guarantees. The core innovation behind aLTT is its use of e-process-based multiple hypothesis testing [2], which replaces the traditional p-value-based tests employed in LTT. E-processes support sequential, data-adaptive hypothesis testing while maintaining formal statistical guarantees.

Practically speaking, as illustrated in Figure 1, this means that at each testing round the experimenter can decide, based on the accumulated evidence, whether to continue testing specific hyperparameters or to stop once a sufficiently large set of reliable candidates has been identified. All of this is achieved without sacrificing the statistical guarantees of the procedure in terms of family-wise error rate (FWER) or false discovery rate (FDR) control. This stands in sharp contrast to p-value-based approaches, where such flexibility would invalidate the statistical guarantees of the procedure, an insidious problem known as p-hacking.
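
As a minimal illustration of the mechanism, the sketch below implements a betting-style e-process for a single hyperparameter with losses in [0, 1]: the e-value grows whenever the observed losses fall below the target level alpha, and the null hypothesis "the risk is at least alpha" can be rejected at any data-dependent stopping time once the e-value exceeds 1/delta, by Ville's inequality. The betting fraction and function name are illustrative assumptions, not the exact construction used in aLTT.

import numpy as np

def e_process_test(loss_stream, alpha=0.1, delta=0.05):
    # Betting fraction; any value in (0, 1/(1 - alpha)] keeps the e-process valid.
    eta = 0.5 / (1.0 - alpha)
    e_value = 1.0
    for t, loss in enumerate(loss_stream, start=1):
        # Under H0 (risk >= alpha) this running product is a nonnegative supermartingale.
        e_value *= 1.0 + eta * (alpha - loss)
        if e_value >= 1.0 / delta:
            return t          # anytime-valid rejection: stop as soon as the evidence suffices
    return None               # the evidence never became strong enough

# Example: a hyperparameter whose true risk (0.03) is below the target alpha = 0.1
rng = np.random.default_rng(1)
print(e_process_test(rng.binomial(1, 0.03, size=1000), alpha=0.1, delta=0.05))

In aLTT, one such e-process is maintained for each candidate hyperparameter, and the accumulated evidence is combined through FWER- or FDR-controlling selection rules.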

Figure 1: aLTT enables data-adaptive testing and flexible termination rules. At each testing round, based on the accumulated evidence, it is possible to decide which hyperparameters to test next and whether to continue testing.

Automated Prompt Engineering

The aLTT framework is broadly applicable to any setting where reliable configuration must be achieved under limited testing budgets. In our paper, we demonstrate its effectiveness in three concrete domains: configuring wireless network policies, selecting offline reinforcement learning strategies, and optimizing prompts for large language models. In the prompt engineering setting [3], the goal is to identify instructions (prompts) that consistently lead an LLM to generate accurate, relevant, or high-quality responses across tasks. Since each prompt must be tested by running the LLM—often a costly operation—efficiency is critical. aLTT enables the sequential testing of prompts, adaptively prioritizing those that show early promise and terminating the process once enough reliable prompts are found. As shown in Figure 2, this not only reduces the computational burden (yielding a higher true discovery rate under the same testing budget), but also leads to the discovery of shorter, more effective prompts—a valuable property in latency-sensitive or resource-constrained environments. The result: fewer evaluations, higher-quality prompts, and rigorous statistical reliability.

Figure 2: (Left) True positive rate as a function of the testing horizon attained by aLTT with $\epsilon$-greedy exploration and by LTT. (Right) Length of the shortest prompt in the predicted set of reliable hyperparameters returned by aLTT and LTT. aLTT needs fewer testing rounds to return high-quality, short prompts.
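
As a usage sketch of the adaptive acquisition illustrated above, the code below couples per-prompt e-processes of the kind shown earlier with an $\epsilon$-greedy rule: at each round the most promising prompt is usually queried, a random one is occasionally explored, and testing stops once a desired number of prompts has been certified. The stopping rule, exploration rate, and Bonferroni-style threshold are illustrative choices, not the exact procedure of the paper; evaluate(k) stands for running the LLM with prompt k on a fresh task instance and scoring the response.

import numpy as np

def adaptive_prompt_testing(evaluate, n_prompts, alpha=0.1, delta=0.05,
                            eps=0.1, budget=2000, target_discoveries=3, seed=0):
    rng = np.random.default_rng(seed)
    eta = 0.5 / (1.0 - alpha)
    e_values = np.ones(n_prompts)
    certified = set()
    for _ in range(budget):
        active = [k for k in range(n_prompts) if k not in certified]
        if not active:
            break
        if rng.random() < eps:                       # explore a random prompt
            k = int(rng.choice(active))
        else:                                        # exploit the most promising prompt
            k = max(active, key=lambda j: e_values[j])
        e_values[k] *= 1.0 + eta * (alpha - evaluate(k))
        if e_values[k] >= n_prompts / delta:         # Bonferroni-style anytime threshold
            certified.add(k)
        if len(certified) >= target_discoveries:     # flexible, data-driven termination
            break
    return certified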

References

[1] Angelopoulos AN, Bates S, Candès EJ, Jordan MI, Lei L. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052. 2021 Oct 3.

[2] Xu Z, Wang R, Ramdas A. A unified framework for bandit multiple testing. Advances in Neural Information Processing Systems. 2021 Dec 6;34:16833-45.

[3] Zhou Y, Muresanu AI, Han Z, Paster K, Pitis S, Chan H, Ba J. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations. 2022 Nov 3.

Localized Adaptive Risk Control

Motivation

In many online decision-making settings, ensuring that predictions are well-calibrated is crucial for the safe operation of systems. One way to achieve calibration is through adaptive risk control, which adjusts the uncertainty estimates of a machine learning model based on past feedback [1]. This method guarantees that the calibration error over an arbitrary sequence is controlled and that, in the long run, the model becomes statistically well-calibrated if the data points are independently and identically distributed [2]. However, these schemes only ensure calibration when averaged across the entire input space, raising concerns about fairness and robustness. For instance, consider the figure below, which depicts a tumor segmentation model calibrated to identify potentially cancerous areas. If the model is calibrated using images from different datasets, marginal calibration may be achieved by prioritizing certain subpopulations at the expense of others.

A tumor segmentation model is calibrated using data from two sources to ensure that the marginal false negative rate (FNR) is controlled. However, as shown on the right, the error rate for one source is significantly lower than for the other, leading to unfair performance across subpopulations.

Localized Adaptive Risk Control

To address this issue, our recent work at NeurIPS 2024 proposes a method to localize uncertainty estimates by leveraging the connection between online learning in reproducing kernel Hilbert spaces [3] and online calibration methods. The key idea behind our approach is to use feedback to adjust a model’s confidence levels only in regions of the input space that are near observed data points. This allows for localized calibration, tailoring uncertainty estimates to specific areas of the input space. We demonstrate that, for adversarial sequences, the number of mistakes can be controlled. More importantly, the scheme provides localized asymptotic guarantees, meaning they remain valid under a wide range of covariate shifts, for instance those induced by restricting attention to certain subpopulations of the data.
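
A minimal sketch of the idea follows, assuming a Gaussian kernel and a simple online update; the exact algorithm and its guarantees are given in the paper, and all names and constants below are illustrative. The calibration threshold is modeled as a function in a reproducing kernel Hilbert space, and each feedback signal adjusts it only in a neighborhood of the observed input, so coverage adapts locally rather than only on average.

import numpy as np

class LocalizedThreshold:
    # Online, kernel-localized calibration threshold lambda_t(x); illustrative sketch.
    def __init__(self, alpha=0.1, lr=0.05, bandwidth=1.0, offset=0.0):
        self.alpha = alpha        # target error level (e.g., miscoverage or FNR)
        self.lr = lr              # learning rate of the online update
        self.bw = bandwidth       # kernel bandwidth controlling how local the update is
        self.offset = offset      # global component of the threshold
        self.centers, self.weights = [], []

    def _kernel(self, x, c):
        return np.exp(-np.sum((np.asarray(x) - c) ** 2) / (2.0 * self.bw ** 2))

    def threshold(self, x):
        return self.offset + sum(w * self._kernel(x, c)
                                 for w, c in zip(self.weights, self.centers))

    def update(self, x, err):
        # err = 1 if the prediction set at x failed (e.g., missed part of the tumor), else 0.
        # A failure raises the threshold near x, enlarging future sets in that region,
        # while a success slowly lowers it; letting bandwidth -> infinity recovers a
        # marginal update in the spirit of adaptive risk control [1].
        self.centers.append(np.asarray(x, dtype=float))
        self.weights.append(self.lr * (err - self.alpha))

The prediction set at an input x then contains all candidates whose nonconformity score falls below threshold(x); a practical implementation would also limit the number of stored kernel centers to bound memory and computation.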

Experiments

Comparison between the coverage map obtained using adaptive risk control (on the left) and localized adaptive risk control (on the right). Adaptive risk control is unable to deliver uniform coverage across the deployment areas, leading to large regions where the SNR level is unsatisfactory. In contrast, localized adaptive risk control is capable of guaranteeing a more uniform SNR level, improving the overall system coverage.

To demonstrate the fairness improvements of our algorithm, we conducted a series of experiments using standard machine learning benchmarks as well as wireless communication problems. Specifically, in the wireless domain, we considered the problem of beam selection based on contextual information. Here, a base station must select a subset of communication beam vectors to guarantee a level of signal-to-noise ratio (SNR) across a deployment area. Standard calibration methods like adaptive risk control (on the left) result in substantial SNR variation across the area, creating regions where communication is impossible. In contrast, our localized adaptive risk control scheme (on the right) enables the base station to calibrate the beam selection algorithm to match the local uncertainty, providing more uniform coverage throughout the deployment area.

 

References

[1] Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34 (2021).

[2] Anastasios Nikolas Angelopoulos, Rina Barber, Stephen Bates. Online conformal prediction with decaying step sizes. Proceedings of the 41st International Conference on Machine Learning. (2024).

[3] Jyrki Kivinen, Alex Smola and Robert C. Williamson. Online Learning with Kernels. Advances in Neural Information Processing Systems, 14 (2001).

Generalization and Informativeness of Conformal Prediction

Motivation

When using a machine learning model to make important decisions, as in healthcare, finance, or engineering, we not only need accurate predictions but also want to know how confident the model is in its answers [1-3]. Conformal prediction (CP) offers a practical solution for generating certified “error bars”, that is, ranges of uncertainty with formal coverage guarantees, by post-processing the outputs of a fixed, pre-trained base predictor. This is crucial for safety and reliability. At the upcoming ISIT 2024 conference, we will present our research work, which relates the generalization properties of the base predictor to the expected size of the prediction sets produced by CP, also known as their informativeness. Understanding the informativeness of CP is particularly relevant, as it can usually only be assessed at test time.

Conformal prediction

Figure 1: Conformal prediction (CP) set predictors (gray areas) obtained by calibrating a base predictor with a higher generalization error on the left and a lower generalization error on the right. Thanks to CP, both set predictors satisfy a user-defined coverage guarantee, but the inefficiency, i.e., the average prediction set size, is larger when the generalization error of the base predictor is larger.

The most practical form of CP, known as inductive CP, divides the available data into a training set and a calibration set [4]. We use the training data to train a base model, and the calibration data to determine the prediction sets around the decisions made by the base model. As shown in Figure 1, a more accurate base predictor, which generalizes better outside the training set, tends to produce more informative sets when CP is applied.
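
For concreteness, here is a minimal sketch of inductive CP for classification, assuming a scikit-learn-style base model exposing predict_proba and integer class labels; the variable names are illustrative.

import numpy as np

def split_conformal_sets(model, X_cal, y_cal, X_test, alpha=0.1):
    # Nonconformity score: one minus the probability assigned to the true label.
    cal_probs = model.predict_proba(X_cal)
    cal_scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]
    # Finite-sample corrected (1 - alpha) quantile of the calibration scores.
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1.0 - alpha)) / n, 1.0)
    q_hat = np.quantile(cal_scores, level, method="higher")
    # Prediction set: all labels whose score falls below the calibrated threshold.
    test_probs = model.predict_proba(X_test)
    return [np.where(1.0 - p <= q_hat)[0] for p in test_probs]

The better the base model generalizes, the more the calibration scores concentrate near zero, the smaller the calibrated threshold q_hat, and hence the smaller the prediction sets, which is the qualitative link between generalization and informativeness illustrated in Figure 1.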

Results

Figure 2: Bound on the average set size for different values of the training and calibration data set sizes, as a function of the target reliability level. Increasing the number of calibration data points causes the bound to converge exponentially fast to a function (black line) that is increasing in the target reliability level and decreasing in the amount of training data.

Our work’s main contribution is a high-probability bound on the expected size of the predicted sets. The bound relates the informativeness of CP to the generalization properties of the base model and to the amount of available training and calibration data. As illustrated in Figure 2, our bound predicts that, as the amount of calibration data increases, CP’s inefficiency converges rapidly to a quantity determined by the coverage level, the size of the training set, and the predictor’s generalization performance. However, for a finite amount of calibration data, the bound is also influenced by the discrepancy between the target reliability and the empirical reliability measured over the training data set. Overall, the bound justifies a common practice: allocating more data to training the base model than to calibrating it.

Figure 3: Normalized empirical CP set size for a multi-class classification problem on the MNIST data set as a function of the reliability level and for different sizes of the calibration and training data sets.

Since what really proves the worth of a theory is how well it holds up in real-world testing, we also compare our theoretical findings with numerical evaluations. In our study, we considered both classification and regression tasks, ran CP with various splits of calibration and training data, and measured the average set size. As shown in Figure 3, the empirical results match the behavior predicted by our theory in Figure 2.

References

[1] A. L. Beam and I. S. Kohane, “Big data and machine learning in health care,” JAMA, vol. 319, no. 13, pp. 1317–1318, 2018.

[2] J. W. Goodell, S. Kumar, W. M. Lim, and D. Pattnaik, “Artificial intelligence and machine learning in finance: Identifying foundations, themes, and research clusters from bibliometric analysis,” Journal of Behavioral and Experimental Finance, vol. 32, p. 100577, 2021.

[3] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based model predictive control: Toward safe learning in control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.

[4] V. Vovk, A. Gammerman, and G. Shafer, Algorithmic learning in a random world, vol. 29. Springer, 2005.

Safe Model Predictive Control via Reliable Time-Series Forecasting

Motivation

The control of dynamical systems is the backbone of modern technologies, ranging from industrial processes to autonomous vehicles. In many of these scenarios, systems must be controlled while satisfying a set of safety and reliability constraints with respect to the unknown evolution of a target process. For example, as illustrated in Figure 1, autonomous vehicles or unmanned aerial vehicles (UAVs) must plan their trajectory while maintaining a safe distance from other vehicles or obstacles. To this end, predictions about the future evolution of the system must be used. In this context, a primary challenge is to ensure safety and reliability in the face of predictions that are often uncertain.

Figure 1: UAV tracking problem, an example of model predictive control in which the UAV must plan its path based on the unknown evolution of the object to be tracked.

Probabilistic Time Series-Conformal Risk Prediction

To support the deployment of reliable control mechanisms for dynamical systems, in our work we have recently proposed probabilistic time series-conformal risk prediction (PTS-CRC) [3]. PTS-CRC is a novel post-hoc calibration procedure that operates on the predictions produced by any pre-designed probabilistic forecaster to yield reliable time series prediction sets. As illustrated in Figure 2, PTS-CRC generates predictive sets based on an ensemble of multiple prototype trajectories sampled from the probabilistic model, supporting the efficient representation of forking uncertainties. This contrasts with previous solutions that apply conformal prediction [1] to deterministic predictors (TS-CP) [2], which are limited to producing compact prediction sets. Furthermore, the sets produced by PTS-CRC can be calibrated to satisfy a wide array of reliability definitions beyond the standard notion of coverage.
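
As a hedged illustration of the construction (scalar time series; all function names are assumptions): K prototype trajectories are sampled from the forecaster, the prediction set at each step is the union of balls of a common radius around the prototypes, and the radius is calibrated on held-out sequences so that the miscoverage risk is controlled. The simple calibration rule below is a simplified stand-in for the conformal risk control procedure of [3].

import numpy as np

def sample_prototypes(forecaster, x_past, n_prototypes=3):
    # K candidate future trajectories ("prototypes") drawn from the probabilistic forecaster;
    # forecaster.sample(x_past) is assumed to return a length-H array of future values.
    return np.stack([forecaster.sample(x_past) for _ in range(n_prototypes)])

def covered(prototypes, radius, y_future):
    # 1 if the realized trajectory stays within `radius` of some prototype at every step.
    dists = np.min(np.abs(prototypes - np.asarray(y_future)[None, :]), axis=0)
    return float(np.all(dists <= radius))

def calibrate_radius(forecaster, cal_pairs, alpha=0.1, radii=np.linspace(0.0, 5.0, 101)):
    # Smallest radius whose corrected miscoverage over held-out (past, future) pairs
    # is at most alpha; a simplified stand-in for conformal risk control.
    protos = [sample_prototypes(forecaster, x) for x, _ in cal_pairs]
    n = len(cal_pairs)
    for r in radii:
        miss = sum(1.0 - covered(p, r, y) for p, (_, y) in zip(protos, cal_pairs))
        if (miss + 1.0) / (n + 1.0) <= alpha:
            return float(r)
    return float(radii[-1])

# The prediction set at each step t is the union of the intervals
# [prototypes[k, t] - r_hat, prototypes[k, t] + r_hat] over the K prototypes,
# which is in general non-contiguous and can represent forking (multimodal) futures.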

Figure 2: Construction of a prototype-based set predictor based on 3 prototypical sequences.

PTS-CRC Based Model Predictive Control

Based on the reliability properties of PTS-CRC predictions, we devise a novel model predictive control (MPC) framework that addresses open-loop and closed-loop control problems under general average constraints on the quality or safety of the control policy. The key idea is to derive the control policy by replacing constraints that depend on the unknown dynamics of the target process with constraints on the predictive sets output by PTS-CRC. The reliability guarantees of the PTS-CRC predictions then translate into reliability guarantees for the original control problem.
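
Conceptually, one step of the resulting controller can be sketched as follows; the scipy-based solver call, the cost and constraint functions, and the endpoint-based worst-case check are illustrative placeholders rather than the formulation used in the paper. Any constraint that would depend on the unknown future trajectory is enforced over the PTS-CRC prediction set, so the reliability of the set carries over to the control constraint.

import numpy as np
from scipy.optimize import minimize

def pts_crc_mpc_step(prototypes, radius, cost_fn, constraint_fn, u0, u_bounds):
    # One open-loop MPC step: minimize the control cost subject to the safety constraint
    # holding over the PTS-CRC prediction set, approximated here by checking the
    # endpoints of each inflated prototype trajectory.
    def worst_case(u):
        # constraint_fn(u, y) <= 0 is required for every admissible future trajectory y.
        return max(constraint_fn(u, y + s * radius)
                   for y in prototypes for s in (-1.0, 1.0))

    res = minimize(cost_fn, np.asarray(u0, dtype=float), bounds=u_bounds,
                   constraints=[{"type": "ineq", "fun": lambda u: -worst_case(u)}])
    return res.x

A closed-loop controller would re-solve this problem at each step after observing the realized state and refreshing the forecaster’s context.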

Experiments

We apply PTS-CRC-based MPC to wireless networking problems, specifically focusing on a scenario where a base station must modulate its future power allocation based on the unknown evolution of channel conditions. For instance, we address the challenge of controlling the transmit power to maximize the communication rate of an unlicensed user while adhering to a safety requirement, expressed as a constraint on the maximum interference experienced by a licensed user. By employing PTS-CRC, we can replace the unknown system evolution with informative predictive sets that capture the multimodal evolution of the channel more effectively than TS-CP (Figure 3). As exemplified in Figure 4, PTS-CRC-based power control leads to power allocations that achieve a higher communication rate than TS-CP.

Figure 3: Comparison between the prediction sets of TS-CP and PTS-CRC for the problem of channel gain evolution forecasting.

Figure 4: Comparison between the power control solution obtained using PTS-CRC and TS-CP based MPC.

References

[1] Vovk, Vladimir, Alexander Gammerman, and Glenn Shafer. “Algorithmic learning in a random world,” Vol. 29. New York: Springer, 2005.

[2] Stankeviciute, Kamile, Ahmed M. Alaa, and Mihaela van der Schaar. “Conformal time-series forecasting.” Advances in Neural Information Processing Systems 34, 2021.

[3] Zecchin, Matteo, Sangwoo Park, and Osvaldo Simeone. “Forking Uncertainties: Reliable Prediction and Model Predictive Control with Sequence Models via Conformal Risk Control.” arXiv preprint arXiv:2310.10299, 2023.