{"id":956,"date":"2025-05-02T08:45:56","date_gmt":"2025-05-02T08:45:56","guid":{"rendered":"https:\/\/blogs.kcl.ac.uk\/kclip\/?p=956"},"modified":"2025-05-02T08:46:37","modified_gmt":"2025-05-02T08:46:37","slug":"adaptive-learn-then-test","status":"publish","type":"post","link":"https:\/\/blogs.kcl.ac.uk\/kclip\/2025\/05\/02\/adaptive-learn-then-test\/","title":{"rendered":"Adaptive Learn-Then-Test"},"content":{"rendered":"<h2 style=\"font-weight: 400\"><strong>Motivation<\/strong><\/h2>\n<div>\n<p class=\"p1\">Hyperparameter selection is a fundamental step in deploying machine learning models, aimed at assessing whether a model meets specified requirements in terms of performance, robustness, or safety. Recent approaches based on the <b>Learn-Then-Test <\/b>(LTT) [1] framework formulate this task as <b>a multiple hypothesis testing procedure<\/b>. For each candidate hyperparameter, LTT tests whether the corresponding model meets a target reliability level by evaluating it on multiple instances of the task (e.g., deploying the model in real-world scenarios). Despite its theoretical guarantees, LTT supports only <b>non-adaptive testing<\/b>, where all evaluation decisions and the length of the testing phase must be fixed in advance. This rigidity limits its practical utility in safety-critical environments, where minimizing the cost of testing is essential.<\/p>\n<h2 style=\"font-weight: 400\"><strong>E-process-based testing<\/strong><\/h2>\n<div>\n<p class=\"p1\">To overcome this limitation, <a href=\"https:\/\/arxiv.org\/abs\/2409.15844\">our recent work<\/a>\u2014accepted at ICML 2025\u2014introduces <span class=\"s1\"><b>adaptive Learn-Then-Test (aLTT)<\/b><\/span>, a statistically rigorous, sequential testing framework that enables efficient, data-driven hyperparameter selection with provable reliability guarantees. 
The core innovation behind aLTT is its use of <span class="s1"><b>e-process-based multiple hypothesis testing [2]</b></span>, which replaces the traditional p-value-based testing employed in LTT. E-processes support sequential, data-adaptive hypothesis testing while maintaining formal statistical guarantees.</p>
</div>
<div>
<p class="p1">Practically speaking, as illustrated in Figure 1, this means that at each testing round the experimenter can decide, based on the accumulated evidence, whether to continue testing specific hyperparameters or to stop once a sufficiently large set of reliable candidates has been identified. All of this is achieved <span class="s1"><b>without sacrificing</b></span> the statistical guarantees of the procedure in terms of <span class="s1"><b>family-wise error rate (FWER)</b></span> or <span class="s1"><b>false discovery rate (FDR)</b></span> control. This stands in sharp contrast to p-value-based approaches, where such flexibility would invalidate the statistical guarantees of the procedure.</p>
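<p class="p1">To make the anytime validity of e-processes concrete, here is a minimal, illustrative sketch (not the construction used in our paper) of a single-hypothesis e-process. The function name <code>eprocess_test</code> and the betting parameter <code>lam</code> are our own illustrative choices: the null hypothesis is "the model's average loss is at least <code>target</code>", the wealth process is a nonnegative supermartingale under the null, and Ville's inequality makes rejection at the threshold 1/&delta; valid at any data-dependent stopping time.</p>

```python
def eprocess_test(losses, target=0.2, delta=0.05, lam=1.0):
    """Sequentially test H0: mean loss >= target with a betting e-process.

    Under H0 the expected per-round multiplier 1 + lam * (target - loss)
    is at most 1, so `wealth` is a nonnegative supermartingale; Ville's
    inequality then gives P(wealth ever reaches 1/delta | H0) <= delta.
    Nonnegativity requires 0 <= lam <= 1 / (1 - target) for losses in [0, 1].
    """
    wealth = 1.0
    for t, loss in enumerate(losses, start=1):
        wealth *= 1.0 + lam * (target - loss)
        if wealth >= 1.0 / delta:
            # Safe to stop here, at ANY data-dependent time: rejecting H0
            # (i.e., certifying the model as reliable) keeps error <= delta.
            return t, wealth
    return None, wealth  # evidence insufficient so far; may resume later
```

<p class="p1">For instance, a model that never errs (loss 0 every round, with <code>target=0.2</code> and <code>lam=1.0</code>) multiplies its wealth by 1.2 per round and is certified after 17 rounds, whereas a model that always errs sees its wealth shrink and is never certified.</p>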
<p class="p1">This kind of data-dependent tampering with a p-value-based test is the insidious problem known as <b>p-hacking</b>.</p>
<div id="attachment_957" style="width: 1062px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-957" class=" wp-image-957" src="https://blogs.kcl.ac.uk/kclip/files/2025/05/Picture1-300x83.png" alt="" width="1052" height="291" srcset="https://blogs.kcl.ac.uk/kclip/files/2025/05/Picture1-300x83.png 300w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Picture1-1024x282.png 1024w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Picture1-768x212.png 768w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Picture1-676x186.png 676w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Picture1.png 1430w" sizes="auto, (max-width: 1052px) 100vw, 1052px" /><p id="caption-attachment-957" class="wp-caption-text">Figure 1: aLTT enables data-adaptive testing and flexible termination rules. At each testing round, based on the accumulated evidence, it is possible to decide which hyperparameters to test next and whether to continue testing.</p></div>
<h2>Automated Prompt Engineering</h2>
<div>
<p class="p1">The aLTT framework is broadly applicable to any setting where reliable configuration must be achieved under a limited testing budget. In our paper, we demonstrate its effectiveness in three concrete domains: configuring wireless network policies, selecting offline reinforcement learning strategies, and optimizing prompts for large language models. In the <span class="s1"><b>prompt engineering</b></span> setting [3], the goal is to identify instructions (prompts) that consistently lead an LLM to generate accurate, relevant, or high-quality responses across tasks. Since each prompt must be tested by running the LLM, often a costly operation, efficiency is critical.</p>
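<p class="p1">The following sketch illustrates, under simplifying assumptions, how such an adaptive loop over multiple candidate prompts might look. It is not the paper's exact acquisition rule: the function <code>altt_select</code>, the user-supplied <code>evaluate</code> callback, and the parameter names are all illustrative. Each prompt carries its own e-process, an &epsilon;-greedy rule decides which unresolved prompt to test next, and comparing each wealth against K/&delta; (a union bound over Ville's inequality, with K candidates) keeps the FWER at most &delta; regardless of when or how testing stops.</p>

```python
import random

def altt_select(prompts, evaluate, delta=0.1, target=0.2,
                lam=1.0, eps=0.1, budget=2000, rng=None):
    """Adaptive sequential prompt selection (illustrative sketch).

    H0_k: the error rate of prompt k is at least `target`. One betting
    e-process per prompt; wealth crossing K/delta certifies prompt k as
    reliable, and the union bound keeps FWER <= delta under optional stopping.
    """
    rng = rng or random.Random(0)
    K = len(prompts)
    wealth = {k: 1.0 for k in range(K)}
    reliable = set()
    threshold = K / delta  # Bonferroni-style correction over K e-processes
    for _ in range(budget):
        active = [k for k in range(K) if k not in reliable]
        if not active:
            break  # every candidate certified: stop early, guarantee intact
        if rng.random() < eps:
            k = rng.choice(active)            # explore an unresolved prompt
        else:
            k = max(active, key=lambda j: wealth[j])  # exploit best evidence
        loss = evaluate(prompts[k])  # one costly LLM call, loss in [0, 1]
        wealth[k] *= 1.0 + lam * (target - loss)
        if wealth[k] >= threshold:
            reliable.add(k)
    return reliable
```

<p class="p1">With a deterministic <code>evaluate</code> that returns loss 0 for a good prompt and 1 for a bad one, the loop quickly concentrates its budget on the good prompt and certifies only that one; the bad prompt's wealth decays and is never certified.</p>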
<p class="p1">aLTT enables the sequential testing of prompts, adaptively prioritizing those that show early promise and terminating the process once enough reliable prompts have been found. As shown in <span class="s1"><b>Figure 2</b></span>, this not only reduces the computational burden (yielding a higher true discovery rate under the same testing budget), but also leads to the discovery of <span class="s1"><b>shorter, more effective prompts</b></span>, a valuable property in <span class="s1"><b>latency-sensitive</b></span> or <span class="s1"><b>resource-constrained</b></span> environments. <span class="s1"><b>The result:</b></span> fewer evaluations, higher-quality prompts, and rigorous statistical reliability.</p>
<div id="attachment_960" style="width: 685px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-960" class=" wp-image-960" src="https://blogs.kcl.ac.uk/kclip/files/2025/05/Results-300x188.png" alt="" width="675" height="423" srcset="https://blogs.kcl.ac.uk/kclip/files/2025/05/Results-300x188.png 300w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Results-1024x640.png 1024w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Results-768x480.png 768w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Results-676x423.png 676w, https://blogs.kcl.ac.uk/kclip/files/2025/05/Results.png 1454w" sizes="auto, (max-width: 675px) 100vw, 675px" /><p id="caption-attachment-960" class="wp-caption-text">Figure 2: (Left) True positive rate as a function of the testing horizon attained by aLTT with $\epsilon$-greedy exploration and by LTT. (Right) Length of the shortest prompt in the predicted set of reliable hyperparameters returned by aLTT and by LTT. aLTT requires fewer testing rounds to return short, high-quality prompts.</p></div>
</div>
<h3>References</h3>
<p>[1] Angelopoulos AN, Bates S, Candès EJ, Jordan MI, Lei L.
Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052, 2021.</p>
<p>[2] Xu Z, Wang R, Ramdas A. A unified framework for bandit multiple testing. Advances in Neural Information Processing Systems, 2021;34:16833&ndash;45.</p>
<p>[3] Zhou Y, Muresanu AI, Han Z, Paster K, Pitis S, Chan H, Ba J. Large language models are human-level prompt engineers. In: The Eleventh International Conference on Learning Representations, 2023.</p>
</div>
</div>