{"id":508,"date":"2021-06-23T05:48:07","date_gmt":"2021-06-23T05:48:07","guid":{"rendered":"https:\/\/blogs.kcl.ac.uk\/kclip\/?p=508"},"modified":"2021-06-23T05:48:07","modified_gmt":"2021-06-23T05:48:07","slug":"how-similar-should-similar-tasks-be-in-meta-learning-ask-an-information-theorist","status":"publish","type":"post","link":"https:\/\/blogs.kcl.ac.uk\/kclip\/2021\/06\/23\/how-similar-should-similar-tasks-be-in-meta-learning-ask-an-information-theorist\/","title":{"rendered":"How similar should \u201csimilar\u201d tasks be in meta-learning? (Ask an information theorist)"},"content":{"rendered":"<p><strong>Problem<\/strong><\/p>\n<p>Conventional learning optimizes model parameters using a training algorithm, while meta-learning optimizes the hyperparameters of a training algorithm. A meta-learner has access to data from a class of tasks, and its goal is to ensure that the resulting training algorithm, also called the base-learner, performs well on any new task from the same class. For example, the base-learner could be a stochastic gradient descent (SGD) algorithm with hyperparameters such as the initialization or the learning rate.<\/p>\n<p>The tasks observed during meta-training are conventionally assumed to belong to a task environment, which defines a distribution over the class of tasks, where each task has an associated data distribution. The statistical properties of the task environment then determine the similarity between the tasks. Intuitively, if the average \u201cdistance\u201d between the data distributions of any two tasks in the task environment is small, the meta-learner should be able to learn a suitable shared hyperparameter by observing fewer tasks.<\/p>\n<p>In our recent <a href=\"https:\/\/arxiv.org\/abs\/2101.08390\">work<\/a>, accepted to ISIT 2021, we build on the above observation and address the following questions for a fixed base-learner and meta-learner: How should task similarity be measured? 
Given the level of similarity of the tasks in the environment, how many tasks and how much data per task should be observed to guarantee that the target average population loss for new tasks can be well approximated using the available meta-training data?<\/p>\n<p>The difference between the average population loss on a new, previously unseen meta-test task and the meta-training loss on the data gathered from the meta-training tasks is the <strong>meta-generalization gap<\/strong>, which measures the generalization capability of the meta-learner. Our main contribution is a novel <strong>information-theoretic bound<\/strong> on the <strong>average absolute value of the meta-generalization gap<\/strong> that explicitly captures the impact of task relatedness, the number of tasks, and the number of data samples per task on meta-generalization.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Results<\/strong><\/p>\n<p>Although information-theoretic bounds on the generalization performance of meta-learning have been studied previously, in both <a href=\"https:\/\/arxiv.org\/pdf\/2005.04372.pdf\">average<\/a> and high-probability PAC-Bayesian settings, they fail to capture the impact of task similarity on the meta-generalization gap. We identify the following distinguishing components of our analysis that enable the explicit characterization of task similarity:<\/p>\n<ul>\n<li>Performance metric &#8211; Earlier <a href=\"https:\/\/arxiv.org\/pdf\/2005.04372.pdf\">work<\/a> on meta-learning considers the <strong>absolute average meta-generalization gap<\/strong> (<img loading=\"lazy\" decoding=\"async\" class=\"alignnone  wp-image-515\" src=\"https:\/\/blogs.kcl.ac.uk\/kclip\/files\/2021\/06\/pic3.png\" alt=\"\" width=\"55\" height=\"22\" \/>) as the performance metric, which computes the absolute value of the average of the meta-generalization gap over the selection of meta-training and meta-test tasks. 
By &#8220;mixing up&#8221; the tasks, first averaging over them and then taking the absolute value, this metric fails to account for the dissimilarity between the training and test tasks.\n<ul>\n<li>We mitigate this drawback via a new metric, namely the <strong>average absolute value of the meta-generalization gap<\/strong> (<img loading=\"lazy\" decoding=\"async\" class=\"alignnone  wp-image-514\" src=\"https:\/\/blogs.kcl.ac.uk\/kclip\/files\/2021\/06\/pic2.png\" alt=\"\" width=\"55\" height=\"25\" \/>). This metric first computes the absolute value of the meta-generalization gap for a given selection of meta-test task and meta-training tasks, and then averages it over all such selections. By doing so, it distinguishes the contribution of each selection of meta-training and meta-test tasks, and thus captures the role of task similarity in the generalization performance of a meta-learner. Moreover, in contrast to the absolute average meta-generalization gap, the new metric does not vanish in the asymptotic limit of a large number of tasks and per-task training samples. This reflects the fact that the meta-training loss cannot provide an asymptotically accurate estimate of the meta-test loss, which is evaluated on an a priori unknown task.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li>Measures of Task Relatedness \u2013 A task environment is said to be epsilon-related if the average &#8220;distance&#8221; between the data distributions of any two tasks in the environment is upper bounded by epsilon. We consider distance measures based on the KL divergence as well as on the Jensen-Shannon divergence. While the former can be unbounded, the latter is always bounded.<\/li>\n<\/ul>\n<p>Using the measures of performance and task relatedness defined above, we obtain novel information-theoretic bounds on the average absolute value of the meta-generalization gap. 
The obtained bound demonstrates that (a) as the task dissimilarity parameter increases, more meta-training tasks are required to ensure meta-generalization, and that (b) there exists a non-vanishing gap, which arises due to task dissimilarity, even in the limit of a large number of meta-training tasks and meta-test tasks.<\/p>\n<p>We also study examples where the obtained bound can be evaluated analytically or numerically. For the example of ridge regression with meta-learned bias, the following figure illustrates the impact of the task dissimilarity parameter on the two performance metrics and on their corresponding upper bounds. As can be seen, while the absolute average meta-generalization gap metric appears to be largely insensitive to task dissimilarity, our metric reveals the role of task similarity, as captured by the bounds derived in the paper.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-512 aligncenter\" src=\"https:\/\/blogs.kcl.ac.uk\/kclip\/files\/2021\/06\/pic1.png\" alt=\"\" width=\"585\" height=\"366\" srcset=\"https:\/\/blogs.kcl.ac.uk\/kclip\/files\/2021\/06\/pic1.png 750w, https:\/\/blogs.kcl.ac.uk\/kclip\/files\/2021\/06\/pic1-300x188.png 300w, https:\/\/blogs.kcl.ac.uk\/kclip\/files\/2021\/06\/pic1-676x423.png 676w\" sizes=\"auto, (max-width: 585px) 100vw, 585px\" \/><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Problem Conventional learning optimizes model parameters using a training algorithm, while meta-learning optimizes the hyperparameters of a training algorithm. A meta-learner has access to data from a class of tasks, and its goal is to ensure that the resulting training algorithm, also called the base-learner, performs well on any new task from the same class. 
For [&hellip;]<\/p>\n","protected":false},"author":865,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-508","post","type-post","status-publish","format-standard","hentry","category-uncategorized","post-preview"],"_links":{"self":[{"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/posts\/508","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/users\/865"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/comments?post=508"}],"version-history":[{"count":4,"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/posts\/508\/revisions"}],"predecessor-version":[{"id":517,"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/posts\/508\/revisions\/517"}],"wp:attachment":[{"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/media?parent=508"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/categories?post=508"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.kcl.ac.uk\/kclip\/wp-json\/wp\/v2\/tags?post=508"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}