Category: information theory

Generalization and Informativeness of Conformal Prediction

Motivation

When using a machine learning model to make important decisions, like in healthcare, finance, or engineering, we not only need accurate predictions but also want to know how sure the model is about its answers [1-3]. CP offers a practical solution for generating certified “error bars”—certified ranges of uncertainty—by post-processing the outputs of a fixed, pre-trained base predictor. This is crucial for safety and reliability. At the upcoming ISIT 2024 conference, we will present our research work, which aims to bridge the generalization properties of the base predictor with the expected size of the set predictions, also known as informativeness, produced by CP. Understanding the informativeness of CP is particularly relevant as it can usually only be assessed at test time.

Conformal prediction

Figure 1: Conformal prediction (CP) set predictors (gray areas) obtained by calibrating a base predictor with a higher generalization error on the left and a lower generalization error on the right. Thanks to CP, both set predictors satisfy a user-defined coverage guarantee, but the inefficiency, i.e., the average prediction set size, is larger when the generalization error of the base predictor is larger.

The most practical form of CP, known as inductive CP, divides the available data into a training set and a calibration set [4]. We use the training data to train a base model, and the calibration data to determine the prediction sets around the decisions made by the base model. As shown in Figure 1, a more accurate base predictor, which generalizes better outside the training set, tends to produce more informative sets when CP is applied.

Results

Figure 2: Bound on the average set size for different values of training and calibration data set sizes as a function of the target reliability level. Increasing the number of calibration data points causes the bound to converge exponentially fast to a function (black line) that is increasing in and decreasing in the amount of training data.

Our work’s main contribution is a high probability bound on the expected size of the predicted sets. The bound relates the informativeness of CP to the generalization properties of the base model and the amount of available training and calibration data. As illustrated in Figure 2, our bound predicts that by increasing the amount of calibration data CP’s efficiency converges rapidly to a quantity influenced by the coverage level, the size of the training set, and the predictor’s generalization performance. However, for finite amount of calibration data, the bound is also influenced by the discrepancy between the target and empirical reliability measured over the training data set. Overall, the bound justifies a common practice: allocating more data to train the base model compared to the data used to calibrate it.

Figure 3: Normalized empirical CP set size for a multi-class classification problem on the MNIST data set as a function of the reliability level and for different sizes of the calibration and training data sets.

Since what really proves the worth of a theory is how well it holds up in real-world testing, we also compare our theoretical findings with numerical evaluations. In our study, we looked at two classification and regression tasks. We ran CP with various splits of calibration and training data, then measured the average efficiency. As shown in the Figure 3, the empirical results from our experiments matched up nicely with what our theory predicted in Figure 2.

References

[1] A. L. Beam and I. S. Kohane, “Big data and machine learning in health care,” JAMA, vol. 319, no. 13, pp. 1317–1318, 2018.

[2] J.. W. Goodell, S. Kumar, W. M. Lim, and D. Pattnaik, “Artificial intelligence and machine learning in finance: Identifying foundations, themes, and research clusters from bibliometric analysis,” Journal of Behavioral and Experimental Finance, vol. 32, p. 100577, 2021.

[3] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based model predictive control: Toward safe learning in control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.

[4] V. Vovk, A. Gammerman, and G. Shafer, Algorithmic learning in a random world, vol. 29. Springer, 2005.

Information-Centric Grant-Free Access for IoT Fog Networks: Edge vs Cloud Detection and Learning

Problem

With the advent of 5G, cellular systems are expected to play an increasing role in enabling Internet of Things (IoT). This is partly due to the introduction of NarrowBand IoT (NB-IoT), a cellular-based radio technology allowing low-cost and long-battery life connections, in addition to other IoT protocols that operate in the unlicensed band such as LoRa. However, these protocols allow for a successful transmission only when a radio resource is used by a single IoT device. Therefore, generally, the amount of resources needed scales with the number of active devices. This poses a serious challenge in enabling massive connectivity in future cellular systems. In our recent IEEE Transactions on Wireless Communications paper, we tackle this issue.

Figure 1: A Fog-Radio architecture where processing information from IoT devices, denoted by the theta symbol,  can take place either at the Cloud or the Edge Node.

Suggested Solution

In our new paper, we propose an information-centric radio access technique where IoT devices making (roughly) the same observation of a given monitored quantity, e.g., temperature, transmit using the same radio resource, i.e., in a non-orthogonal fashion. Thus, the number of radio resources needed scales with the number of possible relevant values observed, e.g., high or low temperature and not with the number of devices.

Cellular networks are evolving toward Fog-Radio architectures, as shown in Figure 1. In these systems, instead of the entire processing happening at the edge node, radio access related functionalities can be distributed between the cloud and the edge. We propose that detection in the IoT system under study be implemented at either cloud or edge depending on backhaul conditions and on the statistics of the observations.

Some Results

One of the important findings of this work is that cloud detection is able to leverage inter-cell interference in order to improve detection performance, as shown in the figure below. This is mainly due to the fact that devices transmitting the same values in different cells are non-orthogonally superposed and thus, the cloud can detect these values with higher confidence.

More details and results can be found in the complete version of the paper here.

Using Machine learning to Measure Intrinsic and Synergistic Information Flows

Context

Quantifying the causal flow of information between different components of a system is an important task for many natural and engineered systems, such as neural, genetic, transportation and social networks. A well-established metric of the information flow between two time sequences  and  that has been widely applied for this purpose is the information-theoretic measure of Transfer Entropy (TE). The TE equals the mutual information between the past of sequence  and the current value at time t when conditioning on the past of . However, the TE has limitations as a measure of intrinsic, or exclusive, information flow from sequence to sequence . In fact, as pointed out in this paper, the TE captures not only the amount of information on that is contained in the past of in addition to that already present in the past of , but also the information about that is obtained only when combining the past of both and . Only the first type of information flow may be defined as intrinsic, while the second can be thought of as a synergistic flow of information involving both sequences.

In the same paper, the authors propose to decompose the TE as the sum of an Intrinsic TE (ITE) and a Synergistic TE (STE), and introduce a measure of the ITE based on cryptography. The idea is to measure the ITE as the size (in bits) of a secret key that can be generated by two parties, one holding the past of sequence and the other , via public communication, when the adversary has the past of sequence .

The computation of ITE is generally intractable. To estimate ITE, in recent work, we proposed an estimator, referred to as ITE Neural Estimator (ITENE), of the ITE that is based on variational bound on the KL divergence, two-sample neural network classifiers, and the pathwise estimator of Monte Carlo gradients.

 

Some Results

We first apply the proposed estimator to the following toy example. The joint processes are generated according to

for some threshold λ, where variables are independent and identically distributed as .  Intuitively, for large values of the threshold λ, there is no information flow between  and , while for small values, there is a purely intrinsic flow of information. For intermediate values of λ, the information flow is partly synergistic, since knowing both and is instrumental in obtaining

Figure 1

 

information about .  As illustrated in Fig. 1, the results obtained from the estimator are consistent with this intuition.

 

Figure 2

For a real-world example, we apply the estimators at hand to historic data of the values of the Hang Seng Index (HSI) and of the Dow Jones Index (DJIA) between 1990 and 2011 (see Fig. 2). As illustrated in Fig. 3, both the TE and ITE from the DJIA to the HSI are much larger than in the reverse direction, implying that the DJIA influenced the HSI more significantly than the other way around for the

Figure 3

given time range. Furthermore, we observe that not all the information flow is estimated to be intrinsic, and hence the joint observation of the history of the DJIA and of the HSI is partly responsible for the predictability of the HSI from the DJIA.

The full paper will be presented at 2020 International Zurich Seminar on Information and Communication and can be found here.