Category: Edge Computing

Wireless Reliable Federated Inference

Written by Meiyi Zhu during her visit to KCLIP.

Motivation

Consider a wireless federated inference scenario in which the devices and a server share a pre-trained machine learning model, e.g., trained via federated learning. The server wishes to make an inference on its own new input based on such a pre-trained machine learning model. Note that the server has no access to the data; the data is only presented at the devices. This scenario is common in practice. For example, a personal healthcare system would first train the respective model via federated learning, without acquiring personal data from the end users; while upon achieving a trained healthcare model, wishes to provide useful solution to new users. We will assume that new users ask queries to the central server, while the general conclusion made in this article retains even for the case in which the new user has its own access to the pre-trained model.

However, depending on the quality of the pre-trained model, e.g., lack of data, the solution provided by the pre-trained model may yield wrong decisions. More importantly, such model is likely to yield unreliable decisions; see, e.g., our previous post ‘Is Accuracy Sufficient for AI in 6G? (No, Calibration is Equally Important)’. As reliability plays an important role in various fields including healthcare monitoring and autonomous vehicle navigation, it is important to find ways to make the federated inference reliable. But how can we make the pre-trained model reliable as the central server has no access to the data at all?

Recent work has introduced federated conformal prediction (CP), which improves the reliability of the server’s decision by utilizing available held-out local data at each device, of course, without central server’s access to such data. The goal of federated CP is to provide a guaranteed interval or set of potential outputs that contains the correct answer at a predefined reliability level [1, 2]. As a state-of-the-art solution, reference [1] proposed a quantile-of-quantile (QQ) scheme, referred to as FedCP-QQ, whereby each device computes and communicates a pre-determined quantile of the local losses. However, existing work assumed noise-free communication between the server and the devices, whereby devices can communicate a single real number to the server.

Wireless Federated Conformal Prediction

In our recent work, to appear in Transactions on Signal Processing, we study for the first time federated CP in a wireless setting, as illustrated in Fig. 1. Specifically, we introduce a novel protocol, termed wireless federated conformal prediction (WFCP), which builds on type-based multiple access (TBMA) and on a novel quantile correction scheme.

Fig. 1. Illustration of the wireless reliable federated inference problem under study.

TBMA is a multiple access scheme that aims at recovering aggregated statistics, rather than individual messages [3]. By noting that federated CP also requires aggregated statistics across the devices, i.e., quantile, we have proposed to apply TBMA for WFCP. More precisely, as illustrated in Fig. 2, TBMA enables the estimate of the global histogram of data available across all devices without having to separately estimate the histograms of all devices. Specifically, each histogram bin is assigned an orthogonal codeword and the server can estimate the global histogram thanks to the superposition property of wireless communications. In this way, WFCP enables a direct estimate of the global quantile at the server without imposing bandwidth requirements that scale linearly with the number of active devices like FedCP-QQ. Rather, the communication requirements of WFCP are only dictated by the precision with which the signals are represented for transmission to the server, i.e., the length of each codeword.

Fig. 2. Illustration of the TBMA enabled communication model.

The other key technical challenge tackled in our work is the derivation of a novel quantile correction approach that ensures the reliability of the set predictor despite the presence of channel noise.

Experiments

We evaluate our proposed WFCP on CIFAR-10 data set over Rayleigh fading channels. We show here one of the results that plots the performance gains of WFCP in the presence of limited communication resources. In Fig. 3, we evaluate the performance of WFCP and our implementation of existing FedCP-QQ (DQQ) over wireless channels using finite blocklength information theory as a function of SNR. As SNR increases, both WFCP and DQQ maintain the target reliability level, while offering a decreasing prediction set size. Across all the SNRs, WFCP generates a more informative predicted set than DQQ, and it approaches the performance of the centralized CP. Please refer to our paper for more details.

 

Fig. 3. Empirical coverage and normalized empirical inefficiency of centralized CP, WFCP, and digital implementation of existing FedCP-QQ [1].

References

[1] P. Humbert, B. Le Bars, A. Bellet, and S. Arlot, “One-shot federated conformal prediction,” ICML 2023

[2] C. Lu and J. Kalpathy-Cramer, “Distribution-free federated learning with conformal predictions,” arXiv:2110.07661, 2021

[3 G. Mergen and L. Tong, “Type based estimation over multiaccess channels,” IEEE TSP 2006

Privacy in Wireless Federated Learning is Free

–when the SNR is small enough

Problem Description 

Federated Learning (FL) refers to distributed protocols that avoid direct raw data exchange among the participating devices while training for a common learning task. This way, FL can potentially reduce the information on the local data sets that is leaked via communications. Nevertheless, the model updates shared by the devices may still reveal information about local data. For example, a malicious server could potentially infer the presence of an individual data sample from a learnt model by membership inference attack or model inversion attack. 

Differential privacy (DP) quantifies information leaked about individual data points by measuring the sensitivity of the disclosed statistics to changes in the input data set at a single data point. DP can be guaranteed by introducing a level of uncertainty into the released model that is sufficient to mask the contribution of any individual data point. The most typical approach is to add random perturbations, e.g., Gaussian. This suggests that,  when FL is implemented in wireless systems, the channel noise can directly act as a privacy-inducing mechanism. 

Suggested Solution 

In recent work, we have designed differentially private wireless distributed gradient descent via the direct, uncoded, transmission of gradients from devices to edge server. The channel noise is utilized as a privacy preserving mechanism and dynamic power control is separately optimized for orthogonal multiple access  (OMA) and non-orthogonal multiple access  (NOMA) protocols with the goal of minimizing the learning optimality gap under privacy and power constraints across a given number of communication blocks.  Our recent work to appear in IEEE Journal on Selected Areas in Communications tackles this problem. One of our main results shows that, as long as the privacy constraint level, measured via DP, is below a threshold that decreases with the signal-to-noise ratio (SNR), uncoded transmission achieves privacy “for free”, i.e., without affecting the learning performance. As our analysis demonstrates, channel noise added in the first iterations tends to impact convergence less significantly than the noise added in later iterations, whereas the privacy level depends on a weighted sum of the inverse noise power across the iteration. These properties, captured by compact analytical expressions derived in this paper, are leveraged for adaptive power allocation, yielding significant performance gains over standard static power allocation. 

Some Results 

The performance is first evaluated by using randomly generated synthetic dataset. In the considered range of DP level,  as illustrated in the figure below, NOMA with either adaptive or static power allocation (PA) achieves better performance than OMA. Furthermore, the proposed adaptive PA obtains a significant performance gain over static PA under stringent DP constraints, while the performance advantage of adaptive PA decreases as the DP constraint is relaxed. The figure also shows the threshold values of DP level beyond which the privacy “for free”.  

The performance is also evaluated by MNIST data set as summarized in the last figure. With conventional static PA, the increasing communication budget is seen to largely degrade performance. This is because more communication blocks may cause an increase in privacy loss. In contrast, adaptive PA is able to properly allocate power across the communication blocks thereby achieves a lower training loss.

Information-Centric Grant-Free Access for IoT Fog Networks: Edge vs Cloud Detection and Learning

Problem

With the advent of 5G, cellular systems are expected to play an increasing role in enabling Internet of Things (IoT). This is partly due to the introduction of NarrowBand IoT (NB-IoT), a cellular-based radio technology allowing low-cost and long-battery life connections, in addition to other IoT protocols that operate in the unlicensed band such as LoRa. However, these protocols allow for a successful transmission only when a radio resource is used by a single IoT device. Therefore, generally, the amount of resources needed scales with the number of active devices. This poses a serious challenge in enabling massive connectivity in future cellular systems. In our recent IEEE Transactions on Wireless Communications paper, we tackle this issue.

Figure 1: A Fog-Radio architecture where processing information from IoT devices, denoted by the theta symbol,  can take place either at the Cloud or the Edge Node.

Suggested Solution

In our new paper, we propose an information-centric radio access technique where IoT devices making (roughly) the same observation of a given monitored quantity, e.g., temperature, transmit using the same radio resource, i.e., in a non-orthogonal fashion. Thus, the number of radio resources needed scales with the number of possible relevant values observed, e.g., high or low temperature and not with the number of devices.

Cellular networks are evolving toward Fog-Radio architectures, as shown in Figure 1. In these systems, instead of the entire processing happening at the edge node, radio access related functionalities can be distributed between the cloud and the edge. We propose that detection in the IoT system under study be implemented at either cloud or edge depending on backhaul conditions and on the statistics of the observations.

Some Results

One of the important findings of this work is that cloud detection is able to leverage inter-cell interference in order to improve detection performance, as shown in the figure below. This is mainly due to the fact that devices transmitting the same values in different cells are non-orthogonally superposed and thus, the cloud can detect these values with higher confidence.

More details and results can be found in the complete version of the paper here.

Federated Neuromorphic Computing

Problem
Training state-of-the-art Artificial Neural Network (ANN) models requires distributed computing on large mixed CPU-GPU clusters, typically over many days or weeks, at the expense of massive memory, time, and energy resources, and potentially of privacy violations. Alternative solutions for low-power machine learning on resource-constrained devices have been recently the focus of intense research. In our recently accepted paper at ICASSP 2020, we study the convergence of two such recent lines of inquiries.

On the one hand, Spiking Neural Networks (SNNs) are biologically inspired neural networks in which neurons are dynamic elements processing and communicating via sparse spiking signals over time, rather than via real numbers, enabling the native processing of time-encoded data, e.g., from DVS cameras. They can be implemented on dedicated hardware, offering energy consumptions as low as a few picojoules per spike. A more thorough introduction to probabilistic SNNs can be found in this previous blog post.

On the other hand, Federated Learning (FL) allows devices to carry out collaborative learning without exchanging local data. This makes it possible to train more effective machine learning models by benefiting from data at multiple devices with limited privacy concerns. FL requires devices to periodically exchange information about their local model parameters through a parameter server. It has become de-facto standard for training ANNs over large numbers of distributed devices.

System model

Figure 1 Federated Learning (FL) model under study: Mobile devices collaboratively train on-device SNNs based on different, heterogeneous, and generally unbalanced local data sets, by communicating through a base station (BS).

In our work, as seen in Figure 1, we consider a distributed edge computing architecture in which N mobile devices communicate through a Base Station (BS) in order to perform the collaborative training of local SNN models via FL. Each device holds a different local data set. The goal of FL is to train a common SNN-based model without direct exchange of the data from the local data sets.

FL proceeds in an iterative fashion across T global time-steps. To elaborate, at each global time-step, the devices refine their local model, based on their local datasets. Every τ iterations, they will also transmit their updated local model parameters to the BS, which will in turn compute a centralized averaged parameter and send it back to the devices. This global averaged parameter will be used at the beginning of the next iteration.

An SNN is a network of spiking neurons connected via an arbitrary directed graph, possibly with cycles (see Figure 2). SNNs process information through time, based on a local clock. At each local algorithmic time-step, each neuron receives the signals emitted by the subset of neurons connected to it through directed links, known as synapses. Neurons in the network will then output a binary signal, either ‘0’ or ‘1’. The instantaneous spiking probability of a neuron is determined by its past spiking behaviour and the previous spikes of its pre-synaptic neurons. SNNs are trained over sequences of S local algorithmic time-steps, made of D examples of length S’. In an image classification task, an example could be an image encoded as a binary signal.

Figure 2 Example of an internal architecture for an on-device SNN.

In FL-SNN, we cooperatively train distributed on-device SNNs thanks to Federated Learning. To that end, we derived a novel algorithm, for which the time scales involved are summarized in Figure 3. Each global algorithmic iteration t corresponds to Δs local SNN time-steps, and the total number S of SNN local algorithmic time steps and the number T of global algorithmic time steps during the training procedure are hence related as S = DS’ = T∆s.

Figure 3 Illustration of the time scales involved in the cooperative training of SNNs via FL for τ = 3 and ∆s = 4.

Experiments

We consider a classification task based on the MNIST-DVS dataset. The training dataset is composed of 900 examples per class and the test dataset is composed of 100 samples per class. We consider 2 devices which have access to disjoint subsets of the training dataset. In order to validate the advantages of FL, we assume that the first device has only samples from class ‘1’ and the second only from class ‘7’. We train over D = 400 randomly selected examples from the local data sets, which results in S = DS’ = 32,000 local time-steps.

As a baseline, we consider the test loss at convergence for the separate training of the two SNNs. In Figure 4, we plot the local test loss normalized by the mentioned baseline as a function of the global algorithmic time. A larger communication period τ is seen to impair the learning capabilities of the SNNs, yielding a larger final value of the loss. In fact, for τ = 400, after a number of local iterations without communication, the individual devices are not able to make use of their data to improve performance.

Figure 4 Evolution of the mean test loss during training for different values of the communication period τ. Shaded areas represent standard deviations over 3 trials

One of the major flaws of FL is the communication load incurred by the need to regularly transmit large model parameters. To partially explore this aspect, in the paper, we consider exchanging only a subset of synaptic weights during global iterations. We refer to the text at this link for details.

Integrating Wireless Access and Edge Learning

Problem

Figure 1. Delay-constrained edge learning based on data received from a device.

The increasing number of connected devices has led to an explosion in the amounts of data being collected: smartphones, wearable devices and sensors generate data to an extent previously unseen. However, these devices often present power and computational capability constraints that do not allow them to make use of the data – for instance, to train Machine Learning (ML) models. In such circumstances, thanks to mobile edge computing, devices can rely on remote servers to perform the data processing (see Fig. 1). When the amount of data is large, or the access link slow, the amount of time required to transmit the data may be prohibitive. Given a delay constraint on the overall time available for both communication and learning, what is the joint communication-computation strategy that obtains the best performing ML model?

Pipelining communication and computation

Figure 2. Transmission and training protocol.

In a recent work to be published in IEEE Communication Letters, we propose to pipeline communication and computation with an optimized block size. We consider an Empirical Risk Minimization (ERM) problem, for which learning is carried at the server side using Stochastic Gradient Descent (SGD). As the first data block arrives at the server, training of the ML model can start. This continues by fetching data from all the data blocks received thus far. To provide some intuition on the problem of optimizing the block size, communicating the entire data set first reduces the bias of the training process but it may not leave sufficient time for learning. Conversely, transmitting very few samples in each block will bias the model towards the samples sent in the first blocks, as many computation rounds will happen based on these samples.
We determine an upper bound on the expected optimality gap at the end of the time limit, which gives us an indication on how far we are from an optimal model. We can then minimize this bound with regard to the communication block size to obtain an optimized value.

Some results

Figure 3. Training loss versus training time for different values of the block size. Solid line: experimental and theoretical optima.

Numerical experiments allowed us to compare the optimal block size found using the bound with a numerically determined optimal value found by running Monte Carlo experiments over all possible block sizes. Determining the optimal value through an extensive search over the possible block sizes allowed a gain of 3.8% in terms of the final training loss in one of our experiments (see Fig. 3). This small gain comes at the cost of a burdensome parameter optimization that took days on an HPC cluster. Minimizing the proposed bound takes seconds.
We further experimentally determined that our results, which were derived for convex loss functions satisfying the Polyak-Lojasiewicz condition, can be extended to non-convex models. As an example (not found in the paper), we studied the problem of training a multilayer perceptron with non-linear activations according to our scheme (see Fig. 4). Using the same dataset as described in the paper, we train a 2-layers perceptron with ReLU activation for the first layer and linear activation for the second. The experiments show a similar behaviour to the convex example discussed in the main text. In particular, the derived bound predicts well the existence of an optimum value of the block size (see crosses).

Figure 4. Training loss versus block size for different overhead sizes, for an MLP with non-linear activations.

The full paper can be found here.