Category: distributed computing

Wireless Reliable Federated Inference

Written by Meiyi Zhu during her visit to KCLIP.

Motivation

Consider a wireless federated inference scenario in which the devices and a server share a pre-trained machine learning model, e.g., trained via federated learning. The server wishes to make an inference on its own new input based on such a pre-trained machine learning model. Note that the server has no access to the data; the data is only presented at the devices. This scenario is common in practice. For example, a personal healthcare system would first train the respective model via federated learning, without acquiring personal data from the end users; while upon achieving a trained healthcare model, wishes to provide useful solution to new users. We will assume that new users ask queries to the central server, while the general conclusion made in this article retains even for the case in which the new user has its own access to the pre-trained model.

However, depending on the quality of the pre-trained model, e.g., lack of data, the solution provided by the pre-trained model may yield wrong decisions. More importantly, such model is likely to yield unreliable decisions; see, e.g., our previous post ‘Is Accuracy Sufficient for AI in 6G? (No, Calibration is Equally Important)’. As reliability plays an important role in various fields including healthcare monitoring and autonomous vehicle navigation, it is important to find ways to make the federated inference reliable. But how can we make the pre-trained model reliable as the central server has no access to the data at all?

Recent work has introduced federated conformal prediction (CP), which improves the reliability of the server’s decision by utilizing available held-out local data at each device, of course, without central server’s access to such data. The goal of federated CP is to provide a guaranteed interval or set of potential outputs that contains the correct answer at a predefined reliability level [1, 2]. As a state-of-the-art solution, reference [1] proposed a quantile-of-quantile (QQ) scheme, referred to as FedCP-QQ, whereby each device computes and communicates a pre-determined quantile of the local losses. However, existing work assumed noise-free communication between the server and the devices, whereby devices can communicate a single real number to the server.

Wireless Federated Conformal Prediction

In our recent work, to appear in Transactions on Signal Processing, we study for the first time federated CP in a wireless setting, as illustrated in Fig. 1. Specifically, we introduce a novel protocol, termed wireless federated conformal prediction (WFCP), which builds on type-based multiple access (TBMA) and on a novel quantile correction scheme.

Fig. 1. Illustration of the wireless reliable federated inference problem under study.

TBMA is a multiple access scheme that aims at recovering aggregated statistics, rather than individual messages [3]. By noting that federated CP also requires aggregated statistics across the devices, i.e., quantile, we have proposed to apply TBMA for WFCP. More precisely, as illustrated in Fig. 2, TBMA enables the estimate of the global histogram of data available across all devices without having to separately estimate the histograms of all devices. Specifically, each histogram bin is assigned an orthogonal codeword and the server can estimate the global histogram thanks to the superposition property of wireless communications. In this way, WFCP enables a direct estimate of the global quantile at the server without imposing bandwidth requirements that scale linearly with the number of active devices like FedCP-QQ. Rather, the communication requirements of WFCP are only dictated by the precision with which the signals are represented for transmission to the server, i.e., the length of each codeword.

Fig. 2. Illustration of the TBMA enabled communication model.

The other key technical challenge tackled in our work is the derivation of a novel quantile correction approach that ensures the reliability of the set predictor despite the presence of channel noise.

Experiments

We evaluate our proposed WFCP on CIFAR-10 data set over Rayleigh fading channels. We show here one of the results that plots the performance gains of WFCP in the presence of limited communication resources. In Fig. 3, we evaluate the performance of WFCP and our implementation of existing FedCP-QQ (DQQ) over wireless channels using finite blocklength information theory as a function of SNR. As SNR increases, both WFCP and DQQ maintain the target reliability level, while offering a decreasing prediction set size. Across all the SNRs, WFCP generates a more informative predicted set than DQQ, and it approaches the performance of the centralized CP. Please refer to our paper for more details.

 

Fig. 3. Empirical coverage and normalized empirical inefficiency of centralized CP, WFCP, and digital implementation of existing FedCP-QQ [1].

References

[1] P. Humbert, B. Le Bars, A. Bellet, and S. Arlot, “One-shot federated conformal prediction,” ICML 2023

[2] C. Lu and J. Kalpathy-Cramer, “Distribution-free federated learning with conformal predictions,” arXiv:2110.07661, 2021

[3 G. Mergen and L. Tong, “Type based estimation over multiaccess channels,” IEEE TSP 2006

Distributed Quantum Entanglement Distillation via Quantum Machine Learning

Quantum networking, and with it the quantum Internet, rely on the management and exploitation of entanglement. In fact, entangled qubits enable fundamental quantum communication primitives such as teleportation and superdense coding. Practical sources of entangled qubits, such as single-photon detection, are imperfect, producing mixed states with reduced fidelity as compared to ideal, fully entangled, Bell pairs. In order to enhance the fidelity of entangled qubits available at distributed parties, entanglement distillation protocols leverage local operations and classical communication (LOCC). While existing solutions, such as DEJMPS protocol [1] and LOCCNet [2], assume ideal classical communications, we study the case in which communications between the parties holding imperfectly entangled qubits are noisy. As illustrated in Fig. 1, to address this more challenging scenario, we propose the use of quantum machine learning (QML) [3] via parameterized quantum circuits (PQCs).

 

Fig. 1: Entanglement distillation at two quantum-enabled devices (Alice and Bob) aided by a noisy classical communication channel to a third party (Charlie).

 

Noise Aware-LOCCNet (NA-LOCCNet)

In our recent work, accepted for presentation at IEEE ICASSP 2023, we introduce NA-LOCCNet, as shown in Fig. 2, which improves average output fidelity while accounting for the channel errors.

 

Fig. 2: Proposed Noise Aware-LOCCNet (NA-LOCCNet) circuit for distilling two S states.

 

Experiments

Fig.3 plots average output fidelity as a function of bit-flip probability of noisy channel for a given input fidelity, whereas Fig. 4 plots the same quantity as a function of input fidelity for a given bit flip probability of noisy channel. In both the figures NA-LOCCNet performs far better than the state of the art protocols.

 

Fig. 3: Average output fidelity as a function of the bit flip probability p of the noisy classical channels from Alice and Bob to Charlie for input fidelity F = 0.6.

 

Fig. 4: Average output fidelity, conditioned on a successful distillation, as a function of the input fidelity F for bit flip probability p = 0.25 on the noisy classical channels from Alice and Bob to Charlie. The black dashed line corresponds to the reference performance of a scheme that simply outputs the input state.

 

In our another recent work, published in Entropy, we have extended the NA-LOCCNet framework to the problem of quantum state discrimination.

 

[1] D. Deutsch, A. Ekert, R. Jozsa, C. Macchiavello, S. Popescu, and A. Sanpera, “Quantum privacy amplification and the security of quantum cryptography over noisy channels”, Phys. Rev. Lett., vol. 77, pp. 2818– 2821, Sep 1996.

[2] X. Zhao, B. Zhao, Z. Wang, Z. Song, and X. Wang, “Practical distributed quantum information processing with LOCCNet,” Quantum Information, vol. 7, no. 1, pp. 1–7, 2021.

[3] O. Simeone, “An introduction to quantum machine learning for engineers”,
Foundations and Trends in Signal Processing, vol. 16, no. 1-2, pp. 1–223, 2022.

Privacy in Wireless Federated Learning is Free

–when the SNR is small enough

Problem Description 

Federated Learning (FL) refers to distributed protocols that avoid direct raw data exchange among the participating devices while training for a common learning task. This way, FL can potentially reduce the information on the local data sets that is leaked via communications. Nevertheless, the model updates shared by the devices may still reveal information about local data. For example, a malicious server could potentially infer the presence of an individual data sample from a learnt model by membership inference attack or model inversion attack. 

Differential privacy (DP) quantifies information leaked about individual data points by measuring the sensitivity of the disclosed statistics to changes in the input data set at a single data point. DP can be guaranteed by introducing a level of uncertainty into the released model that is sufficient to mask the contribution of any individual data point. The most typical approach is to add random perturbations, e.g., Gaussian. This suggests that,  when FL is implemented in wireless systems, the channel noise can directly act as a privacy-inducing mechanism. 

Suggested Solution 

In recent work, we have designed differentially private wireless distributed gradient descent via the direct, uncoded, transmission of gradients from devices to edge server. The channel noise is utilized as a privacy preserving mechanism and dynamic power control is separately optimized for orthogonal multiple access  (OMA) and non-orthogonal multiple access  (NOMA) protocols with the goal of minimizing the learning optimality gap under privacy and power constraints across a given number of communication blocks.  Our recent work to appear in IEEE Journal on Selected Areas in Communications tackles this problem. One of our main results shows that, as long as the privacy constraint level, measured via DP, is below a threshold that decreases with the signal-to-noise ratio (SNR), uncoded transmission achieves privacy “for free”, i.e., without affecting the learning performance. As our analysis demonstrates, channel noise added in the first iterations tends to impact convergence less significantly than the noise added in later iterations, whereas the privacy level depends on a weighted sum of the inverse noise power across the iteration. These properties, captured by compact analytical expressions derived in this paper, are leveraged for adaptive power allocation, yielding significant performance gains over standard static power allocation. 

Some Results 

The performance is first evaluated by using randomly generated synthetic dataset. In the considered range of DP level,  as illustrated in the figure below, NOMA with either adaptive or static power allocation (PA) achieves better performance than OMA. Furthermore, the proposed adaptive PA obtains a significant performance gain over static PA under stringent DP constraints, while the performance advantage of adaptive PA decreases as the DP constraint is relaxed. The figure also shows the threshold values of DP level beyond which the privacy “for free”.  

The performance is also evaluated by MNIST data set as summarized in the last figure. With conventional static PA, the increasing communication budget is seen to largely degrade performance. This is because more communication blocks may cause an increase in privacy loss. In contrast, adaptive PA is able to properly allocate power across the communication blocks thereby achieves a lower training loss.

Federated Neuromorphic Computing

Problem
Training state-of-the-art Artificial Neural Network (ANN) models requires distributed computing on large mixed CPU-GPU clusters, typically over many days or weeks, at the expense of massive memory, time, and energy resources, and potentially of privacy violations. Alternative solutions for low-power machine learning on resource-constrained devices have been recently the focus of intense research. In our recently accepted paper at ICASSP 2020, we study the convergence of two such recent lines of inquiries.

On the one hand, Spiking Neural Networks (SNNs) are biologically inspired neural networks in which neurons are dynamic elements processing and communicating via sparse spiking signals over time, rather than via real numbers, enabling the native processing of time-encoded data, e.g., from DVS cameras. They can be implemented on dedicated hardware, offering energy consumptions as low as a few picojoules per spike. A more thorough introduction to probabilistic SNNs can be found in this previous blog post.

On the other hand, Federated Learning (FL) allows devices to carry out collaborative learning without exchanging local data. This makes it possible to train more effective machine learning models by benefiting from data at multiple devices with limited privacy concerns. FL requires devices to periodically exchange information about their local model parameters through a parameter server. It has become de-facto standard for training ANNs over large numbers of distributed devices.

System model

Figure 1 Federated Learning (FL) model under study: Mobile devices collaboratively train on-device SNNs based on different, heterogeneous, and generally unbalanced local data sets, by communicating through a base station (BS).

In our work, as seen in Figure 1, we consider a distributed edge computing architecture in which N mobile devices communicate through a Base Station (BS) in order to perform the collaborative training of local SNN models via FL. Each device holds a different local data set. The goal of FL is to train a common SNN-based model without direct exchange of the data from the local data sets.

FL proceeds in an iterative fashion across T global time-steps. To elaborate, at each global time-step, the devices refine their local model, based on their local datasets. Every τ iterations, they will also transmit their updated local model parameters to the BS, which will in turn compute a centralized averaged parameter and send it back to the devices. This global averaged parameter will be used at the beginning of the next iteration.

An SNN is a network of spiking neurons connected via an arbitrary directed graph, possibly with cycles (see Figure 2). SNNs process information through time, based on a local clock. At each local algorithmic time-step, each neuron receives the signals emitted by the subset of neurons connected to it through directed links, known as synapses. Neurons in the network will then output a binary signal, either ‘0’ or ‘1’. The instantaneous spiking probability of a neuron is determined by its past spiking behaviour and the previous spikes of its pre-synaptic neurons. SNNs are trained over sequences of S local algorithmic time-steps, made of D examples of length S’. In an image classification task, an example could be an image encoded as a binary signal.

Figure 2 Example of an internal architecture for an on-device SNN.

In FL-SNN, we cooperatively train distributed on-device SNNs thanks to Federated Learning. To that end, we derived a novel algorithm, for which the time scales involved are summarized in Figure 3. Each global algorithmic iteration t corresponds to Δs local SNN time-steps, and the total number S of SNN local algorithmic time steps and the number T of global algorithmic time steps during the training procedure are hence related as S = DS’ = T∆s.

Figure 3 Illustration of the time scales involved in the cooperative training of SNNs via FL for τ = 3 and ∆s = 4.

Experiments

We consider a classification task based on the MNIST-DVS dataset. The training dataset is composed of 900 examples per class and the test dataset is composed of 100 samples per class. We consider 2 devices which have access to disjoint subsets of the training dataset. In order to validate the advantages of FL, we assume that the first device has only samples from class ‘1’ and the second only from class ‘7’. We train over D = 400 randomly selected examples from the local data sets, which results in S = DS’ = 32,000 local time-steps.

As a baseline, we consider the test loss at convergence for the separate training of the two SNNs. In Figure 4, we plot the local test loss normalized by the mentioned baseline as a function of the global algorithmic time. A larger communication period τ is seen to impair the learning capabilities of the SNNs, yielding a larger final value of the loss. In fact, for τ = 400, after a number of local iterations without communication, the individual devices are not able to make use of their data to improve performance.

Figure 4 Evolution of the mean test loss during training for different values of the communication period τ. Shaded areas represent standard deviations over 3 trials

One of the major flaws of FL is the communication load incurred by the need to regularly transmit large model parameters. To partially explore this aspect, in the paper, we consider exchanging only a subset of synaptic weights during global iterations. We refer to the text at this link for details.

Integrating Wireless Access and Edge Learning

Problem

Figure 1. Delay-constrained edge learning based on data received from a device.

The increasing number of connected devices has led to an explosion in the amounts of data being collected: smartphones, wearable devices and sensors generate data to an extent previously unseen. However, these devices often present power and computational capability constraints that do not allow them to make use of the data – for instance, to train Machine Learning (ML) models. In such circumstances, thanks to mobile edge computing, devices can rely on remote servers to perform the data processing (see Fig. 1). When the amount of data is large, or the access link slow, the amount of time required to transmit the data may be prohibitive. Given a delay constraint on the overall time available for both communication and learning, what is the joint communication-computation strategy that obtains the best performing ML model?

Pipelining communication and computation

Figure 2. Transmission and training protocol.

In a recent work to be published in IEEE Communication Letters, we propose to pipeline communication and computation with an optimized block size. We consider an Empirical Risk Minimization (ERM) problem, for which learning is carried at the server side using Stochastic Gradient Descent (SGD). As the first data block arrives at the server, training of the ML model can start. This continues by fetching data from all the data blocks received thus far. To provide some intuition on the problem of optimizing the block size, communicating the entire data set first reduces the bias of the training process but it may not leave sufficient time for learning. Conversely, transmitting very few samples in each block will bias the model towards the samples sent in the first blocks, as many computation rounds will happen based on these samples.
We determine an upper bound on the expected optimality gap at the end of the time limit, which gives us an indication on how far we are from an optimal model. We can then minimize this bound with regard to the communication block size to obtain an optimized value.

Some results

Figure 3. Training loss versus training time for different values of the block size. Solid line: experimental and theoretical optima.

Numerical experiments allowed us to compare the optimal block size found using the bound with a numerically determined optimal value found by running Monte Carlo experiments over all possible block sizes. Determining the optimal value through an extensive search over the possible block sizes allowed a gain of 3.8% in terms of the final training loss in one of our experiments (see Fig. 3). This small gain comes at the cost of a burdensome parameter optimization that took days on an HPC cluster. Minimizing the proposed bound takes seconds.
We further experimentally determined that our results, which were derived for convex loss functions satisfying the Polyak-Lojasiewicz condition, can be extended to non-convex models. As an example (not found in the paper), we studied the problem of training a multilayer perceptron with non-linear activations according to our scheme (see Fig. 4). Using the same dataset as described in the paper, we train a 2-layers perceptron with ReLU activation for the first layer and linear activation for the second. The experiments show a similar behaviour to the convex example discussed in the main text. In particular, the derived bound predicts well the existence of an optimum value of the block size (see crosses).

Figure 4. Training loss versus block size for different overhead sizes, for an MLP with non-linear activations.

The full paper can be found here.

On the Interplay Between Coded Distributed Inference and Transmission in Mobile Edge Computing Systems

Problem

Introduced by the European Telecommunications Standards Institute (ETSI), the concept of mobile edge computing is by now established as a pillar of the 5G network architecture as an enabler of computation-intensive applications on mobile devices. As illustrated in the figure with mobile edge computing, users offload local data to edge servers connected to wireless Edge Nodes (ENs). The ENs in turn carry out the necessary computations and return the desired output to the users on the wireless downlink.

As a baseline application, assume that each user wishes to compute a linear function Wx of a local data vector x, e.g., an image taken by the user’s camera, and a network-side model matrix W. Each EN acquires the users’ local data points x through uplink transmission at runtime, while the matrix W can be pre-stored at the ENs offline. Matrix W is generally large and hence it is split across the servers of multiple ENs. After the computing phase, the ENs transmit the computed outputs back to the users in the downlink.

Linear operations of the type illustrated above are of practical importance. For example, they underlie the implementation of recommendation systems based on collaborative filtering, or similarity searches based on the cosine distance. In both cases, the user-side data is a vector x that embeds the user profile or a query, and the goal is to search through the matrix of all items on the basis of the inner products between the corresponding row of matrix W and the userdata x.

In the presence of storage redundancy, matrix W can be stored at the ENs in uncoded or coded form. In the first case, the rows of the matrix are duplicated across different ENs. As a result, the ENs can transmit any shared computed output back to the users using cooperative transmission techniques. In contrast, with coding, no cooperation transmission is possible but downlink transmission can start as soon as only a subset of ENs has completed computations. The question main is: How should one balance the robustness to straggling ENs afforded by coding with the cooperative downlink transmission advantages of uncoded repetition storage in order to reduce the overall computation-plus-communication latency?

Some Results

Our work investigates three approaches: Uncoded Storage and Computing (UC), MDS coded Storage and Computing (MC), and a proposed Hybrid Scheme (HS) that concatenates an MDS code with a repetition code. The main contribution of this research is to demonstrate that HS is able to combine the robustness to stragglers afforded by MC and the cooperative downlink transmission advantages of UC.

To illustrate this point, consider the figure where we plot overall communication-plus-computation latency as a function of the ratio γ between the communication and computation latencies. The variability in the computing times is defined by a parameter η. It is observed that as γ increases, the total latencies of both UC and MC grow linearly. When the variability in the computing times of the ENs is high, hence this happens for η=0.8, and MDS coding for the most part outperforms the UC scheme due to its robustness to stragglers. This is unless γ is large enough, in which case downlink transmission latency becomes dominant and the UC scheme can benefit from redundant computations via cooperative EN communication. In contrast, when the computing times have low variability, hence for η=8, MDS coding is uniformly outperformed by the UC scheme. The proposed hybrid coding strategy is seen to be effective in trading off computation and communication latencies by controlling the balance between robustness to stragglers and cooperative opportunities.

The full paper can be found at ieeexplore (open access: arxiv)