
Privacy in Wireless Federated Learning is Free

–when the SNR is small enough

Problem Description 

Federated Learning (FL) refers to distributed protocols that avoid the direct exchange of raw data among the participating devices while training for a common learning task. This way, FL can potentially reduce the information about the local data sets that is leaked via communication. Nevertheless, the model updates shared by the devices may still reveal information about local data. For example, a malicious server could infer the presence of an individual data sample from a learnt model through a membership inference attack, or reconstruct training data through a model inversion attack.

Differential privacy (DP) quantifies the information leaked about individual data points by measuring the sensitivity of the disclosed statistics to changes in the input data set at a single data point. DP can be guaranteed by introducing a level of uncertainty into the released model that is sufficient to mask the contribution of any individual data point. The most typical approach is to add random perturbations, e.g., Gaussian noise. This suggests that, when FL is implemented in wireless systems, the channel noise can directly act as a privacy-inducing mechanism.
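To make the idea of a Gaussian perturbation concrete, the following minimal sketch implements the classical Gaussian mechanism, which calibrates the noise standard deviation to the sensitivity of the released statistic and is valid for 0 < ε < 1. It is only an illustration: the function names and the numerical values are hypothetical, and this is not the wireless scheme discussed in this post.

```python
import math
import numpy as np

def gaussian_mechanism_std(sensitivity, epsilon, delta):
    """Noise std of the classical Gaussian mechanism (valid for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def privatize(statistic, sensitivity, epsilon, delta, rng):
    """Release the statistic after adding calibrated Gaussian noise."""
    sigma = gaussian_mechanism_std(sensitivity, epsilon, delta)
    return statistic + rng.normal(0.0, sigma, size=np.shape(statistic))

# Illustrative values only (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
grad = np.array([0.3, -0.1, 0.2])
noisy_grad = privatize(grad, sensitivity=1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

A smaller ε (a more stringent privacy requirement) calls for a larger noise standard deviation.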

Suggested Solution 

In recent work, to appear in the IEEE Journal on Selected Areas in Communications, we have designed differentially private wireless distributed gradient descent via the direct, uncoded transmission of gradients from the devices to the edge server. The channel noise is utilized as a privacy-preserving mechanism, and dynamic power control is separately optimized for orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA) protocols with the goal of minimizing the learning optimality gap under privacy and power constraints across a given number of communication blocks. One of our main results shows that, as long as the privacy constraint level, measured via DP, is below a threshold that decreases with the signal-to-noise ratio (SNR), uncoded transmission achieves privacy “for free”, i.e., without affecting the learning performance. As our analysis demonstrates, channel noise added in the first iterations tends to impact convergence less significantly than noise added in later iterations, whereas the privacy level depends on a weighted sum of the inverse noise power across the iterations. These properties, captured by compact analytical expressions derived in the paper, are leveraged for adaptive power allocation, yielding significant performance gains over standard static power allocation.
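The role of transmit power can be illustrated with a toy model of one uncoded, NOMA-style round. The sketch below assumes real-valued signals, unit channel gains, and the same transmit power at every device; these simplifications and all function names are ours, and the code does not reproduce the power allocation derived in the paper. It only shows that, after the server rescales the received signal, the effective noise on the gradient estimate is inversely proportional to the square root of the power, so a lower power means a noisier and hence more private estimate.

```python
import numpy as np

def uncoded_round(gradients, power, noise_std, rng):
    """Devices transmit sqrt(power)-scaled gradients at the same time;
    the channel sums them and adds Gaussian noise (unit gains assumed)."""
    num_devices, dim = gradients.shape
    scale = np.sqrt(power)
    received = scale * gradients.sum(axis=0) + rng.normal(0.0, noise_std, size=dim)
    # The server rescales to obtain a noisy estimate of the average gradient.
    return received / (scale * num_devices)

def effective_noise_std(power, noise_std, num_devices):
    """Std of the noise that remains on the averaged gradient estimate."""
    return noise_std / (np.sqrt(power) * num_devices)

rng = np.random.default_rng(0)
grads = rng.normal(size=(2, 4))  # 2 devices, 4-dimensional gradients
estimate = uncoded_round(grads, power=4.0, noise_std=1.0, rng=rng)
```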

Some Results 

The performance is first evaluated using a randomly generated synthetic data set. In the considered range of DP levels, as illustrated in the figure below, NOMA with either adaptive or static power allocation (PA) achieves better performance than OMA. Furthermore, the proposed adaptive PA obtains a significant performance gain over static PA under stringent DP constraints, while the performance advantage of adaptive PA decreases as the DP constraint is relaxed. The figure also shows the threshold values of the DP level beyond which privacy is obtained “for free”.

The performance is also evaluated on the MNIST data set, as summarized in the last figure. With conventional static PA, an increasing communication budget is seen to largely degrade performance. This is because more communication blocks may cause an increase in privacy loss. In contrast, adaptive PA is able to properly allocate power across the communication blocks, thereby achieving a lower training loss.

Federated Neuromorphic Computing

Training state-of-the-art Artificial Neural Network (ANN) models requires distributed computing on large mixed CPU-GPU clusters, typically over many days or weeks, at the expense of massive memory, time, and energy resources, and potentially of privacy violations. Alternative solutions for low-power machine learning on resource-constrained devices have recently been the focus of intense research. In our recently accepted paper at ICASSP 2020, we study the convergence of two such recent lines of inquiry.

On the one hand, Spiking Neural Networks (SNNs) are biologically inspired neural networks in which neurons are dynamic elements that process and communicate via sparse spiking signals over time, rather than via real numbers, enabling the native processing of time-encoded data, e.g., from DVS cameras. They can be implemented on dedicated hardware, offering an energy consumption as low as a few picojoules per spike. A more thorough introduction to probabilistic SNNs can be found in this previous blog post.

On the other hand, Federated Learning (FL) allows devices to carry out collaborative learning without exchanging local data. This makes it possible to train more effective machine learning models by benefiting from data at multiple devices with limited privacy concerns. FL requires devices to periodically exchange information about their local model parameters through a parameter server. It has become the de facto standard for training ANNs over large numbers of distributed devices.

System model

Figure 1 Federated Learning (FL) model under study: Mobile devices collaboratively train on-device SNNs based on different, heterogeneous, and generally unbalanced local data sets, by communicating through a base station (BS).

In our work, as seen in Figure 1, we consider a distributed edge computing architecture in which N mobile devices communicate through a Base Station (BS) in order to perform the collaborative training of local SNN models via FL. Each device holds a different local data set. The goal of FL is to train a common SNN-based model without direct exchange of the data from the local data sets.

FL proceeds in an iterative fashion across T global time-steps. To elaborate, at each global time-step, the devices refine their local model, based on their local datasets. Every τ iterations, they will also transmit their updated local model parameters to the BS, which will in turn compute a centralized averaged parameter and send it back to the devices. This global averaged parameter will be used at the beginning of the next iteration.
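The schedule described above, with local refinement at every global time-step and averaging every τ steps, can be sketched as follows. This is a minimal illustration and not the FL-SNN algorithm: the local SNN update is replaced by a placeholder gradient step on a least-squares loss, and all names and values are hypothetical.

```python
import numpy as np

def local_step(theta, data, lr):
    """Placeholder local update: one gradient step on a least-squares loss
    (a stand-in for the local SNN learning rule)."""
    X, y = data
    grad = X.T @ (X @ theta - y) / len(y)
    return theta - lr * grad

def federated_training(datasets, dim, T, tau, lr=0.1):
    """Run T global time-steps; every tau steps the BS averages the local models."""
    thetas = [np.zeros(dim) for _ in datasets]
    for t in range(1, T + 1):
        thetas = [local_step(th, d, lr) for th, d in zip(thetas, datasets)]
        if t % tau == 0:
            avg = np.mean(thetas, axis=0)            # computed at the BS
            thetas = [avg.copy() for _ in datasets]  # sent back to the devices
    return thetas

# Two devices with different local data sets (synthetic, for illustration).
rng = np.random.default_rng(0)
datasets = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(2)]
thetas = federated_training(datasets, dim=3, T=12, tau=3)
```

Since T is a multiple of τ in this example, the devices end the run holding the same averaged model.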

An SNN is a network of spiking neurons connected via an arbitrary directed graph, possibly with cycles (see Figure 2). SNNs process information through time, based on a local clock. At each local algorithmic time-step, each neuron receives the signals emitted by the subset of neurons connected to it through directed links, known as synapses. Neurons in the network will then output a binary signal, either ‘0’ or ‘1’. The instantaneous spiking probability of a neuron is determined by its past spiking behaviour and the previous spikes of its pre-synaptic neurons. SNNs are trained over sequences of S local algorithmic time-steps, made of D examples of length S’. In an image classification task, an example could be an image encoded as a binary signal.
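As a rough illustration of how a spiking probability can depend on past spikes, the sketch below simulates a single neuron whose spiking probability is a sigmoid of a weighted sum of the previous spikes of its pre-synaptic neurons and of its own previous spike. The one-step memory, the sigmoid nonlinearity, and all names are simplifying assumptions made for this example; the model used in the paper may differ.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def simulate_neuron(pre_spikes, w_in, w_self, bias, rng):
    """pre_spikes: binary array of shape (S, n_pre), one row per local time-step.
    Returns the neuron's binary output and its spiking probabilities."""
    S = pre_spikes.shape[0]
    out = np.zeros(S, dtype=int)
    probs = np.zeros(S)
    for t in range(1, S):
        # Potential driven by pre-synaptic spikes and by the neuron's own last spike.
        u = w_in @ pre_spikes[t - 1] + w_self * out[t - 1] + bias
        probs[t] = sigmoid(u)
        out[t] = int(rng.random() < probs[t])  # emit '1' with probability probs[t]
    return out, probs

rng = np.random.default_rng(0)
pre = rng.integers(0, 2, size=(80, 5))  # 5 pre-synaptic neurons, 80 time-steps
spikes, probs = simulate_neuron(pre, w_in=rng.normal(size=5), w_self=-1.0, bias=-0.5, rng=rng)
```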

Figure 2 Example of an internal architecture for an on-device SNN.

In FL-SNN, we cooperatively train distributed on-device SNNs thanks to Federated Learning. To that end, we derived a novel algorithm, for which the time scales involved are summarized in Figure 3. Each global algorithmic iteration t corresponds to Δs local SNN time-steps, and the total number S of SNN local algorithmic time steps and the number T of global algorithmic time steps during the training procedure are hence related as S = DS’ = T∆s.

Figure 3 Illustration of the time scales involved in the cooperative training of SNNs via FL for τ = 3 and ∆s = 4.


We consider a classification task based on the MNIST-DVS dataset. The training dataset is composed of 900 examples per class and the test dataset is composed of 100 samples per class. We consider 2 devices which have access to disjoint subsets of the training dataset. In order to validate the advantages of FL, we assume that the first device has only samples from class ‘1’ and the second only from class ‘7’. We train over D = 400 randomly selected examples from the local data sets, which results in S = DS’ = 32,000 local time-steps.
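As a quick check of the time scales, the relation S = DS’ = T∆s with the values above gives the length of each example. The number of global time-steps is shown only for ∆s = 4, the value used in the illustration of Figure 3, which we assume here purely as an example.

```latex
S' = \frac{S}{D} = \frac{32{,}000}{400} = 80,
\qquad
T = \frac{S}{\Delta s} = \frac{32{,}000}{4} = 8{,}000 \quad \text{(assuming } \Delta s = 4\text{)}.
```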

As a baseline, we consider the test loss at convergence for the separate training of the two SNNs. In Figure 4, we plot the local test loss normalized by the mentioned baseline as a function of the global algorithmic time. A larger communication period τ is seen to impair the learning capabilities of the SNNs, yielding a larger final value of the loss. In fact, for τ = 400, after a number of local iterations without communication, the individual devices are not able to make use of their data to improve performance.

Figure 4 Evolution of the mean test loss during training for different values of the communication period τ. Shaded areas represent standard deviations over 3 trials

One of the major flaws of FL is the communication load incurred by the need to regularly transmit large model parameters. To partially explore this aspect, in the paper, we consider exchanging only a subset of synaptic weights during global iterations. We refer to the text at this link for details.

Integrating Wireless Access and Edge Learning


Figure 1. Delay-constrained edge learning based on data received from a device.

The increasing number of connected devices has led to an explosion in the amounts of data being collected: smartphones, wearable devices and sensors generate data to an extent previously unseen. However, these devices often present power and computational capability constraints that do not allow them to make use of the data – for instance, to train Machine Learning (ML) models. In such circumstances, thanks to mobile edge computing, devices can rely on remote servers to perform the data processing (see Fig. 1). When the amount of data is large, or the access link slow, the amount of time required to transmit the data may be prohibitive. Given a delay constraint on the overall time available for both communication and learning, what is the joint communication-computation strategy that obtains the best performing ML model?

Pipelining communication and computation

Figure 2. Transmission and training protocol.

In a recent work to be published in IEEE Communications Letters, we propose to pipeline communication and computation with an optimized block size. We consider an Empirical Risk Minimization (ERM) problem, for which learning is carried out at the server side using Stochastic Gradient Descent (SGD). As soon as the first data block arrives at the server, training of the ML model can start. Training continues by fetching data from all the data blocks received thus far. To provide some intuition on the problem of optimizing the block size, communicating the entire data set first reduces the bias of the training process, but it may not leave sufficient time for learning. Conversely, transmitting very few samples in each block will bias the model towards the samples sent in the first blocks, as many computation rounds will happen based on these samples.
We determine an upper bound on the expected optimality gap at the end of the time limit, which gives us an indication on how far we are from an optimal model. We can then minimize this bound with regard to the communication block size to obtain an optimized value.
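The pipelined protocol can be simulated with a simple timing model, as in the sketch below. The timing parameters (per-sample transmission time, per-block overhead, time per SGD step), the least-squares loss, and all names are hypothetical choices made for illustration; the sketch does not implement the bound from the paper, but it can be used to compare the final training loss for different block sizes.

```python
import numpy as np

def pipelined_sgd(X, y, block_size, deadline, tx_time_per_sample=1.0,
                  overhead=5.0, step_time=1.0, lr=0.01, seed=0):
    """SGD at the server runs while the next data block is being transmitted."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    received, clock = 0, 0.0
    while clock < deadline:
        size = min(block_size, n - received)
        if size > 0:
            duration = overhead + size * tx_time_per_sample
        else:
            duration = deadline - clock  # all data delivered: train until the deadline
        end = min(clock + duration, deadline)
        steps = int((end - clock) / step_time) if received > 0 else 0
        for _ in range(steps):
            i = rng.integers(received)   # sample only from the data received so far
            theta -= lr * (X[i] @ theta - y[i]) * X[i]
        if size > 0 and clock + duration <= deadline:
            received += size             # the block is usable once fully received
        clock = end
    return theta

def training_loss(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
theta = pipelined_sgd(X, y, block_size=20, deadline=2000.0)
final_loss = training_loss(theta, X, y)
```

Sweeping `block_size` in this toy model is the brute-force counterpart of minimizing an analytical bound.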

Some results

Figure 3. Training loss versus training time for different values of the block size. Solid line: experimental and theoretical optima.

Numerical experiments allowed us to compare the optimal block size found using the bound with a numerically determined optimal value found by running Monte Carlo experiments over all possible block sizes. Determining the optimal value through an extensive search over the possible block sizes allowed a gain of 3.8% in terms of the final training loss in one of our experiments (see Fig. 3). This small gain comes at the cost of a burdensome parameter optimization that took days on an HPC cluster. Minimizing the proposed bound takes seconds.
We further experimentally determined that our results, which were derived for convex loss functions satisfying the Polyak-Lojasiewicz condition, can be extended to non-convex models. As an example (not found in the paper), we studied the problem of training a multilayer perceptron with non-linear activations according to our scheme (see Fig. 4). Using the same dataset as described in the paper, we train a two-layer perceptron with ReLU activation for the first layer and linear activation for the second. The experiments show a behaviour similar to that of the convex example discussed in the main text. In particular, the derived bound predicts well the existence of an optimum value of the block size (see crosses).

Figure 4. Training loss versus block size for different overhead sizes, for an MLP with non-linear activations.

The full paper can be found here.

On the Interplay Between Coded Distributed Inference and Transmission in Mobile Edge Computing Systems


Introduced by the European Telecommunications Standards Institute (ETSI), the concept of mobile edge computing is by now established as a pillar of the 5G network architecture and as an enabler of computation-intensive applications on mobile devices. As illustrated in the figure, with mobile edge computing, users offload local data to edge servers connected to wireless Edge Nodes (ENs). The ENs in turn carry out the necessary computations and return the desired output to the users on the wireless downlink.

As a baseline application, assume that each user wishes to compute a linear function Wx of a local data vector x, e.g., an image taken by the user’s camera, and a network-side model matrix W. Each EN acquires the users’ local data points x through uplink transmission at runtime, while the matrix W can be pre-stored at the ENs offline. Matrix W is generally large and hence it is split across the servers of multiple ENs. After the computing phase, the ENs transmit the computed outputs back to the users in the downlink.
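The splitting of W across ENs can be sketched as follows, assuming a simple partition of the rows of W without redundancy. The sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(12, 4))  # model matrix, pre-stored offline
x = rng.normal(size=4)        # user data vector, received on the uplink at runtime

num_ens = 3
row_blocks = np.array_split(W, num_ens, axis=0)  # each EN stores one block of rows
partial_outputs = [Wb @ x for Wb in row_blocks]  # computed in parallel at the ENs
result = np.concatenate(partial_outputs)         # assembled from the downlink signals
```

Without redundancy, the user must wait for every EN, so a single slow EN delays the whole computation.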

Linear operations of the type illustrated above are of practical importance. For example, they underlie the implementation of recommendation systems based on collaborative filtering, or similarity searches based on the cosine distance. In both cases, the user-side data is a vector x that embeds the user profile or a query, and the goal is to search through the matrix of all items on the basis of the inner products between the rows of matrix W and the user data x.

In the presence of storage redundancy, matrix W can be stored at the ENs in uncoded or coded form. In the first case, the rows of the matrix are duplicated across different ENs. As a result, the ENs can transmit any shared computed output back to the users using cooperative transmission techniques. In contrast, with coding, no cooperative transmission is possible, but downlink transmission can start as soon as a subset of ENs has completed its computations. The main question is: How should one balance the robustness to straggling ENs afforded by coding with the cooperative downlink transmission advantages of uncoded repetition storage in order to reduce the overall computation-plus-communication latency?
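The robustness to stragglers offered by coded storage can be illustrated with a small example in which W is split into k row blocks and encoded into n coded blocks, one per EN, so that the outputs of any k ENs suffice to recover Wx. The sketch uses a real-valued Vandermonde generator matrix as a stand-in for an MDS code; this choice, the sizes, and the function names are assumptions made for illustration and not the construction used in the paper.

```python
import numpy as np

def encode(W, k, n):
    """Split W into k row blocks and form n coded blocks, one per EN."""
    blocks = np.stack(np.split(W, k, axis=0))    # shape (k, m/k, d)
    nodes = np.arange(1, n + 1, dtype=float)
    G = np.vander(nodes, k, increasing=True)     # (n, k); any k rows are invertible
    coded = np.einsum("ij,jrd->ird", G, blocks)  # shape (n, m/k, d)
    return G, coded

def decode(G, outputs, finished):
    """Recover Wx from the coded outputs of the k ENs listed in `finished`."""
    Gs = G[finished]                             # (k, k) sub-generator
    Y = np.stack([outputs[i] for i in finished]) # (k, m/k)
    return np.linalg.solve(Gs, Y).reshape(-1)

rng = np.random.default_rng(0)
W, x = rng.normal(size=(12, 4)), rng.normal(size=4)
G, coded = encode(W, k=3, n=5)
outputs = {i: coded[i] @ x for i in range(5)}    # each EN computes its coded product
recovered = decode(G, outputs, finished=[0, 2, 4])  # ENs 1 and 3 are stragglers
```

In this example the user recovers Wx without waiting for ENs 1 and 3, at the price that each EN holds a different coded block, so the ENs share no computed output to transmit cooperatively.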

Some Results

Our work investigates three approaches: Uncoded Storage and Computing (UC), MDS coded Storage and Computing (MC), and a proposed Hybrid Scheme (HS) that concatenates an MDS code with a repetition code. The main contribution of this research is to demonstrate that HS is able to combine the robustness to stragglers afforded by MC and the cooperative downlink transmission advantages of UC.

To illustrate this point, consider the figure, where we plot the overall communication-plus-computation latency as a function of the ratio γ between the communication and computation latencies. The variability in the computing times is defined by a parameter η. It is observed that, as γ increases, the total latencies of both UC and MC grow linearly. When the variability in the computing times of the ENs is high, i.e., for η=0.8, MDS coding for the most part outperforms the UC scheme due to its robustness to stragglers. The exception is when γ is large enough, in which case the downlink transmission latency becomes dominant and the UC scheme can benefit from redundant computations via cooperative EN communication. In contrast, when the computing times have low variability, i.e., for η=8, MDS coding is uniformly outperformed by the UC scheme. The proposed hybrid coding strategy is seen to be effective in trading off computation and communication latencies by controlling the balance between robustness to stragglers and cooperative opportunities.

The full paper can be found at IEEE Xplore (open access: arXiv).