Privacy in Wireless Federated Learning is Free

–when the SNR is small enough

Problem Description 

Federated Learning (FL) refers to distributed protocols that avoid direct raw data exchange among the participating devices while training for a common learning task. This way, FL can potentially reduce the information on the local data sets that is leaked via communications. Nevertheless, the model updates shared by the devices may still reveal information about local data. For example, a malicious server could potentially infer the presence of an individual data sample from a learnt model via a membership inference attack or a model inversion attack.

Differential privacy (DP) quantifies the information leaked about individual data points by measuring the sensitivity of the disclosed statistics to changes in the input data set at a single data point. DP can be guaranteed by introducing a level of uncertainty into the released model that is sufficient to mask the contribution of any individual data point. The most typical approach is to add random perturbations, e.g., Gaussian noise. This suggests that, when FL is implemented in wireless systems, the channel noise can directly act as a privacy-inducing mechanism.
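
As a rough illustration of the Gaussian perturbation approach, the sketch below applies the classical Gaussian mechanism to an averaged, clipped gradient. The function names, the clipping bound, and the (ε, δ) values are illustrative assumptions. The noise calibration σ = Δ·sqrt(2 ln(1.25/δ))/ε is the standard textbook bound (stated for ε < 1) and is not taken from the work discussed in this post.

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise std of the classical Gaussian mechanism (textbook bound, 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("this bound is stated for 0 < epsilon < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def clip(v, max_norm):
    """Scale v down so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(v)
    return v if norm <= max_norm else v * (max_norm / norm)

def private_mean_gradient(per_sample_grads, max_norm, epsilon, delta, rng):
    clipped = np.stack([clip(g, max_norm) for g in per_sample_grads])
    n = len(clipped)
    # Replacing one sample changes the mean by at most 2 * max_norm / n in L2 norm.
    sensitivity = 2.0 * max_norm / n
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    noise = rng.normal(0.0, sigma, size=clipped.shape[1])
    return clipped.mean(axis=0) + noise, sigma

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 5))  # hypothetical per-sample gradients
noisy_mean, sigma = private_mean_gradient(
    grads, max_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

In a wireless setting, the added `noise` term would be replaced by noise that the channel contributes anyway, which is the observation that motivates the rest of this post.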

Suggested Solution 

In recent work, to appear in the IEEE Journal on Selected Areas in Communications, we have designed differentially private wireless distributed gradient descent via the direct, uncoded transmission of gradients from the devices to the edge server. The channel noise is utilized as a privacy-preserving mechanism, and dynamic power control is separately optimized for orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA) protocols, with the goal of minimizing the learning optimality gap under privacy and power constraints across a given number of communication blocks. One of our main results shows that, as long as the privacy constraint level, measured via DP, is below a threshold that decreases with the signal-to-noise ratio (SNR), uncoded transmission achieves privacy “for free”, i.e., without affecting the learning performance. As our analysis demonstrates, channel noise added in the first iterations tends to impact convergence less significantly than noise added in later iterations, whereas the privacy level depends on a weighted sum of the inverse noise power across the iterations. These properties, captured by compact analytical expressions derived in the paper, are leveraged for adaptive power allocation, yielding significant performance gains over standard static power allocation.
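
To make the dependence of privacy on the noise across iterations more concrete, the sketch below tracks an accumulated privacy loss when each iteration acts as a Gaussian mechanism whose effective noise is set by a transmit scaling factor. It uses zero-concentrated DP (zCDP) accounting as a stand-in: a Gaussian mechanism with L2 sensitivity Δ and noise standard deviation σ contributes Δ²/(2σ²), and contributions add up across iterations. This is not a reproduction of the analytical expressions derived in the paper, and the names `alphas`, `sensitivity`, and `noise_std` are hypothetical.

```python
import math

def zcdp_rho(alphas, sensitivity, noise_std):
    """Total zCDP parameter when iteration t sends alpha_t * gradient over an
    additive Gaussian channel with noise standard deviation noise_std.
    After rescaling by 1 / alpha_t, the effective noise std is noise_std / alpha_t."""
    return sum((sensitivity * a) ** 2 / (2.0 * noise_std ** 2) for a in alphas)

def rho_to_epsilon(rho, delta):
    """Standard conversion from rho-zCDP to (epsilon, delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

T = 10
static = [1.0] * T
# Same total energy sum(alpha_t^2) = T, but shifted toward later iterations
increasing = [math.sqrt(2.0 * (t + 1) / (T + 1)) for t in range(T)]

for name, alphas in (("static", static), ("increasing", increasing)):
    rho = zcdp_rho(alphas, sensitivity=0.1, noise_std=1.0)
    print(name, round(rho, 6), round(rho_to_epsilon(rho, delta=1e-5), 4))
```

Under these assumptions, both schedules yield the same privacy loss, since only the sum of the inverse effective noise powers matters. This leaves room to shape the power profile across iterations for the sake of convergence, which is the intuition behind adaptive power allocation.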

Some Results 

The performance is first evaluated using a randomly generated synthetic data set. In the considered range of DP levels, as illustrated in the figure below, NOMA with either adaptive or static power allocation (PA) achieves better performance than OMA. Furthermore, the proposed adaptive PA obtains a significant performance gain over static PA under stringent DP constraints, while the performance advantage of adaptive PA decreases as the DP constraint is relaxed. The figure also shows the threshold values of the DP level beyond which privacy is obtained “for free”.

The performance is also evaluated on the MNIST data set, as summarized in the last figure. With conventional static PA, an increasing communication budget is seen to largely degrade performance. This is because more communication blocks may cause an increase in privacy loss. In contrast, adaptive PA is able to properly allocate power across the communication blocks, thereby achieving a lower training loss.

Sensing, Communicating, and Classifying with Spikes

Or how to remotely classify data with 80% accuracy and zero latency at Signal-to-Noise Ratios (SNRs) as low as −8 dB.

The development of Internet of Things (IoT) systems, with applications ranging from personal healthcare and wearable devices to drone-based monitoring, is driving research efforts on edge-based machine learning. In such systems, data may be collected by battery-powered sensors and processed at a remote device, which may itself be energy-constrained. Standard hardware implementations pose major energy and latency limitations for such applications.
In our new paper, recently accepted at Asilomar 2020, we investigate a novel solution based on neuromorphic sensors, processors, and transmitter/receivers. In neuromorphic sensors, spikes (i.e., binary signals) mark the occurrence of a relevant event, e.g., a significant change in a pixel for a neuromorphic camera. Extremely low energy is consumed when the monitored scene is idle. In neuromorphic processors, known as Spiking Neural Networks (SNNs), spiking signals are processed via dynamic neural models for the detection of spatio-temporal patterns. SNNs have recently emerged as a biologically plausible alternative to ANNs, with significant benefits in terms of energy efficiency and latency. Finally, for communications, pulses, or spikes, can encode information for radio signalling via low-power Impulse Radio (IR). Commercial products are available for all these blocks, including DVS cameras, Intel’s Loihi SNN chip, and transceivers implementing the IEEE 802.15.4z IR standard.

System model
As seen in Fig. 1, the proposed system integrates neuromorphic sensing and processing with IR transmission, and it carries out Joint Source-Channel Coding (JSCC), as it performs source and channel coding in a single step. The signal sensed by the neuromorphic sensor, e.g., a DVS camera, is encoded as a vector of binary spiking signals, and processed by an encoding SNN that performs source and channel coding. The SNN implements a probabilistic mapping that is defined by its parameter vector. The output of the encoding SNN is modulated using parallel IR transmissions, with each spike encoded by an IR waveform such as a Gaussian monopulse. The channel is modeled as a frequency-flat Gaussian channel. Finally, the received signals are classified via a decoding SNN whose output can be interpreted as a class index using standard methods for SNN-based classification. For example, rate decoding predicts a class by selecting the neuron in the output layer with the largest number of spikes.
The proposed system in Fig. 1, termed NeuroJSCC, is trained by maximizing the log-likelihood that the decoding SNN outputs desired spiking signals in response to a given input. Details on the training procedure, and the resulting algorithm, can be found in the preprint.
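
The rate decoding rule mentioned above can be sketched in a few lines. The array layout (one row per output neuron, one column per time-step) and the function name are assumptions made for illustration.

```python
import numpy as np

def rate_decode(output_spikes):
    """output_spikes: binary array of shape (num_classes, num_time_steps).
    Returns the index of the output neuron that spiked most often."""
    counts = np.asarray(output_spikes).sum(axis=1)
    return int(np.argmax(counts))  # ties resolve to the lowest index

spikes = np.array([
    [0, 1, 0, 0, 1],   # neuron 0: 2 spikes
    [1, 1, 0, 1, 1],   # neuron 1: 4 spikes
    [0, 0, 0, 1, 0],   # neuron 2: 1 spike
])
predicted_class = rate_decode(spikes)  # neuron 1 has the most spikes
```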


To illustrate the advantage of the system, we focus on an example consisting of the remote detection of handwritten digits recorded by a neuromorphic camera.
We compare NeuroJSCC to two benchmark schemes:
1) Uncoded transmission: The observation is directly transmitted through the Gaussian channel using On-Off Keying (OOK), and classified using an SNN.
2) Separate Source-Channel Coding (SSCC): The encoder applies state-of-the-art quantization based on the Vector Quantization Variational Autoencoder (VQ-VAE) scheme, followed by LDPC encoding. The spiking signal is encoded as frames, and the scheme is applied separately to each one of them. At the decoder side, frames are decoded using the Belief Propagation algorithm, decompressed using VQ-VAE decoding, and then classified. We consider two different classifiers, namely traditional ANN and SNN.
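
To give a feel for why the uncoded benchmark is sensitive to noise, the sketch below sends binary spikes with OOK over an additive Gaussian channel and detects them with a fixed threshold. The SNR definition (pulse amplitude squared over noise variance), the threshold detector, and all names are assumptions for illustration; the actual benchmark classifies the received signal with an SNN rather than detecting individual bits.

```python
import numpy as np

def ook_transmit(bits, snr_db, rng):
    """Send binary spikes with on-off keying over an additive Gaussian channel.
    SNR is defined here (an assumption) as amplitude^2 / noise variance."""
    bits = np.asarray(bits, dtype=float)
    amplitude = 1.0
    noise_std = amplitude / np.sqrt(10.0 ** (snr_db / 10.0))
    received = amplitude * bits + rng.normal(0.0, noise_std, size=bits.shape)
    # Threshold halfway between the "off" and "on" signal levels
    return (received > amplitude / 2.0).astype(int)

rng = np.random.default_rng(0)
spikes = rng.integers(0, 2, size=10_000)
for snr_db in (-8, 0, 10, 20):
    detected = ook_transmit(spikes, snr_db, rng)
    print(f"SNR {snr_db:>3} dB: bit error rate {np.mean(detected != spikes):.4f}")
```
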
In Fig. 3, we evaluate the test accuracies at convergence obtained for different levels of SNR and the different schemes. The accuracy of Uncoded transmission drops sharply at sufficiently low SNR levels. In contrast, NeuroJSCC maintains a test accuracy of 80%, even at an SNR level as low as −8 dB. SSCC with an SNN as classifier suffers the most from the degradation of the SNR. Using an ANN proves more robust to low SNR levels, since an ANN can benefit from the non-binary outputs of the VQ-VAE decoder without further loss of information due to binary quantization.
We refer to the main text for further experiments and analysis.
Code will be released shortly on our Github page.

Address-Event Variable Length Compression for Time-Encoded Data

Illustration of the problem of variable-length address-event compression with 3 traces: At each event occurrence time T_n, the encoder outputs a variable-length packet describing the set of ‘addresses’ of the traces that produced an event.

Problem Description

The information age has relied on digital information processing: audio, video, and text are represented as strings of bits. Biological brains, however, process information in the timing of events, also known as spikes. Time-encoded information underlies many data types of increasing practical importance, such as social network update times, communication network logs, retweet traces, wireless activity sensors, neuromorphic sensors, and synaptic traces from in-brain measurements for brain-computer interfaces. For example, neuromorphic cameras encode information by producing a spike in response to changes in the sensed environment; and neurons in Spiking Neural Networks (SNNs) compute and communicate via spiking traces in a way that mimics the operation of biological brains.

When time-encoded data is processed at a site that is remote from the location in which the data is produced, the occurrence of events needs to be encoded and transmitted in a timely fashion. This is particularly relevant in SNN chips, in which neurons are partitioned into several cores and spikes produced by neurons in a given core need to be conveyed to the recipient neurons in a separate core in order to enable correct processing. Spikes in SNNs are encoded into packets through the Address Event Representation (AER) protocol. With AER, a packet encoding the occurrence of one or more events is produced at the same time at which the events take place. Thus, the spike timing information is directly carried by the reception of the packet. Therefore, assuming that the packet is detected by the receiver with negligible delay, the packet payload only needs to contain the information about the identity, also referred to as the “addresses”, of the “spiking” traces.
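
The packet structure described above can be sketched as follows. This is a simplified illustration, not the AER implementation of any particular chip: the function name and the array layout (one row per trace, one column per time-step) are assumptions.

```python
import numpy as np

def aer_packets(traces):
    """traces: binary array of shape (num_traces, num_time_steps).
    Yields (time, addresses) only at times where at least one trace spikes,
    so the payload carries identities while timing is implied by the packet."""
    traces = np.asarray(traces)
    for t in range(traces.shape[1]):
        addresses = np.flatnonzero(traces[:, t])
        if addresses.size > 0:
            yield t, addresses.tolist()

traces = np.array([
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
])
packets = list(aer_packets(traces))
```

No packet is produced at the idle time-steps 1 and 4, which is why the payload only needs to describe which traces spiked.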

A close-up shot of an Intel Nahuku board, which contains 8 to 32 Intel Loihi neuromorphic chips. (Credit: Tim Herman/Intel Corporation)

In our recent paper accepted for presentation at the IEEE International Symposium on Information Theory and Applications (ISITA 2020), we study the problem of compressing packets generated by an AER-like protocol for generic time-encoded data. This could help alleviate communication bottlenecks in systems that rely on time-encoded data processing.

Suggested Solution

The key idea is that time-encoded traces are characterized by strong correlations both over time and across different traces. These intra- and inter-trace correlations can be harnessed to compress, using variable-length codes, the addresses of the event-producing traces at a given time.

To this end, we first model time-encoded data with multiple traces as a discrete-time multivariate Hawkes process that captures the inter- and intra-trace correlations. This allows us to formulate the address-event compression problem in terms of the parameters of the discrete-time Hawkes process. Finally, the variable-length compression of packets is achieved through entropy coding via conditional codebooks. The details of the problem modeling can be found here.
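
The idea of entropy coding with conditional codebooks can be sketched as follows: for each history state, the model provides spiking probabilities for the traces, and a separate variable-length code is built for the resulting distribution over address sets. The sketch uses Huffman coding and assumes, for simplicity, that traces are conditionally independent given the history. Both choices, as well as the probability values and function names, are illustrative assumptions rather than the construction used in the paper.

```python
import heapq
import itertools
import math

def address_set_distribution(spike_probs):
    """Distribution over non-empty address sets, given per-trace spike
    probabilities and conditioned on at least one event occurring."""
    probs = {}
    for pattern in itertools.product((0, 1), repeat=len(spike_probs)):
        if not any(pattern):
            continue  # no packet is sent when no trace spikes
        probs[pattern] = math.prod(
            p if b else 1.0 - p for p, b in zip(spike_probs, pattern))
    total = sum(probs.values())
    return {s: q / total for s, q in probs.items()}

def huffman_code(dist):
    """Binary Huffman code for a dict mapping symbol -> probability."""
    heap = [(q, i, {s: ""}) for i, (s, q) in enumerate(dist.items())]
    heapq.heapify(heap)
    counter = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        q0, _, c0 = heapq.heappop(heap)
        q1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (q0 + q1, counter, merged))
        counter += 1
    return heap[0][2]

def expected_length(dist, code):
    return sum(q * len(code[s]) for s, q in dist.items())

# Two hypothetical history states, each with its own codebook
for label, spike_probs in (("quiet history", (0.05, 0.05, 0.05)),
                           ("trace 0 recently active", (0.6, 0.1, 0.05))):
    dist = address_set_distribution(spike_probs)
    code = huffman_code(dist)
    print(label, round(expected_length(dist, code), 3), "bits per packet")
```

Switching codebooks according to the history is what allows the code to exploit correlations over time and across traces.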

Experiment on Real-World Dataset

We implemented the proposed variable-length scheme on a real-world retweet dataset. The dataset consists of retweet sequences, each corresponding to the retweets of an original tweet. Each retweet event in a sequence is marked with the type of user group (‘small’, ‘medium’, or ‘large’) and with the time (quantized to an integer) elapsed since the original tweet. Accordingly, each sequence can be formatted into 3 discrete-time traces. For our experiments, we sampled 2100 sequences from the data set, with 2000 sequences used for training and 100 for testing. The training set is used to fit the parameters of the discrete-time Hawkes process. After training, the test sequences are used to evaluate the average number of bits per event, using the trained variable-length code. We study three scenarios: (i) compression with both inter- and intra-trace correlations; (ii) “compression with an i.i.d. model”, which assumes the traces to be independent; and (iii) “compression with intra-trace correlation”, which assumes independent traces that are allowed to correlate across time. As seen in the figure below, compared to a no-compression scheme that requires approximately 2.8 bits per event, we find that compression with an i.i.d. model requires only 1.22 bits per event, a gain of 57% over no-compression. Further reductions in rates result from compression schemes that assume intra-trace correlation across time, particularly if accounting also for inter-trace correlations.
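
The quoted gain can be checked with a line of arithmetic. The sketch assumes that the no-compression baseline corresponds to log2(7) ≈ 2.8 bits, i.e., one fixed-length label per non-empty subset of the 3 traces. This interpretation is an assumption and is not stated in the text.

```python
import math

num_traces = 3
# Assumption: without compression, each packet indexes one of the
# 2**3 - 1 = 7 non-empty subsets of traces.
bits_uncompressed = math.log2(2 ** num_traces - 1)
bits_iid = 1.22  # value reported in the text for the i.i.d. model
gain = 1.0 - bits_iid / bits_uncompressed
print(f"{bits_uncompressed:.2f} bits -> {bits_iid} bits, gain {gain:.1%}")
```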


Information-Centric Grant-Free Access for IoT Fog Networks: Edge vs Cloud Detection and Learning


With the advent of 5G, cellular systems are expected to play an increasing role in enabling the Internet of Things (IoT). This is partly due to the introduction of NarrowBand IoT (NB-IoT), a cellular-based radio technology allowing low-cost and long-battery-life connections, in addition to other IoT protocols that operate in the unlicensed band, such as LoRa. However, these protocols allow for a successful transmission only when a radio resource is used by a single IoT device. Therefore, generally, the amount of resources needed scales with the number of active devices. This poses a serious challenge in enabling massive connectivity in future cellular systems. In our recent IEEE Transactions on Wireless Communications paper, we tackle this issue.

Figure 1: A Fog-Radio architecture where processing information from IoT devices, denoted by the theta symbol,  can take place either at the Cloud or the Edge Node.

Suggested Solution

In our new paper, we propose an information-centric radio access technique in which IoT devices making (roughly) the same observation of a given monitored quantity, e.g., temperature, transmit using the same radio resource, i.e., in a non-orthogonal fashion. Thus, the number of radio resources needed scales with the number of possible relevant values observed, e.g., high or low temperature, and not with the number of devices.
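
The scaling argument can be illustrated with a toy model in which every device sends a unit pulse on the resource indexed by its observed value, so that the receiver sees one superposed signal per value. The unit-amplitude pulses, the real-valued Gaussian noise, and the majority-style detection are simplifying assumptions, not the signal model or detector from the paper.

```python
import numpy as np

def information_centric_access(observations, num_values, noise_std, rng):
    """Each device transmits a unit pulse on the resource indexed by the value
    it observed, so devices with equal observations superpose on one resource."""
    received = np.zeros(num_values)
    for value in observations:
        received[value] += 1.0  # non-orthogonal superposition
    received += rng.normal(0.0, noise_std, size=num_values)
    return received

rng = np.random.default_rng(0)
# 50 devices, but only two possible values: 0 = "low", 1 = "high" temperature
observations = rng.choice([0, 1], size=50, p=[0.2, 0.8])
received = information_centric_access(
    observations, num_values=2, noise_std=1.0, rng=rng)
detected_majority = int(np.argmax(received))
```

Here 50 devices share 2 radio resources, and adding more devices would strengthen the superposed signals rather than require more resources.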

Cellular networks are evolving toward Fog-Radio architectures, as shown in Figure 1. In these systems, instead of the entire processing happening at the edge node, radio access related functionalities can be distributed between the cloud and the edge. We propose that detection in the IoT system under study be implemented at either cloud or edge depending on backhaul conditions and on the statistics of the observations.

Some Results

One of the important findings of this work is that cloud detection is able to leverage inter-cell interference in order to improve detection performance, as shown in the figure below. This is mainly due to the fact that devices transmitting the same values in different cells are non-orthogonally superposed and thus, the cloud can detect these values with higher confidence.

More details and results can be found in the complete version of the paper here.

Coding and Lazy Aggregation for Robust and Efficient Distributed Learning


Figure 1: Parameter Server (PS) computing architecture.

Problem Overview:

In order to scale machine learning so as to cope with large volumes of input data, distributed implementations of gradient-based methods, e.g., Gradient Descent (GD), that leverage the parallelism of first-order optimization techniques are commonly adopted. To run GD, as illustrated in Fig. 1, multiple parallel workers perform computations of the gradients, and the Parameter Server (PS) iteratively aggregates the computed gradients and communicates the updated parameter back to the workers. In the process, the PS computing architecture is subject to two key impairments. First, the potentially heavy tail of the distribution of the computing times at the workers can cause significant slowdowns in wall-clock run-time per iteration due to straggling workers. Second, the communication overhead resulting from intensive two-way communications between the PS and the workers may require significant networking resources to be available in order not to dominate the overall run-time.

To jointly address these impairments, in a recent work just published in IEEE Transactions on Neural Networks and Learning Systems, we study the performance of coding and lazy aggregation techniques for the PS architecture in terms of wall-clock run-time complexity, communication complexity, and computation complexity.

Main Results:

To explore the trade-off among wall-clock time, communication, and computation requirements, we provide a unified analysis of the techniques of gradient coding (GC), worker grouping, and adaptive worker selection, also known as Lazily Aggregated Gradient (LAG), whose relative merits are summarized in Table I. Both GC and grouping are full-gradient approaches that aim at increasing robustness to stragglers by leveraging storage and computation redundancy. Thanks to coding, with GC, only a given number of workers, dependent on the computing redundancy, need to finish their computations and send their encoded computed gradients to the PS at each iteration in order for the PS to retrieve the gradient. Grouping applies data duplication and coding to groups of workers. In contrast, LAG is an approximate gradient descent scheme that judiciously selects a subset of active workers at each iteration in order to reduce communication and computation loads. By integrating all the techniques, we propose a novel strategy, named Lazily Aggregated Gradient Coding (LAGC), that aims at exploring the trade-off between the robustness to stragglers of GC and the computation and communication efficiency of LAG by generalizing both schemes.
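
The way GC tolerates stragglers can be seen in a small example that is common in the gradient coding literature: 3 workers, 3 data partitions, each partition stored at 2 workers, and any 2 finished workers sufficing to recover the full gradient. The encoding matrix `B` and the least-squares decoder below are an illustration under these assumptions, not the configuration evaluated in the paper.

```python
import numpy as np

# Encoding matrix for 3 workers and 3 data partitions, tolerating 1 straggler.
# Row i lists the coefficients worker i applies to the partial gradients it computes.
B = np.array([
    [0.5, 1.0,  0.0],   # worker 0 holds partitions 0 and 1
    [0.0, 1.0, -1.0],   # worker 1 holds partitions 1 and 2
    [0.5, 0.0,  1.0],   # worker 2 holds partitions 0 and 2
])

def decode(coded, finished):
    """Recover the full gradient from the coded messages of the finished workers."""
    # Find combining weights a such that a @ B[finished] = [1, 1, 1]
    a, *_ = np.linalg.lstsq(B[finished].T, np.ones(B.shape[1]), rcond=None)
    return a @ coded[finished]

rng = np.random.default_rng(0)
partial_grads = rng.normal(size=(3, 4))   # one gradient per data partition
coded = B @ partial_grads                 # message computed by each worker
full_gradient = partial_grads.sum(axis=0)

for finished in ([0, 1], [0, 2], [1, 2]):
    assert np.allclose(decode(coded, finished), full_gradient)
```

The price for this robustness is redundancy: every worker computes two partial gradients instead of one.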

Figure 2: Time, communication, and computation complexity measures under the Pareto distribution.

Figure 3: Time, communication, and computation complexity measures under the exponential distribution.

As a special case, we also introduce a scheme that only uses grouping and adaptive selection, which is referred to as G-LAG. For illustration, we consider a linear regression model under two representative distributions, i.e., the Pareto distribution and the exponential distribution, which account for heavy- and light-tailed distributions of the computing times at the workers, respectively. The time, communication, and computation complexities of the existing strategies, namely GD, GC, and LAG, and of the proposed strategies, i.e., LAGC and G-LAG, are shown in Fig. 2 and Fig. 3. It can be seen that both of the proposed schemes, LAGC and G-LAG, are capable of combining the benefits of gradient coding and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection (see Table I). Furthermore, G-LAG provides the best wall-clock time and communication performance, while maintaining a low computational cost.
The full paper can be found here.


Federated Neuromorphic Computing

Training state-of-the-art Artificial Neural Network (ANN) models requires distributed computing on large mixed CPU-GPU clusters, typically over many days or weeks, at the expense of massive memory, time, and energy resources, and potentially of privacy violations. Alternative solutions for low-power machine learning on resource-constrained devices have recently been the focus of intense research. In our recently accepted paper at ICASSP 2020, we study the convergence of two such recent lines of inquiry.

On the one hand, Spiking Neural Networks (SNNs) are biologically inspired neural networks in which neurons are dynamic elements that process and communicate via sparse spiking signals over time, rather than via real numbers, enabling the native processing of time-encoded data, e.g., from DVS cameras. They can be implemented on dedicated hardware, offering energy consumption as low as a few picojoules per spike. A more thorough introduction to probabilistic SNNs can be found in this previous blog post.

On the other hand, Federated Learning (FL) allows devices to carry out collaborative learning without exchanging local data. This makes it possible to train more effective machine learning models by benefiting from data at multiple devices with limited privacy concerns. FL requires devices to periodically exchange information about their local model parameters through a parameter server. It has become the de facto standard for training ANNs over large numbers of distributed devices.

System model

Figure 1 Federated Learning (FL) model under study: Mobile devices collaboratively train on-device SNNs based on different, heterogeneous, and generally unbalanced local data sets, by communicating through a base station (BS).

In our work, as seen in Figure 1, we consider a distributed edge computing architecture in which N mobile devices communicate through a Base Station (BS) in order to perform the collaborative training of local SNN models via FL. Each device holds a different local data set. The goal of FL is to train a common SNN-based model without direct exchange of the data from the local data sets.

FL proceeds in an iterative fashion across T global time-steps. To elaborate, at each global time-step, the devices refine their local model, based on their local datasets. Every τ iterations, they will also transmit their updated local model parameters to the BS, which will in turn compute a centralized averaged parameter and send it back to the devices. This global averaged parameter will be used at the beginning of the next iteration.
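
The protocol above can be sketched with a generic model in place of the SNN. The example below uses local gradient steps on a least-squares loss with parameter averaging every τ global time-steps. The loss, the learning rate, and the data are illustrative assumptions, and the sketch does not reproduce the SNN learning rule derived in the paper.

```python
import numpy as np

def federated_training(local_data, tau, num_steps, lr=0.1):
    """Local gradient steps on a least-squares loss, with parameter
    averaging at the base station every tau global time-steps."""
    dim = local_data[0][0].shape[1]
    params = [np.zeros(dim) for _ in local_data]
    for t in range(1, num_steps + 1):
        for k, (X, y) in enumerate(local_data):
            grad = X.T @ (X @ params[k] - y) / len(y)
            params[k] = params[k] - lr * grad
        if t % tau == 0:
            average = np.mean(params, axis=0)          # computed at the BS
            params = [average.copy() for _ in params]  # sent back to the devices
    return params

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
local_data = []
for _ in range(2):
    X = rng.normal(size=(50, 2))
    local_data.append((X, X @ w_true))  # noiseless labels, for simplicity
params = federated_training(local_data, tau=5, num_steps=200)
```

A larger `tau` reduces the number of exchanges with the BS at the cost of letting the local models drift apart between averaging steps.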

An SNN is a network of spiking neurons connected via an arbitrary directed graph, possibly with cycles (see Figure 2). SNNs process information through time, based on a local clock. At each local algorithmic time-step, each neuron receives the signals emitted by the subset of neurons connected to it through directed links, known as synapses. Neurons in the network will then output a binary signal, either ‘0’ or ‘1’. The instantaneous spiking probability of a neuron is determined by its past spiking behaviour and the previous spikes of its pre-synaptic neurons. SNNs are trained over sequences of S local algorithmic time-steps, made of D examples of length S’. In an image classification task, an example could be an image encoded as a binary signal.

Figure 2 Example of an internal architecture for an on-device SNN.

In FL-SNN, we cooperatively train distributed on-device SNNs via Federated Learning. To that end, we derived a novel algorithm, for which the time scales involved are summarized in Figure 3. Each global algorithmic iteration t corresponds to Δs local SNN time-steps, and the total number S of local SNN algorithmic time-steps and the number T of global algorithmic time-steps during the training procedure are hence related as S = DS’ = TΔs.

Figure 3 Illustration of the time scales involved in the cooperative training of SNNs via FL for τ = 3 and ∆s = 4.


We consider a classification task based on the MNIST-DVS dataset. The training dataset is composed of 900 examples per class and the test dataset is composed of 100 samples per class. We consider 2 devices which have access to disjoint subsets of the training dataset. In order to validate the advantages of FL, we assume that the first device has only samples from class ‘1’ and the second only from class ‘7’. We train over D = 400 randomly selected examples from the local data sets, which results in S = DS’ = 32,000 local time-steps.

As a baseline, we consider the test loss at convergence for the separate training of the two SNNs. In Figure 4, we plot the local test loss normalized by the mentioned baseline as a function of the global algorithmic time. A larger communication period τ is seen to impair the learning capabilities of the SNNs, yielding a larger final value of the loss. In fact, for τ = 400, after a number of local iterations without communication, the individual devices are not able to make use of their data to improve performance.

Figure 4 Evolution of the mean test loss during training for different values of the communication period τ. Shaded areas represent standard deviations over 3 trials.

One of the major flaws of FL is the communication load incurred by the need to regularly transmit large model parameters. To partially explore this aspect, in the paper, we consider exchanging only a subset of synaptic weights during global iterations. We refer to the text at this link for details.

Using Machine Learning to Measure Intrinsic and Synergistic Information Flows


Quantifying the causal flow of information between different components of a system is an important task for many natural and engineered systems, such as neural, genetic, transportation, and social networks. A well-established metric of the information flow between two time sequences X and Y that has been widely applied for this purpose is the information-theoretic measure of Transfer Entropy (TE). The TE from X to Y equals the mutual information between the past of sequence X and the current value of Y at time t when conditioning on the past of Y. However, the TE has limitations as a measure of intrinsic, or exclusive, information flow from sequence X to sequence Y. In fact, as pointed out in this paper, the TE captures not only the amount of information on the current value of Y that is contained in the past of X in addition to that already present in the past of Y, but also the information about the current value of Y that is obtained only when combining the past of both X and Y. Only the first type of information flow may be defined as intrinsic, while the second can be thought of as a synergistic flow of information involving both sequences.

In the same paper, the authors propose to decompose the TE as the sum of an Intrinsic TE (ITE) and a Synergistic TE (STE), and introduce a measure of the ITE based on cryptography. The idea is to measure the ITE as the size (in bits) of a secret key that can be generated by two parties, one holding the past of sequence X and the other the current value of Y, via public communication, when the adversary has the past of sequence Y.

The computation of the ITE is generally intractable. In recent work, we proposed an estimator of the ITE, referred to as the ITE Neural Estimator (ITENE), that is based on a variational bound on the KL divergence, two-sample neural network classifiers, and the pathwise estimator of Monte Carlo gradients.
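
For reference, the standard TE (not the ITE) is easy to estimate for short histories and discrete sequences. The sketch below is a plug-in estimator of the TE from a sequence x to a sequence y with history length 1, i.e., of the conditional mutual information between the previous value of x and the current value of y given the previous value of y. The function name and the history length are assumptions for illustration, and the sketch is unrelated to the neural estimator proposed in the paper.

```python
from collections import Counter
import math
import numpy as np

def transfer_entropy(x, y):
    """Plug-in estimate (in bits) of I(x_{t-1}; y_t | y_{t-1}) for discrete
    sequences, i.e., the TE from x to y with history length 1."""
    triples = list(zip(x[:-1], y[:-1], y[1:]))   # (x_past, y_past, y_now)
    n = len(triples)
    c_xyz = Counter(triples)
    c_xy = Counter((xp, yp) for xp, yp, _ in triples)
    c_yz = Counter((yp, yn) for _, yp, yn in triples)
    c_y = Counter(yp for _, yp, _ in triples)
    te = 0.0
    for (xp, yp, yn), count in c_xyz.items():
        # Ratio p(y_now | x_past, y_past) / p(y_now | y_past)
        ratio = (count / c_xy[(xp, yp)]) / (c_yz[(yp, yn)] / c_y[yp])
        te += (count / n) * math.log2(ratio)
    return te

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20_000)
y_copy = np.roll(x, 1)                       # y_t = x_{t-1}: about 1 bit flows
y_indep = rng.integers(0, 2, size=20_000)    # no flow from x to y
print(transfer_entropy(x, y_copy), transfer_entropy(x, y_indep))
```

Separating the intrinsic part of this quantity from the synergistic part is the step that has no such simple plug-in solution, which motivates the neural estimator.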


Some Results

We first apply the proposed estimator to a toy example in which the joint processes are generated according to a rule that depends on a threshold λ and on independent and identically distributed variables. Intuitively, for large values of the threshold λ, there is no information flow between the two sequences, while for small values, there is a purely intrinsic flow of information. For intermediate values of λ, the information flow is partly synergistic, since knowing the past of both sequences is instrumental in obtaining information about the current value of the target sequence. As illustrated in Fig. 1, the results obtained from the estimator are consistent with this intuition.

Figure 1


Figure 2

For a real-world example, we apply the estimators at hand to historic data of the values of the Hang Seng Index (HSI) and of the Dow Jones Index (DJIA) between 1990 and 2011 (see Fig. 2). As illustrated in Fig. 3, both the TE and ITE from the DJIA to the HSI are much larger than in the reverse direction, implying that the DJIA influenced the HSI more significantly than the other way around for the given time range. Furthermore, we observe that not all the information flow is estimated to be intrinsic, and hence the joint observation of the history of the DJIA and of the HSI is partly responsible for the predictability of the HSI from the DJIA.

Figure 3

The full paper will be presented at the 2020 International Zurich Seminar on Information and Communication and can be found here.

Compute With Time, Not Over It: An Introduction to Spiking Neural Networks


Artificial Neural Networks (ANNs) have become the de-facto standard tool to carry out supervised, unsupervised, and reinforcement learning tasks. Their recent successes have built upon various algorithmic advances, but have also heavily relied on the unprecedented availability of computing power and memory in data centers and cloud computing platforms. The resulting considerable energy requirements run counter to the constraints imposed by implementations on low-power mobile or embedded devices for applications such as personal health monitoring or neural prosthetics.

How can the human brain perform general and complex tasks at a minute fraction of the power required by state-of-the-art supercomputers and ANN-based models? Neurons in the human brain are different from those in an ANN: they process and communicate using sparse spiking signals over time, rather than real numbers; and they are dynamic devices, rather than static non-linearities (see Figure 1). Taking inspiration from this observation, Spiking Neural Networks (SNNs) have been introduced in the theoretical neuroscience literature as networks of dynamic spiking neurons that enable efficient on-line inference and learning. SNNs have the unique capability to process information encoded in the timing of spikes, with the energy per spike being as low as a few picojoules. Proof-of-concept and commercial hardware implementations of SNNs (e.g., by Intel and IBM) have demonstrated orders-of-magnitude improvements in terms of energy efficiency over ANNs.

Figure 1. Illustration of neural networks: (left) an ANN, where each neuron processes real numbers; and (right) an SNN, where dynamic spiking neurons process and communicate binary sparse spiking signals over time.

The most common SNN model consists of a network of neurons with deterministic dynamics, e.g., the leaky integrate-and-fire model, whereby a spike is emitted as soon as an internal state variable, known as the membrane potential, crosses a given threshold value. Learning problems should be formulated as the minimization of a loss function that directly accounts for the timing of the spikes emitted by the neurons. While this minimization can be carried out using Stochastic Gradient Descent (SGD) as for ANNs, it is made challenging by the non-differentiability of the behavior of spiking neurons with respect to the synaptic weights. In contrast to deterministic models, a probabilistic model for SNNs defines the outputs of all spiking neurons as jointly distributed binary random processes whose distribution is differentiable in the synaptic weights. A probabilistic viewpoint hence has significant analytic advantages, in that flexible learning rules can be derived from principled learning criteria such as likelihood and mutual information.

Some Results

Our recent work, published in the IEEE Signal Processing Magazine (SPM) Special Issue on Learning Algorithms and Signal Processing for Brain-Inspired Computing, provides a review of probabilistic SNNs with a specific focus on the most commonly used Generalized Linear Models (GLMs), covering probabilistic models, learning rules, and applications.

Figure 2. Illustration of the neurons with probabilistic dynamics with exponential feedforward and feedback kernels.

As illustrated in Figure 2, in a GLM, any post-synaptic neuron i receives the signals emitted by pre-synaptic neurons through synapses. Its internal state, which determines its probability to spike, is given by the membrane potential. This is the sum of contributions from the incoming spikes of the pre-synaptic neurons and from the past spiking behavior of the neuron itself, which are filtered by feedforward and feedback kernels, respectively. Under the GLM, the gradient of the log-likelihood of the spiking signals depends on the difference between the desired spiking behavior and the average behavior of the neuron under the model.
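
A minimal simulation of such a neuron is sketched below. It assumes exponential kernels implemented as recursive filters with a common decay factor, a sigmoid link between membrane potential and spiking probability, and arbitrary parameter values. These choices and all names are illustrative and may differ from the model specification in the paper.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def simulate_glm_neuron(pre_spikes, w, w_fb, bias, decay, rng):
    """pre_spikes: binary array (num_pre, T). Returns the post-synaptic spikes,
    the spiking probabilities, and the filtered pre-synaptic traces."""
    num_pre, T = pre_spikes.shape
    pre_trace = np.zeros(num_pre)   # exponentially filtered pre-synaptic spikes
    post_trace = 0.0                # exponentially filtered own past spikes
    spikes = np.zeros(T, dtype=int)
    probs = np.zeros(T)
    traces = np.zeros((num_pre, T))
    for t in range(T):
        u = w @ pre_trace + w_fb * post_trace + bias   # membrane potential
        probs[t] = sigmoid(u)
        spikes[t] = rng.random() < probs[t]
        traces[:, t] = pre_trace
        # Update the filtered traces with the spikes observed at time t
        pre_trace = decay * pre_trace + pre_spikes[:, t]
        post_trace = decay * post_trace + spikes[t]
    return spikes, probs, traces

rng = np.random.default_rng(0)
pre_spikes = rng.integers(0, 2, size=(3, 100))
spikes, probs, traces = simulate_glm_neuron(
    pre_spikes, w=np.array([0.8, -0.5, 0.3]), w_fb=-1.0, bias=-1.0,
    decay=0.5, rng=rng)
# Gradient of the log-likelihood w.r.t. w: sum over time of
# (observed spike - spiking probability) * filtered pre-synaptic trace
grad_w = traces @ (spikes - probs)
```

The last line is the "desired minus average behavior" structure of the gradient, weighted by the filtered input.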

SNNs can be trained using supervised, unsupervised, and reinforcement learning by following a learning rule, which defines how the model parameters are updated on the basis of the available observations, either in batch mode or in an on-line fashion. Our work derives Maximum Likelihood learning rules using SGD in batch and on-line modes, for both fully observed and partially observed SNNs. The learning rules can be interpreted in light of the general form of the three-factor rule: the synaptic weight w_{j,i} from pre-synaptic neuron j to post-synaptic neuron i is updated as w_{j,i} ← w_{j,i} + η × ℓ × pre(j) × post(i), where η is a learning rate; ℓ is a scalar global learning signal, which is absent in the case of fully observed SNNs; pre(j) is given by the filtered feedforward trace of the pre-synaptic neuron j; and post(i) is given by the error term of the post-synaptic neuron i that appears in the gradient above. In the case of partially observed SNNs, variational inference is needed to approximate the true posterior distribution by means of a variational posterior. With a feedforward distribution for the variational posterior, we derive the learning rule using doubly stochastic gradient descent, whereby the global learning signal is obtained by sampling the spiking signals of the unobserved neurons.
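
For a single, fully observed neuron, the three-factor form reduces to a product of a pre-synaptic trace and a post-synaptic error, as in the sketch below. The teacher-student setup, the absence of a feedback kernel, the sigmoid link, and all numerical values are assumptions made to keep the example short. The global learning signal is set to 1 to reflect the fully observed case.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def online_ml_update(w, pre_trace, post_spike, lr, learning_signal=1.0):
    """One three-factor update for the synapses of a single post-synaptic neuron."""
    # post(i): desired spike minus average behavior under the current model
    post_error = post_spike - sigmoid(w @ pre_trace)
    return w + lr * learning_signal * pre_trace * post_error

rng = np.random.default_rng(0)
w_teacher = np.array([2.0, -2.0])   # hypothetical weights generating the desired spikes
w = np.zeros(2)
pre_trace = np.zeros(2)
for t in range(20_000):
    # Filtered feedforward trace of two pre-synaptic neurons
    pre_trace = 0.5 * pre_trace + rng.integers(0, 2, size=2)
    desired = float(rng.random() < sigmoid(w_teacher @ pre_trace))
    w = online_ml_update(w, pre_trace, desired, lr=0.05)
```

After training, `w` should move toward `w_teacher`, since the update follows a stochastic gradient of the log-likelihood of the desired spikes.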

Figure 3. On-line prediction task based on an SNN with 9 visible and 2 hidden neurons; (left, top) real, analog time signal (dashed) and predicted, decoded signal (solid); (left, bottom) total number of spikes emitted by the SNN; and (right) spike raster plot of the SNN.

Experiments on an on-line prediction task allowed us to observe the potential of SNNs for ‘always-on’ event-driven applications. The SNN observes a time sequence and is trained to predict the next value of the sequence given the previous values, where the time sequence is encoded in the spike domain with ΔT spike samples for each value of the sequence. In Figure 3, the SNN is seen to provide an accurate prediction (left, top), shown together with the corresponding number of spikes (left, bottom) and the spikes emitted by the SNN (right). To demonstrate the efficiency benefits of SNNs that may arise from their unique time encoding capabilities, we also compare the prediction error and the number of spikes under rate and time coding schemes.

Please refer to the full paper at IEEE Xplore (open access: arXiv) for details. The tutorial for learning algorithms and signal processing for brain-inspired computing can be found at IEEE Xplore.

Integrating Wireless Access and Edge Learning


Figure 1. Delay-constrained edge learning based on data received from a device.

The increasing number of connected devices has led to an explosion in the amounts of data being collected: smartphones, wearable devices and sensors generate data to an extent previously unseen. However, these devices often present power and computational capability constraints that do not allow them to make use of the data – for instance, to train Machine Learning (ML) models. In such circumstances, thanks to mobile edge computing, devices can rely on remote servers to perform the data processing (see Fig. 1). When the amount of data is large, or the access link slow, the amount of time required to transmit the data may be prohibitive. Given a delay constraint on the overall time available for both communication and learning, what is the joint communication-computation strategy that obtains the best performing ML model?

Pipelining communication and computation

Figure 2. Transmission and training protocol.

In a recent work to be published in IEEE Communications Letters, we propose to pipeline communication and computation with an optimized block size. We consider an Empirical Risk Minimization (ERM) problem, for which learning is carried out at the server side using Stochastic Gradient Descent (SGD). As soon as the first data block arrives at the server, training of the ML model can start. Training then continues by fetching data from all the data blocks received thus far. To provide some intuition on the problem of optimizing the block size, communicating the entire data set first reduces the bias of the training process, but it may not leave sufficient time for learning. Conversely, transmitting very few samples in each block will bias the model towards the samples sent in the first blocks, as many computation rounds will be based on these samples.
We determine an upper bound on the expected optimality gap at the end of the time limit, which gives us an indication on how far we are from an optimal model. We can then minimize this bound with regard to the communication block size to obtain an optimized value.
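To make the pipelining protocol concrete, here is a toy simulation under simplifying assumptions that are ours, not the paper's: time is slotted, sending one sample takes one slot, each block carries a fixed overhead in slots, the server performs one SGD step per slot on a least-squares loss, and all names are hypothetical.

```python
import numpy as np

def pipelined_sgd(X, y, block_size, overhead, time_budget, lr=0.05, seed=0):
    """Run SGD at the server while data blocks are still arriving.

    A block of `block_size` samples is delivered every `block_size + overhead`
    slots; each SGD step samples uniformly from the samples received so far.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    received = 0
    block_time = block_size + overhead
    for t in range(1, time_budget + 1):
        # A block becomes usable once it has been fully transmitted.
        if received < n and t % block_time == 0:
            received = min(n, received + block_size)
        if received > 0:
            i = rng.integers(received)           # index among received samples
            grad = (X[i] @ w - y[i]) * X[i]      # gradient of 0.5 * (x.w - y)^2
            w -= lr * grad
    return w, received

def training_loss(X, y, w):
    """Empirical risk over the full data set."""
    return 0.5 * np.mean((X @ w - y) ** 2)

# Toy data: noiseless linear model with 100 samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w_hat, n_received = pipelined_sgd(X, y, block_size=20, overhead=5, time_budget=600)
```

Sweeping `block_size` in such a simulation and comparing `training_loss` at the deadline is the kind of exhaustive search that minimizing the bound is meant to avoid.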

Some results

Figure 3. Training loss versus training time for different values of the block size. Solid line: experimental and theoretical optima.

Numerical experiments allowed us to compare the optimal block size found using the bound with a numerically determined optimal value found by running Monte Carlo experiments over all possible block sizes. Determining the optimal value through an extensive search over the possible block sizes allowed a gain of 3.8% in terms of the final training loss in one of our experiments (see Fig. 3). This small gain comes at the cost of a burdensome parameter optimization that took days on an HPC cluster. Minimizing the proposed bound takes seconds.
We further determined experimentally that our results, which were derived for convex loss functions satisfying the Polyak-Łojasiewicz condition, can be extended to non-convex models. As an example (not found in the paper), we studied the problem of training a multilayer perceptron with non-linear activations according to our scheme (see Fig. 4). Using the same dataset as described in the paper, we train a two-layer perceptron with ReLU activation for the first layer and linear activation for the second. The experiments show a behaviour similar to that of the convex example discussed in the main text. In particular, the derived bound predicts well the existence of an optimal value of the block size (see crosses).

Figure 4. Training loss versus block size for different overhead sizes, for an MLP with non-linear activations.

The full paper can be found here.

Meta-learning: A new framework for few-pilot transmission in IoT networks


Fig. 1: Illustration of few-pilot training for an IoT system via meta-learning

For channels with an unknown model or an unavailable optimal receiver of manageable complexity, the design of demodulation and decoding can potentially benefit from a data-driven approach based on machine learning. Machine learning solutions, however, cannot be directly applied to Internet-of-Things (IoT) scenarios in which devices transmit sporadically using short packets with few pilot symbols. In fact, the few pilots do not provide enough data for training the receiver.

A Novel Solution based on Meta-learning

Fig. 2: MAML seeks an initial value θ that minimizes the loss L𝑘(θ′𝑘) for all devices 𝑘 after one update step. In contrast, joint training optimizes the cumulative loss L1(θ) + L2(θ).

In a recent work to be presented at IEEE SPAWC 2019, we proposed a novel solution for demodulation in IoT networks that is based on the model-agnostic meta-learning (MAML) algorithm. The key idea is to use pilots from previous transmissions of other IoT devices as meta-training data in order to learn a demodulator that is able to quickly adapt to the end-to-end channel conditions of a new device from few pilots. MAML derives an inductive bias in the form of an initialization point for a neural network-based demodulator. As illustrated in Fig. 2, MAML seeks an initialization point such that the performance losses of the demodulators for all IoT devices, obtained after one update, are collectively minimized. In comparison, a more conventional approach to using meta-training data, namely joint training, would pool together all the pilots received from the meta-training devices and seek to minimize the cumulative loss.
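The difference between the two objectives can be seen on a toy problem. The sketch below is not the neural demodulator from the paper: it assumes a scalar parameter and hypothetical per-device quadratic losses L_k(θ) = 0.5 a_k (θ − c_k)², for which the gradient of the post-update loss is available in closed form.

```python
import numpy as np

def grad(theta, a, c):
    """Per-device gradient of L_k(theta) = 0.5 * a_k * (theta - c_k)^2."""
    return a * (theta - c)

def adapt(theta, a, c, inner_lr):
    """One gradient step per device, starting from the shared initialization."""
    return theta - inner_lr * grad(theta, a, c)

def post_adaptation_loss(theta, a, c, inner_lr):
    """Sum over devices of the loss after one adaptation step."""
    theta_k = adapt(theta, a, c, inner_lr)
    return np.sum(0.5 * a * (theta_k - c) ** 2)

def joint_training(a, c, lr=0.05, steps=2000):
    """Minimize the cumulative loss sum_k L_k(theta)."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * np.sum(grad(theta, a, c))
    return theta

def maml(a, c, inner_lr=0.5, lr=0.05, steps=2000):
    """Minimize sum_k L_k(theta'_k), with theta'_k the one-step adapted parameter."""
    theta = 0.0
    for _ in range(steps):
        theta_k = adapt(theta, a, c, inner_lr)
        # Chain rule: d theta'_k / d theta = 1 - inner_lr * a_k for quadratics.
        meta_grad = grad(theta_k, a, c) * (1.0 - inner_lr * a)
        theta -= lr * np.sum(meta_grad)
    return theta

a = np.array([1.0, 2.0])    # curvatures of the two hypothetical devices
c = np.array([-1.0, 1.0])   # per-device optimal parameters
theta_joint = joint_training(a, c)
theta_maml = maml(a, c)
```

In this example the second device reaches its optimum in one step from any starting point, so MAML places the initialization at the first device's optimum (θ = −1), whereas joint training settles at the curvature-weighted average (θ = 1/3), which leaves a larger loss after adaptation.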

Some Results

To give a taste of the results in the paper, we now provide an example.

Fig. 3: Probability of symbol error with respect to number of pilots for the  meta-test device (see paper).

In Fig. 3, we plot the probability of symbol error with respect to the number of pilots for a new IoT device in the offline scenario. We adopt 16-QAM with 100 meta-training devices, each with 32 pilots for meta-training. We compare the performance of state-of-the-art meta-learning approaches, including MAML, with: (i) a fixed initialization scheme where data from the meta-training devices is not used; (ii) joint training with the meta-training dataset as described above.

All of the meta-learning schemes are seen to vastly outperform the baseline approaches (i) and (ii) by adapting to the channel of the meta-test device using only a few pilots. In contrast, joint training performs similarly to fixed initialization. This confirms that, unlike conventional solutions, meta-learning can effectively transfer information from meta-training devices to a new target device.


Fig. 4: Average probability of symbol error with respect to average number of pilots over slots t=71, …, 90 for online meta-learning (see paper).

In Fig. 4, we plot the probability of symbol error with respect to the average number of pilots in the online scenario. Through comparison with the fixed initialization case, we show that the proposed adaptive pilot number selection scheme can reduce the pilot overhead of any online scheme. Moreover, when the proposed scheme is combined with online meta-learning, the pilot overhead is reduced even further with negligible performance degradation. This again confirms that meta-learning can acquire a useful inductive bias from previous IoT devices.

The full paper can be found here.
