
Learning How to Adapt Power Control in Dynamic Communication Networks

Problem

An essential property of any wireless channel is that it is a shared medium, much like the air through which sound propagates is shared among the participants of a conversation. Communication engineers must therefore deal with the resulting interference, which may substantially limit the reliability and the achievable rates of a wireless communication system. A proven remedy is to adapt the transmission power to the current channel conditions, a task successfully addressed by the data-driven methodology introduced in [1], in which the power control policy is parametrized by a random edge graph neural network (REGNN).

In our recent work to be presented at SPAWC 2021, we focus on the higher-level problem of facilitating adaptation of the power control policy. We consider the case where the topology of the network varies across periods of operation of the system, with each period being in turn characterized by time-varying channel conditions. In order to facilitate fast adaptation of the power control policy — in terms of data and iteration requirements — we integrate meta-learning with REGNN training.

Meta-learning Solution

Our meta-learning solution leverages channel state information (CSI) data from a number of previous periods to optimize an adaptation procedure that enables fast adaptation to a new topology encountered in a future period. We specifically adopt first-order meta-learning methods, namely first-order model-agnostic meta-learning (FOMAML) [2] and REPTILE [3], which parametrize the adaptation procedure via its initialization within each period. While GNNs are known to be robust to changes in the topology, the proposed integration of meta-learning and REGNNs is shown to offer significant improvements in terms of sample and iteration efficiency.
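To make the first-order meta-update concrete, below is a minimal Python/NumPy sketch of a REPTILE-style outer loop over periods. The scalar policy parameter, the quadratic surrogate loss, and the period-sampling routine are placeholders of our own; they are not the REGNN power-control policy or the training setup used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def loss_grad(phi, csi_batch):
    # Placeholder gradient: in the paper this would be the gradient of the negated
    # sum-rate achieved by the REGNN power-control policy on the CSI batch; a simple
    # quadratic stand-in keeps the sketch runnable.
    return phi - csi_batch.mean()

def adapt(theta, csi_batches, lr_inner=0.01):
    # Inner loop: a few SGD steps on CSI data from a single period.
    phi = theta.copy()
    for csi_batch in csi_batches:
        phi = phi - lr_inner * loss_grad(phi, csi_batch)
    return phi

def reptile(theta, sample_period, num_periods=1000, lr_outer=0.1):
    # Outer loop: REPTILE nudges the shared initialization toward the parameters
    # obtained after within-period adaptation (a first-order meta-update).
    for _ in range(num_periods):
        csi_batches = sample_period()
        phi = adapt(theta, csi_batches)
        theta = theta + lr_outer * (phi - theta)
    return theta

def sample_period(num_batches=5, batch_size=32):
    # Toy stand-in for a period: CSI batches drawn around a random period-specific mean.
    mu = rng.normal()
    return [mu + 0.1 * rng.standard_normal(batch_size) for _ in range(num_batches)]

theta_init = reptile(np.zeros(1), sample_period)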

Fig. 1. Sum rate as a function of the number of samples used for adaptation, for a network with dynamic size.

Some Results

The achievable sum rate as a function of the number of CSI samples used for adaptation is illustrated in Fig. 1 for a network in which the number of transmitters and receivers changes in each period. Meta-learning, via both FOMAML and REPTILE, is seen to adapt quickly to the new topology, outperforming conventional REGNN training, even when the latter is allowed to fine-tune on the new topology. This significant improvement can be attributed to the variability of the topologies observed across periods in the considered scenario, which makes the joint training approach of [1] ineffective. That said, when the number of samples available for adaptation is sufficiently large, conventional REGNN training as in [1] outperforms meta-learning: the initialization obtained by meta-learning induces a larger bias than joint training, owing to the mismatch between the conditions assumed for the updates on meta-training and meta-testing tasks (i.e., the different numbers of samples used for meta-training and for adaptation).

 

Please see the paper, available here, for more results and a more extensive analysis.

 

[1] M. Eisen and A. Ribeiro, “Optimal wireless resource allocation with random edge graph neural networks,” IEEE Transactions on Signal Processing, vol. 68, pp. 2977–2991, Apr. 2020.

[2] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. International Conference on Machine Learning (PMLR), Sydney, 6–11 August 2017, pp. 1126–1135.

[3] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv preprint arXiv:1803.02999, 2018.

 

How similar should “similar” tasks be in meta-learning? (Ask an information theorist)

Problem

Conventional learning optimizes model parameters using a training algorithm, while meta-learning optimizes the hyperparameters of a training algorithm. A meta-learner has access to data from a class of tasks, and its goal is to ensure that the resulting training algorithm, also called the base-learner, performs well on any new task from the same class. For example, the base-learner could be a stochastic gradient descent (SGD) algorithm with hyperparameters such as the initialization or the learning rate.

The tasks observed during meta-training are conventionally assumed to belong to a task environment, which defines a distribution over the class of tasks, with each task having an associated data distribution. The statistical properties of the task environment then determine the similarity between tasks. Intuitively, if the average “distance” between the data distributions of any two tasks in the environment is small, the meta-learner should be able to learn a suitable shared hyperparameter by observing a smaller number of tasks.

In our recent work accepted to ISIT 2021, we build on this observation and address the following questions for a fixed base-learner and meta-learner: How should task similarity be measured? Given the level of similarity of the tasks in the environment, how many tasks and how much data per task should be observed to guarantee that the target average population loss on new tasks can be well approximated using the available meta-training data?

The difference between the average population loss on a new, previously unseen, meta-test task and the meta-training loss on the data gathered from the meta-training tasks is the meta-generalization gap, and it is a measure of the generalization capability of the meta-learner. Our main contribution is a novel information-theoretic bound on the average absolute value of the meta-generalization gap, which explicitly captures the impact of task relatedness, the number of tasks, and the number of data samples per task on meta-generalization.
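In symbols, using schematic notation of our own choosing (which may differ from the paper's), the meta-generalization gap compares the population loss on a new meta-test task with the empirical loss computed on the meta-training data:

% L_pop(u) is the average population loss of the base-learner run with hyperparameter u
% on a new task drawn from the environment; L_emp(u | D_1, ..., D_N) is the meta-training
% loss evaluated on the N per-task data sets available at meta-training time.
\Delta\mathcal{L}(u \mid D_1,\dots,D_N) \;=\; \mathcal{L}_{\mathrm{pop}}(u) \;-\; \widehat{\mathcal{L}}_{\mathrm{emp}}(u \mid D_1,\dots,D_N)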

 

Results

Although information-theoretic bounds on the generalization performance of meta-learning have been studied previously, in both average and high-probability (PAC-Bayesian) settings, they fail to capture the impact of task similarity on the meta-generalization gap. We identify the following distinguishing components of our analysis that enable an explicit characterization of task similarity:

  • Performance metric – Earlier work on meta-learning adopts the absolute average meta-generalization gap as the performance metric, which computes the absolute value of the average of the meta-generalization gap over the selection of meta-training and meta-test tasks. By “mixing up” the tasks via first averaging over the tasks and then taking the absolute value, this metric fails to account for the dissimilarity between the training and test tasks.
    • We mitigate this drawback via a new metric, namely the average absolute value of the meta-generalization gap. This metric first computes the absolute value of the meta-generalization gap for a given selection of meta-test task and meta-training tasks, and then averages it over all such selections (see the formulas sketched after this list). By doing so, the metric distinguishes the contribution of each selection of meta-training and meta-test tasks, and thus captures the role of similarity between tasks in the generalization performance of a meta-learner. Moreover, in contrast to the absolute average meta-generalization gap, this new metric is non-vanishing in the asymptotic limit of a large number of tasks and per-task training samples. This reflects the fact that the meta-training loss cannot provide an asymptotically accurate estimate of the meta-test loss, which is evaluated on an a priori unknown task.
  • Measures of task relatedness – A task environment is said to be epsilon-related if the average “distance” between the data distributions of any two tasks in the environment is upper bounded by epsilon. We consider both KL divergence-based and Jensen-Shannon divergence-based distance measures; while the former can be unbounded, the latter is always bounded.
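In the schematic notation introduced above (again ours, not necessarily the paper's), the two performance metrics and the task-relatedness measure can be sketched as follows:

% (i) Absolute average meta-generalization gap (earlier work): average first, then absolute value.
\Delta_{\mathrm{avg}} \;=\; \bigl|\,\mathbb{E}\left[\Delta\mathcal{L}\right]\,\bigr|
% (ii) Average absolute value of the meta-generalization gap (this work): absolute value first, then average.
\Delta_{\mathrm{abs}} \;=\; \mathbb{E}\left[\,\left|\Delta\mathcal{L}\right|\,\right]
% The expectations are over the selection of meta-training and meta-test tasks (and their data).
% An environment is epsilon-related if, for any two tasks \tau and \tau' with data distributions
% P_\tau and P_{\tau'}, the average distance satisfies D(P_\tau, P_{\tau'}) \le \epsilon,
% where D is a KL- or Jensen-Shannon-based divergence.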

Using the above measures of performance and task relatedness, we obtain novel information-theoretic bounds on the average absolute value of the meta-generalization gap. The obtained bound demonstrates that (a) as the task dissimilarity parameter epsilon increases, more meta-training tasks are required to ensure meta-generalization, and (b) there exists a non-vanishing gap, arising from task dissimilarity, even in the limit of a large number of meta-training and meta-test tasks.

We also study examples in which the obtained bound can be evaluated analytically or numerically. For the example of ridge regression with a meta-learned bias, we illustrate the impact of the task dissimilarity parameter epsilon on the two performance metrics, and on their corresponding upper bounds, in the following figure. As can be seen, while the absolute average meta-generalization gap appears to be largely insensitive to task dissimilarity, our metric reveals the role of task similarity, as captured by the bounds derived in the paper.

 

Learning to Learn How To Spike

 

Problem

Machine learning models have become a ubiquitous tool for inference in end-user applications such as natural language processing, face recognition, and mobile healthcare management. In such highly personalized use cases, it is important for the model to be able to adapt to the unique features of an individual with only a minimal amount of training data. Furthermore, the devices on which these personalized inference problems arise are typically power- and/or memory-constrained, which limits the size of the model that can be used for learning on the edge.

Solution

In our recent work, accepted for publication at DSLW 2021, we focus on a solution based on spiking neural networks (SNNs) that leverages the popular model-agnostic meta-learning method to train a model that performs online inference while concurrently improving its ability to adapt to new tasks, a scheme we term online-within-online meta-learning for SNNs (OWOML-SNN). This meta-training method learns a hyperparameter initialization that can be quickly adapted to new tasks from a given family, as opposed to the standard paradigm in which a large ‘universal’ model is fixed at deployment. Biologically inspired SNN models, which operate on sparse binary signals, are known to provide benefits in terms of energy efficiency, especially when, as in our case, a local learning rule allows the model to be implemented on a neuromorphic edge processor such as Intel’s Loihi chip.

As shown in Fig. 1, the proposed system assumes a stream of tasks for the SNN to adapt to, with streaming within-task data from which it can learn for online inference. Data from past tasks is stored in a fixed-size buffer to be used for the meta-learning update. When new data for the current task arrives, multiple SNNs are instantiated from the hyperparameter initialization. One of these SNNs is used for online learning and inference, while the others implement the meta-learning update to the hyperparameter initialization for use in the next iteration.
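The loop below is a schematic Python sketch of this online-within-online procedure. The interfaces are assumptions of ours: the per-task learning rule, the buffer policy, and the first-order (REPTILE-like) meta-update are simplified stand-ins, not the SNN implementation used in the paper.

from collections import deque

def owoml_loop(hyper_init, task_stream, adapt_fn, buffer_size=10, lr_meta=0.1):
    # hyper_init : list of floats standing in for the SNN hyperparameter initialization
    # task_stream: iterable yielding data batches, one per incoming task
    # adapt_fn   : callable(params, data) -> adapted params (the within-task learning rule)
    buffer = deque(maxlen=buffer_size)          # fixed-size buffer of past-task data
    hyper = list(hyper_init)

    for data_batch in task_stream:
        # One SNN copy is adapted from the current initialization and would be used
        # for online learning and inference on the incoming task data.
        _online_model = adapt_fn(list(hyper), data_batch)

        # Further copies adapt on buffered past-task data; their adapted parameters
        # drive a first-order update of the shared initialization.
        for past_batch in buffer:
            adapted = adapt_fn(list(hyper), past_batch)
            hyper = [h + lr_meta * (a - h) for h, a in zip(hyper, adapted)]

        buffer.append(data_batch)
    return hyper

# Toy usage with a quadratic stand-in for the per-task learning rule:
def toy_adapt(params, target, lr=0.05, steps=5):
    for _ in range(steps):
        params = [p - lr * (p - target) for p in params]
    return params

meta_init = owoml_loop([0.0, 0.0], task_stream=[0.5, -0.3, 0.8], adapt_fn=toy_adapt)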

Results

The performance of OWOML-SNN is compared to conventional per-task training and to joint training. Under conventional training, a single SNN model is randomly initialized and learns only from the data available for the current task in order to perform online inference. Under joint training, the SNN learns a so-called ‘universal’ initialization by standard training across all previous tasks and then performs per-task adaptation. We consider the family of Omniglot character 2-way classification tasks with rate encoding to test the within-task generalization capabilities of our meta-learner.

The results in Fig. 2 report the test accuracy during within-task training, starting from the converged meta-learned initialization and the joint-training initialization. They show that the meta-learned initialization enables the fastest online adaptation of the three schemes, providing an improvement in accuracy over conventional training that is about 18% larger than that of joint training after training on 5 examples from each class.

Please see our paper for further details.

 

Bleema Rosenfeld

Sensing, Communicating, and Classifying with Spikes

Or how to remotely classify data with 80% accuracy and zero latency at Signal-to-Noise Ratios (SNRs) as low as −8 dB.

Problem
The development of Internet of Things (IoT) systems, with applications ranging from personal healthcare and wearable devices to drone-based monitoring, is driving research efforts on edge-based machine learning. In such systems, data may be collected by battery-powered sensors and processed at a remote device, which may itself be energy-constrained. Standard hardware implementations pose major energy and latency limitations for such applications.
In our new paper, recently accepted at Asilomar 2020, we investigate a novel solution based on neuromorphic sensors, processors, and transmitters/receivers. In neuromorphic sensors, spikes (i.e., binary signals) mark the occurrence of a relevant event, e.g., a significant change in a pixel for a neuromorphic camera; extremely low energy is consumed when the monitored scene is idle. In neuromorphic processors, known as Spiking Neural Networks (SNNs), spiking signals are processed via dynamic neural models for the detection of spatio-temporal patterns. SNNs have recently emerged as a biologically plausible alternative to artificial neural networks (ANNs), with significant benefits in terms of energy efficiency and latency. Finally, for communications, pulses, or spikes, can encode information for radio signalling via low-power Impulse Radio (IR). Commercial products are available for all of these blocks, including DVS cameras, Intel’s Loihi SNN chip, and transceivers implementing the IEEE 802.15.4z IR standard.

System model
As seen in Fig. 1, the proposed system integrates neuromorphic sensing and processing with IR transmission, and it carries out Joint Source-Channel Coding (JSCC), performing source and channel coding in a single step. The signal sensed by the neuromorphic sensor, e.g., a DVS camera, is encoded as a vector of binary spiking signals and processed by an encoding SNN that performs source and channel coding. The encoding SNN implements a probabilistic mapping determined by its parameter vector. The output of the encoding SNN is modulated using parallel IR transmissions, with each spike encoded by an IR waveform such as a Gaussian monopulse. The channel is modeled as a frequency-flat Gaussian channel. Finally, the received signals are classified via a decoding SNN whose output can be interpreted as a class index using standard methods for SNN-based classification. For example, rate decoding predicts a class by selecting the neuron in the output layer with the largest number of spikes.
The proposed system in Fig. 1, termed NeuroJSCC, is trained by maximizing the log-likelihood that the decoding SNN outputs desired spiking signals in response to a given input. Details on the training procedure, and the resulting algorithm, can be found in the preprint.
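For intuition, here is a schematic Python sketch of the end-to-end pipeline. Only the Gaussian channel and the rate-decoding rule are spelled out; the encoding and decoding SNNs are replaced by trivial placeholders, and all names are illustrative rather than taken from the paper's code.

import numpy as np

rng = np.random.default_rng(1)

def awgn_channel(waveform, snr_db):
    # Frequency-flat Gaussian channel: add white noise at the requested SNR.
    signal_power = np.mean(waveform ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return waveform + rng.normal(scale=np.sqrt(noise_power), size=waveform.shape)

def rate_decode(output_spikes):
    # Rate decoding: predict the class whose output neuron emitted the most spikes.
    # output_spikes has shape (num_classes, num_time_steps) with 0/1 entries.
    return int(np.argmax(output_spikes.sum(axis=1)))

def encode_snn(sensor_spikes):
    # Placeholder for the trained encoding SNN: spikes -> transmitted IR waveform.
    return sensor_spikes.astype(float)

def decode_snn(received_waveform):
    # Placeholder for the trained decoding SNN: received waveform -> output spikes.
    return (received_waveform > 0.5).astype(int)

# Toy usage: two output classes, 20 time steps of sensed binary spikes.
sensed = rng.integers(0, 2, size=(2, 20))
received = awgn_channel(encode_snn(sensed), snr_db=-8)
predicted_class = rate_decode(decode_snn(received))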

Experiments

To illustrate the advantage of the system, we focus on an example consisting of the remote detection of handwritten digits recorded by a neuromorphic camera.
We compare NeuroJSCC to two benchmark schemes:
1) Uncoded transmission: The observation is directly transmitted through the Gaussian channel using On-Off Keying (OOK), and classified using an SNN.
2) Separate Source-Channel Coding (SSCC): The encoder applies state-of-the-art quantization based on the Vector Quantization Variational Autoencoder (VQ-VAE) scheme, followed by LDPC encoding. The spiking signal is split into frames, and the scheme is applied separately to each frame. At the decoder side, frames are decoded using the belief propagation algorithm, decompressed using VQ-VAE decoding, and then classified. We consider two different classifiers, namely a conventional ANN and an SNN.
In Fig. 3, we evaluate the test accuracies at convergence obtained for different SNR levels and for the different schemes. The accuracy of uncoded transmission drops sharply at sufficiently low SNR levels. In contrast, NeuroJSCC maintains a test accuracy of 80%, even at an SNR level as low as −8 dB. SSCC with an SNN classifier suffers the most from the degradation of the SNR. Using an ANN proves more robust to low SNR levels, since an ANN can benefit from the non-binary outputs of the VQ-VAE decoder without the further loss of information due to binary quantization.
We refer to the main text for further experiments and analysis.
Code will be released shortly on our GitHub page.

Coding and Lazy Aggregation for Robust and Efficient Distributed Learning

 

Figure 1: Parameter Server (PS) computing architecture.

Problem Overview:

In order to scale machine learning to large volumes of input data, distributed implementations of gradient-based methods, e.g., Gradient Descent (GD), that leverage the parallelism of first-order optimization techniques are commonly adopted. To run GD, as illustrated in Fig. 1, multiple parallel workers compute gradients, while the Parameter Server (PS) iteratively aggregates the computed gradients and communicates the updated model parameters back to the workers. In the process, the PS computing architecture is subject to two key impairments. First, the potentially high tail of the distribution of the computing times at the workers can cause significant slowdowns in wall-clock run-time per iteration due to straggling workers. Second, the communication overhead resulting from intensive two-way communications between the PS and the workers may require significant networking resources in order not to dominate the overall run-time.
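As a point of reference, a bare-bones sketch of synchronous distributed GD at the PS is given below, using a least-squares loss as an example; all names and the loss are illustrative assumptions, not the setup analyzed in the paper.

import numpy as np

def worker_gradient(theta, X_part, y_part):
    # Each worker computes the gradient of a least-squares loss on its data partition.
    return X_part.T @ (X_part @ theta - y_part) / len(y_part)

def ps_gradient_descent(X_parts, y_parts, theta, lr=0.1, num_iters=100):
    # Parameter Server loop: collect worker gradients, aggregate, update, broadcast.
    for _ in range(num_iters):
        grads = [worker_gradient(theta, X, y) for X, y in zip(X_parts, y_parts)]
        theta = theta - lr * np.mean(grads, axis=0)
    return theta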

To jointly address these impairments, in a recent work published in IEEE Transactions on Neural Networks and Learning Systems, we study the performance of coding and lazy aggregation techniques for the PS architecture in terms of wall-clock run-time complexity, communication complexity, and computation complexity.

Main Results:

To explore the trade-off among wall-clock time, communication, and computation requirements, we provide a unified analysis of gradient coding (GC), worker grouping, and adaptive worker selection, also known as Lazily Aggregated Gradient (LAG), whose relative merits are summarized in Table I. Both GC and grouping are full-gradient approaches that aim at increasing robustness to stragglers by leveraging storage and computation redundancy. Thanks to coding, with GC only a given number of workers, dependent on the computation redundancy, need to finish their computations and send their encoded gradients to the PS at each iteration in order for the PS to retrieve the full gradient. Grouping applies data duplication and coding to groups of workers. In contrast, LAG is an approximate gradient descent scheme that judiciously selects a subset of active workers at each iteration in order to reduce communication and computation loads. By integrating all of these techniques, we propose a novel strategy, named Lazily Aggregated Gradient Coding (LAGC), which aims to explore the trade-off between the robustness to stragglers of GC and the computation and communication efficiency of LAG by generalizing both schemes.
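To illustrate the flavor of lazy aggregation, the sketch below implements a simplified skipping rule in Python: a worker uploads a fresh gradient only if it differs sufficiently from the one it last communicated, and the PS reuses stale gradients otherwise. The thresholding rule, names, and interfaces are our own simplifications, not the exact LAG or LAGC conditions from the paper.

import numpy as np

def lag_step(theta, worker_grad_fns, last_grads, lr=0.1, threshold=1e-3):
    # One lazily aggregated gradient step with a simplified skipping rule.
    # worker_grad_fns: list of callables, each returning the local gradient at theta
    # last_grads     : gradients most recently communicated by each worker (updated in place)
    for i, grad_fn in enumerate(worker_grad_fns):
        fresh = grad_fn(theta)
        # Note: the actual LAG rule decides whether an update is worth communicating
        # without requiring every worker to recompute; this check is only illustrative.
        if np.linalg.norm(fresh - last_grads[i]) > threshold:
            last_grads[i] = fresh               # worker communicates a new gradient
        # otherwise the PS reuses the stale gradient, saving a round of communication
    theta = theta - lr * np.mean(last_grads, axis=0)
    return theta, last_grads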

Figure 2: Time, communication, and computation complexity measures under the Pareto distribution.

Figure 3: Time, communication, and computation complexity measures under the exponential distribution.

As a special case, we also introduce a scheme that only uses grouping and adaptive selection, referred to as G-LAG. For illustration, we consider a linear regression model under two representative distributions of the workers' computing times, namely the Pareto and exponential distributions, which account for heavy and light tails, respectively. The time, communication, and computation complexities of the existing strategies, namely GD, GC, and LAG, and of the proposed strategies, LAGC and G-LAG, are shown in Fig. 2 and Fig. 3. It can be seen that both LAGC and G-LAG combine the benefits of gradient coding and grouping, in terms of robustness to stragglers, with the communication and computation load gains of adaptive selection (see Table I). Furthermore, G-LAG provides the best wall-clock time and communication performance, while maintaining a low computational cost.
The full paper can be found here.