A digital twin (DT) consists of a high-fidelity virtual replica of a physical entity, the physical twin (PT), such that, in the words of DT pioneer Michael Grieves, “at its optimum, any information that could be obtained from inspecting a physical manufactured product can be obtained from its digital twin” [1]. Based on a fully automatized bi-directional flow of information [2], the DT uses the data collected from the physical world to maintain an up-to-date model of the PT, which in turn provides command and analysis functionalities to the PT (see Fig.1). With the ever-growing demand in communication resources, next-generation wireless networks will be required to adapt to a large number of scenarios, and DT platforms are increasingly seen as a promising data-driven solution to build intelligent wireless systems that can offer the necessary flexibility and responsiveness.

As depicted in Fig. 1, in a recent work to be presented at the IEEE International Conference on Communications, we consider a DT that autonomously learns a model of a wireless network, providing a safe sandbox environment for network optimization and analysis, while also enabling monitoring and prediction features. The main motivation for our work stems from the realization that, in real-world scenarios, it is challenging to transfer sufficient data to and from the PT in a way that “any information that could be obtained from inspecting the PT can be obtained from its DT”. In light of this, we propose to leverage Bayesian methods to learn a DT model that is aware of “what it knows” as much as it is aware of “what it does not know”; taking into account the epistemic uncertainty arising from limited PT-to-DT communication [3].

## The Physical Twin

We consider a PT system made of a group of devices, referred to as *agents*, that attempt to communicate with a single base station (BS) over a shared multi-packet reception (MPR) channel [4]. Each agent is equipped with a limited-capacity buffer, and packet generation is taken to be correlated in-time and among agents. At any given time slot, each agent can take an *action* to decide whether or not to transmit a packet from their buffer, which can be received at the BS depending on the MPR channel dynamics. Upon packet reception, the BS transmits acknowledgement signals to the corresponding agents before the next time slot.

Agents cannot communicate with each other and each agent can only sense its local *state*, which contains information about its packet generation, buffer occupancy and BS feedback at a given time. Given the collective states and actions of all agents, the PT system evolves to a new state according to a *transition distribution* that is unknown to the DT and describes the packet-generation, buffer and channel dynamics.

## The Digital Twin

### Model Learning

During model learning (step 1 in Fig. 1), the DT leverages sequences of states and actions collected from the PT to learn a parametric *Bayesian model* of the transition distribution. As opposed to frequentist learning, which only keeps the most probable model parameter, Bayesian learning keeps a (possibly infinite) *ensemble* of models, where the probability of each model is given by a *posterior distribution*. Given that all state variables are discrete, we represent the transition distribution using a categorical model and learn the corresponding posterior using the conjugate Dirichlet distribution [3]. In order to lower the spatial complexity of the model, we leverage prior information available at the DT about state transitions like data-generation clusters, known buffer dynamics, and symmetry of the MPR channel.

### Policy Optimization: Safely Learn by Trial and Error

A medium access control (MAC) protocol at the PT can be established by providing each agent with a *policy* distribution that maps the sequence of locally observed states and actions into a new action. Using the learned model, we can safely asses new policies in virtual space by defining a *reward* function that yields positive values for desired behavior (e.g. successfully delivered packets) and negative penalties for undesired behavior (e.g. buffer overflow). Policy optimization (step 2 in Fig. 1) aims at providing an optimal policy to each agent that maximizes the expected sum of future rewards. This amounts to a *Decentralized Markov Decision Process* [5] problem that we tackle using the COunterfactual Multi-Agent (COMA) algorithm proposed in [6], in which we periodically sample a new transition distribution from the model posterior during training.

### Monitoring: Let’s Agree to Disagree

After an initial model learning phase, the DT can provide monitoring features by checking whether newly received data fits previously observed transitions, or if it rather provides evidence of changed dynamics or anomalous behavior (step 3 in Fig. 1). To this end, we use a *disagreement-based test metric* that measures to which extent the Bayesian ensemble of models agree on the likelihood of the newly observed data. A large disagreement is taken as evidence of a large epistemic uncertainty compared to model-learning conditions, which in turn can indicate that the observation is anomalous.

## Results

We evaluate the proposed DT platform on a simulated scenario consisting of 4 devices distributed across 2 data-generation clusters. The MPR channel allows for the successful delivery of one or two simultaneous packets; while more than two simultaneous transmissions cause the loss of all packets. Each device is equipped with a buffer with single-packet capacity.

During policy optimization, we reward successfully delivered packets, while we penalize *buffer overflows*, caused by generating a new packet on an already full buffer. We analyze the performance of the policy trained inside the Bayesian model across different sizes of model-learning datasets, and compare it to a policy trained inside the corresponding maximum a posteriori (MAP) *frequentist* model, and to an *oracle-aided* policy that is trained using the ground-truth transition distribution.

From Fig. 2, we observe that, in regimes with high data availability during the model learning phase, both Bayesian and frequentist model-based methods yield policies with similar performance to the oracle-aided benchmark. In the low-data regime, however, Bayesian learning achieves superior performance compared to its frequentist counterpart.

To asses the performance of anomaly detection, we assume that an anomalous event occurs where a device is disconnected, resulting in an anomalous packet-generation distribution in the corresponding cluster. We compare the performance of the disagreement metric using the Bayesian model to a log-likelihood criterion using the frequentist MAP model for model-learning datasets comprising 20 and 50 transitions and report the results in the receiver operating curves (ROC) in Fig. 3.

Bayesian anomaly detection is observed to uniformly outperform its frequentist counterpart, achieving a higher area under the ROC in Fig. 3.

For a more formal presentation of our proposed Bayesian framework for wireless networks DTs and more details on the experimental procedure, please refer to our paper at this link and to the extended version at this link.

## References

[1] M. Grieves and J. Vickers, “Digital twin: Mitigating unpredictable, undesirable emergent behavior in complex systems,” in Transdisciplinary perspectives on complex systems. Springer, 2017, pp. 85–113.

[2] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn, “Digital twin in manufacturing: A categorical literature review and classification,” IFAC-PapersOnLine, vol. 51, no. 11, pp. 1016–1022, 2018.

[3] O. Simeone, Machine Learning for Engineers. Cambridge University Press, 2022.

[4] L. Tong, Q. Zhao, and G. Mergen, “Multipacket reception in random access wireless networks: From signal processing to optimal medium access control,” IEEE Communications Magazine, vol. 39, no. 11, pp. 108–112, 2001.

[5] F. A. Oliehoek and C. Amato, A concise introduction to decentralized POMDPs. Springer, 2016.

[6] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.

## Leave a Reply