Introduction

Autonomous networks are characterized by their capability to self-provision, scale, heal and maintain themselves in response to external or internal factors. These self-X capabilities are mainly triggered by various events ranging from service provisioning requests to equipment or service alarms, under the larger umbrella of event-driven automation.

The ability to correlate such events across the temporal and spatial domains enables the autonomous network and the NOC engineers to identify any changes in network state and determine the appropriate response to them. Temporal correlation identifies patterns in the frequency and number of event occurrences, while spatial correlation links events at different physical and logical levels of the network topology, driving an intelligent and effective response to the observed conditions.

In this article we will leverage the reasoning capabilities of large language models (LLMs), combined with function calling and prompt engineering, to build an intelligent event correlation agent on top of a radio access network (RAN) topology using the Waylay platform. 

The network entity model

We begin by modeling the RAN topology which will act as the knowledge graph for our agent, supporting the event correlation operations. The topology is based on a set of entities and the relationships (links) defined between them. There are multiple ways to model a radio access network, based on the used technology or vendor-specific characteristics, a relevant example being the 3GPP NR NRM model

For our exercise we will adopt a more streamlined representation, shown in Figure 1. Our model is centered on the BaseStation entity, which belongs to a Site, is managed via an EMS and experiences Events, some of which may trigger associated RemediationActions. For example, a power outage event experienced by a base station may trigger the dispatch of an intervention unit to the site.

Figure 1. The RAN entity model 

For more advanced use cases, network functions like the DU and CU can be modeled, together with the UE or NetworkSlice details. 

Network topology as a knowledge graph

The next step in our project is to populate the RAN topology based on the chosen entity model. We want this topology to act as a knowledge graph for the LLM, enabling it to traverse it and perform spatial event correlations taking into account the relationships between different RAN entities.

During the GenAI-based toolkit for network & service management TM Forum Catalyst project, we found Neo4J to be the most suitable way to store graph-based network topology information like the one above. As a bonus, LLMs like OpenAI's GPT-4 or Anthropic Claude 3 possess the intrinsic capability to generate Neo4J Cypher queries, an ability that stems from their extensive training data.  

Based on environmental alarm data collected from a lab setup, we populated a Neo4J instance with the RAN topology information, as shown in Figure 2.

Figure 2. RAN topology detail, showing a base station and its associated alarm events

This network topology plays a double role. Next to storing the network representation and associated events (and ideally keeping it up to date by consuming event streams from sources like the OSS, NMS or EMSes), it also acts as a knowledge graph towards the LLM. 

By making a parallel to the concept of augmented retrieval (RAG), where a vector database acts as a knowledge base for the answers provided by the LLM, the Neo4J database containing the network topology becomes a knowledge graph, providing the LLM with the data required to answer queries related to the network state. For example, is site X currently affected by both temperature and power supply alarms? or which are the top 5 sites affected by low power alarms and in which areas?

Unlike RAG, the knowledge graph was not yet supported natively by the latest generation of LLM models at the moment this article was written. However, it can be exposed to the models via other techniques like function calling. 

Exposing network topology semantics via ontology

As mentioned before, our experience has shown that models like OpenAI's GPT-4 or Anthropic Claude 3 possess the capability to generate Neo4J Cypher queries. If we configure them with a tool (function) that allows them to execute a query against a Neo4J database, they will be able to generate a syntactically correct cypher query from the user prompt and execute it via the provided tool.

However, a syntactically correct query will not be sufficient. As the LLM does not have the innate capability to understand the database model, it cannot generate a semantically correct query, which is what we require to leverage the knowledge stored in the network topology. 

To enable the model to understand the knowledge graph, we must provide it with an ontology, which describes the entity model on which the knowledge graph is based. We can regard the ontology as a formal, textual description of Figure 1, which is provided upfront to the LLM either via the system prompt (message) or via other techniques like fine-tuning.

In our case, we provide the RAN ontology via the LLM's system message and encode it in the JSON format for a formal, unambiguous and structured representation. Below is an example of the ontology related to a subset of the RAN model, related to the Area, Site, BaseStation, EMS and Event entities and the relations between them.  

1{
2 "relationships": [
3   {
4     "description": "A Site belongs to an Area.",
5     "from": "Site",
6     "to": "Area",
7     "type": "BELONGS_TO"
8   },
9   {
10     "description": "A BaseStation belongs to a Site.",
11     "from": "BaseStation",
12     "to": "Site",
13     "type": "BELONGS_TO"
14   },
15   {
16     "description": "An Event is experienced by a BaseStation.",
17     "from": "Event",
18     "to": "BaseStation",
19     "type": "EXPERIENCED_BY"
20   },
21   {
22     "description": "A a BaseStation is managed by an EMS.",
23     "from": "BaseStation",
24     "to": "EMS",
25     "type": "MANAGED_BY"
26   }
27 ],
28 "entities": {
29   "EMS": {
30     "properties": {
31       "vendor": {
32         "description": "the name of the EMS vendor"
33       },
34       "id": {
35         "description": "the unique EMS identifier, with no semantic details"
36       }
37     }
38   },
39   "Area": {
40     "properties": {
41       "name": {
42         "description": "friendly name"
43       },
44       "id": {
45         "description": "unique identifier, with no semantic details"
46       }
47     }
48   },
49   "Event": {
50     "properties": {
51       "description": {
52         "description": "full event description"
53       },
54       "id": {
55         "description": "unique identifier, with no semantic details"
56       },
57       "status": {
58         "description": "event status",
59         "supportedValues": [
60           "Terminated",
61           "Active"
62         ]
63       },
64       "severity": {
65         "description": "event severity",
66         "supportedValues": [
67           "Critical",
68           "Major",
69           "Minor"
70         ]
71       },
72       "type": {
73         "description": "the event type",
74         "supportedValues": [
75           "alarm"
76         ]
77       },
78       "clearedOn": {
79         "description": "the event clearance timestamp, in ISO date format (may be missing for events in Active state)"
80       },
81       "name": {
82         "description": "name of the event, also indicating its nature"
83       },
84       "createdOn": {
85         "description": "the event creation timestamp, in ISO date format"
86       }
87     }
88   },
89   "Site": {
90     "properties": {
91       "name": {
92         "description": "friendly name"
93       },
94       "id": {
95         "description": "unique identifier, with no semantic details"
96       }
97     }
98   },
99   "BaseStation": {
100     "properties": {
101       "vendor": {
102         "description": "the name of the base station vendor"
103       },
104       "name": {
105         "description": "friendly name"
106       },
107       "id": {
108         "description": "unique identifier, with no semantic details"
109       },
110       "technology": {
111         "description": "the base station technology",
112         "supportedValues": [
113           "2G",
114           "3G",
115           "4G",
116           "LTE",
117           "5G"
118         ]
119       }
120     }
121   }
122 }
123}
124

This enables the LLM to understand the structure of the knowledge graph and translate user prompts into cypher queries that are semantically correct and can execute successfully on the Neo4J database. 

Building and exposing the event correlation agent

So far we have prepared the RAN model, topology data and the ontology for our LLM. It is now the time to actually build the agent using the Waylay platform. 

The agent executes as a workflow (or rule template in Waylay terminology), and is exposed via the platform APIs to a chat-like user interface. The workflow, illustrated in Figure 3, contains the following nodes:

  • fetchRanOntology - retrieves the RAN ontology JSON representation from the Waylay resources model;
  • topologyQueryDescriptor - wraps the neo4j Waylay plugin as a tool descriptor that can be passed as part of the LLM configuration, enabling the LLM to execute neo4j queries using function calling;
  • eventCorrelator - interacts with the LLM API to pass the user prompt and receive the agent's answer;
  • outputSuccess, outputError - handle the LLM invocation result and expose it as the workflow execution result towards the chat user interface.

Figure 3. Network events correlation agent

During testing we found out that the way the system message is crafted has a significant impact on the behavior and accuracy of the agent, as follows.

  • First, we observed that providing graph traversal examples next to the ontology description via prompt engineering significantly increases the accuracy and correctness of the generated cypher queries. 

  • Second, we discovered that by default, the search for event types (e.g. temperature or power alarms) was done by the LLM in a very strict way, via case sensitive string comparison. Providing fuzzy search capabilities inside event names and descriptions via Neo4J full-text search indexes, and making the LLM aware of it via the system prompt, has made this search more flexible and user-friendly.

Below is the complete system message obtained as a result of several prompt engineering iterations.  

Figure 4. The LLM system prompt 

Finally, the agent workflow is coupled via Waylay task invocation APIs to the chat user interface, enabling NOC engineers to query the state of the network. The chatbot invokes the workflow for every user question, also passing the previous message history as argument, to enable a context-aware conversation.  

Figure 4. The chat user interface of the event correlation agent

Conclusion

Knowledge graphs represent a powerful data source for intelligent LLM-based agents. Via a combination of prompt engineering, function calling and ontology definitions, a non-trivial task like spatio-temporal network event correlation can be implemented quickly using the low-code capabilities of the Waylay platform. 

Even if the agent's accuracy does not yet reach 100%, when coupled with a chat interface it can provide a significant productivity boost for NOC engineers and pave the way towards Level 5 autonomous networks.