How massive automation helps you achieve the best quality out of your network.
Author: David Maisonneuve, CTO
Growing complexity of the networks.
From the different generations of mobile networks to landline networks and now WAN IP networks, a global operator’s network is an impressive mixture of technologies and network equipment.
Now that 4G networks are fully deployed, 5G is around the corner, with the first implementations scheduled for 2018. So how can operators manage such network complexity and get the best quality out of their networks while maintaining or even decreasing their operational costs?
Understanding your network – network essentials
The first challenge for the operator is to understand its network in order to control, manage and troubleshoot it, and ultimately to offer the best end-to-end quality of service to its customers. Understanding your network means knowing your network (its topology, its parameters, its KPIs, the customers’ behavior on your network, the triggers to act on…) and seeing what is occurring on your network (real-time events, planned and completed civil works, failures, …).
- The first step of understanding your network is to have the right configuration management database: every NE (Network Element) must be stored in the database with its main parameters (IP configuration for instance) and of course the relationships between those NEs (network topology). The more automatic the link between the network and this database, the better. Unfortunately, some manual actions are still needed to populate or update the database, and the more manual interventions there are, the greater the possibility of mistakes.
- The second step of understanding your network is to have the right “ticketing” database, to know in real time what is occurring on your network, both in terms of civil works and network operations and in terms of equipment failures recorded in trouble tickets. Keeping this information updated in real time is a real challenge, especially for the civil works part, which is often performed manually (by subcontractors), but it is mandatory for the QoS of the network.
- The third step is fault management: supervising the “health” of each individual NE. Fault management (alarms, traps, …) is the classical way of keeping a network up and running. With a large and complex network, there is a need for a unified OSS fault manager that consolidates the Network Management Systems from multiple equipment vendors.
- The fourth step is the performance monitoring of the network, by measuring each individual NE. This is done either using external probes (from third-party vendors) or through the NEs themselves. At this stage you can get two types of information: CDRs (Call Data Records, which describe a network transaction such as a voice call or an IP transaction, and are usually provided by probes or by functionality embedded in routers, such as NetFlow) and counters (an NE periodically produces many different counters that may be retrieved by SNMP or through the vendor management system (OMC)).
This data (CDRs and counters) is the most precise information available to understand the network. On the other hand, it is also the most voluminous, amounting to terabytes, and for that reason it is often neglected because it is difficult to manage.
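To make the first step more concrete, here is a minimal sketch of a configuration management database held in memory, storing NEs with their main parameters and the topology links between them. The class names, field names and sample identifiers (NetworkElement, Cmdb, "router-1", …) are assumptions made for illustration and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkElement:
    """One NE with its main parameters (hypothetical fields)."""
    ne_id: str
    ne_type: str
    ip_address: str


@dataclass
class Cmdb:
    """Toy in-memory CMDB: NEs plus undirected topology links."""
    elements: dict = field(default_factory=dict)
    links: set = field(default_factory=set)

    def add_element(self, ne: NetworkElement) -> None:
        self.elements[ne.ne_id] = ne

    def add_link(self, a: str, b: str) -> None:
        # Refuse links to unknown NEs so the topology stays consistent.
        for ne_id in (a, b):
            if ne_id not in self.elements:
                raise KeyError(f"unknown NE: {ne_id}")
        self.links.add(frozenset((a, b)))

    def neighbors(self, ne_id: str) -> set:
        """Return the ids of all NEs directly linked to ne_id."""
        result = set()
        for link in self.links:
            if ne_id in link:
                result |= link - {ne_id}
        return result


cmdb = Cmdb()
cmdb.add_element(NetworkElement("router-1", "router", "10.0.0.1"))
cmdb.add_element(NetworkElement("bts-1", "base_station", "10.0.1.1"))
cmdb.add_element(NetworkElement("bts-2", "base_station", "10.0.1.2"))
cmdb.add_link("router-1", "bts-1")
cmdb.add_link("router-1", "bts-2")
print(sorted(cmdb.neighbors("router-1")))
```

Rejecting a link to an NE that is not yet in the database is one simple way to limit the manual mistakes mentioned above.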
Constant observation of the network
A network “lives” because of its subscribers’ varying behavior and also because it’s constantly being upgraded:
- Adding new technologies (2G, 3G, 4G and soon to come 5G …)
- Capacity addition to face the traffic growth (upgrading the hardware capacity, adding new transmission links, …)
- External events putting local stress on your network (bad weather conditions, population density in one place, acts of vandalism…)
- Equipment failures
As everybody knows, the biggest problems often occur after an intervention on the network. Because it is not possible to eliminate every mistake or human error, it is mandatory to have the best possible continuous monitoring of the network in order to detect problems in real time and repair them as soon as possible:
- Permanent fault management monitoring,
- Post-works validation (work on the network is the main cause of failures)
- Network capacity monitoring (traffic prediction allows the operator to have enough time to add capacity to prevent saturation) => capacity planning
- Individual NE performance (not all problems are seen through fault management alarms; performance monitoring may also help detect work done incorrectly or wrongly referenced in the ticketing database, which unfortunately is always an issue with massive subcontracting)
Given the complexity of the network and the huge number of NEs, this continuous monitoring may be a very large task and require many people. That’s why the best possible automation is mandatory.
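The capacity monitoring item above relies on traffic prediction. As a rough illustration, the sketch below fits a straight line to weekly peak load and estimates how many weeks remain before the trend reaches the link capacity. The linear trend, the function name and the sample figures are assumptions; a real capacity planning tool would likely use richer models (seasonality, events, and so on).

```python
def weeks_until_saturation(weekly_peak_load, capacity):
    """Fit a least-squares line to weekly peak load and estimate
    how many weeks remain before the line crosses capacity.

    Returns None when the trend is flat or decreasing.
    """
    n = len(weekly_peak_load)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weekly_peak_load) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, weekly_peak_load))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Week index at which the fitted line reaches capacity.
    crossing = (capacity - intercept) / slope
    # Count from the last observed week (index n - 1).
    return max(0.0, crossing - (n - 1))


# Hypothetical link: peak load in Mbit/s over five weeks, 1000 Mbit/s capacity.
load = [600, 650, 700, 750, 800]
print(weeks_until_saturation(load, 1000))
```

With this sample data the load grows by 50 Mbit/s per week, so the estimate is four weeks, which is the lead time the operator would have to add capacity.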
Necessary Automation of tasks
To manage your network efficiently and cost-effectively, automation is the key. Many tasks can, or must, be automated as far as possible:
- Fault management and intelligent automatic ticketing
The main idea here is to avoid manual trouble-ticket creation. With intelligent rules it is possible to create a trouble ticket automatically according to a particular situation (detected incident, equipment-specific faulty behavior, …), and in some cases the machine can decide to send an engineer for an on-site intervention.
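A very small sketch of such rule-based ticketing is shown here. The alarm fields, alarm codes, severity scale and the three rules are invented for the example; the point is only that each rule pairs a condition with a decision on whether to dispatch an engineer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Alarm:
    ne_id: str
    severity: str      # "critical", "major" or "minor" (assumed scale)
    code: str          # alarm code, e.g. "LINK_DOWN" (made-up values)
    repeat_count: int  # occurrences within the observation window


@dataclass
class Ticket:
    ne_id: str
    summary: str
    dispatch_engineer: bool


@dataclass
class Rule:
    name: str
    matches: Callable[[Alarm], bool]
    dispatch_engineer: bool


# Hypothetical rule set, evaluated in order: the first match wins.
RULES: List[Rule] = [
    Rule("power failure", lambda a: a.code == "POWER_FAIL", True),
    Rule("critical alarm", lambda a: a.severity == "critical", False),
    Rule("flapping", lambda a: a.repeat_count >= 5, False),
]


def evaluate(alarm: Alarm, rules: List[Rule] = RULES) -> Optional[Ticket]:
    """Return a ticket for the first matching rule, or None."""
    for rule in rules:
        if rule.matches(alarm):
            return Ticket(alarm.ne_id, f"{rule.name}: {alarm.code}",
                          rule.dispatch_engineer)
    return None


ticket = evaluate(Alarm("bts-17", "critical", "POWER_FAIL", 1))
print(ticket)
```

Alarms that match no rule return None and create no ticket, which keeps the ticketing database free of noise.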
- Detect configuration management problems
Many quality problems are due to wrong parameter sets downloaded into the equipment. The best way to avoid these problems is to audit the parameters against a reference database at least once a day and, in case of discrepancy, either raise an alarm or force the reference database parameters into the equipment. Of course, you must have total faith in this reference database. Common problems include wrong handover relationships, wrong transmission parameters and bad power configuration. You can use SON (Self-Organizing Network) tools to manage the parameters of the mobile network.
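The daily audit can be pictured as a simple comparison between two dictionaries. In this sketch the parameter names (tx_power_dbm, tilt_deg), the cell identifiers and both functions are hypothetical; fetching the live values from the equipment is out of scope and replaced by sample data.

```python
def audit_parameters(reference, live):
    """Compare live NE parameters with the reference database.

    Both arguments map ne_id -> {parameter_name: value}.
    Returns a list of (ne_id, parameter, expected, actual) tuples;
    actual is None when the parameter is missing on the equipment.
    """
    discrepancies = []
    for ne_id, expected_params in sorted(reference.items()):
        actual_params = live.get(ne_id, {})
        for name, expected in sorted(expected_params.items()):
            actual = actual_params.get(name)
            if actual != expected:
                discrepancies.append((ne_id, name, expected, actual))
    return discrepancies


def enforce_reference(live, discrepancies):
    """Overwrite live values with reference values (the 'force' option)."""
    for ne_id, name, expected, _actual in discrepancies:
        live.setdefault(ne_id, {})[name] = expected


reference = {"cell-1": {"tx_power_dbm": 43, "tilt_deg": 4},
             "cell-2": {"tx_power_dbm": 40, "tilt_deg": 2}}
live = {"cell-1": {"tx_power_dbm": 43, "tilt_deg": 6},
        "cell-2": {"tx_power_dbm": 40, "tilt_deg": 2}}

issues = audit_parameters(reference, live)
print(issues)                    # each entry could instead raise an alarm
enforce_reference(live, issues)  # or be corrected automatically
```

Whether a discrepancy raises an alarm or is corrected automatically is a policy decision; forcing values is only safe when the reference database can be fully trusted.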
- Automatically detect and prioritize abnormal NE behaviors through performance monitoring (Counters / CDR)
A network can have hundreds of thousands of pieces of equipment (a large European mobile operator can have several hundred thousand radio cells). As it is impossible to check all those NEs manually, it is mandatory to automate this process.
Of course, with a very large network, thousands of problems are automatically detected every day, so prioritization is also mandatory. To prioritize, you can use the severity of the problem and show the most critical first; you can also automatically link the detected problems with ongoing or past network operations. As urgency is often a matter of threshold, you must first be able to evaluate the importance of a problem (in correlation with other information such as tickets and customer complaints).
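One possible way to automate detection and prioritization is sketched here: each NE's latest KPI value is compared with its own history, and abnormal NEs are ranked so that those linked to recent network operations come first. The deviation measure (number of standard deviations), the threshold of 3 and the ranking order are assumptions chosen for the example, not a recommended tuning.

```python
from statistics import mean, pstdev


def anomaly_score(history, current):
    """Number of standard deviations between current and the NE's history."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if current == mean(history) else float("inf")
    return abs(current - mean(history)) / sigma


def prioritize(kpi_history, kpi_now, recent_work, threshold=3.0):
    """Return abnormal NEs, most urgent first.

    kpi_history: ne_id -> list of past KPI values (e.g. drop rate in %)
    kpi_now:     ne_id -> latest KPI value
    recent_work: set of ne_ids with a recent intervention ticket
    """
    findings = []
    for ne_id, current in kpi_now.items():
        score = anomaly_score(kpi_history[ne_id], current)
        if score >= threshold:
            findings.append((ne_id, score, ne_id in recent_work))
    # NEs linked to recent work first, then highest deviation.
    findings.sort(key=lambda f: (not f[2], -f[1]))
    return findings


# Hypothetical drop-rate samples (%) for three cells.
history = {"cell-a": [1.0, 1.2, 0.8, 1.0],
           "cell-b": [2.0, 2.2, 1.8, 2.0],
           "cell-c": [1.0, 1.1, 0.9, 1.0]}
now = {"cell-a": 1.1, "cell-b": 4.0, "cell-c": 1.6}

for ne_id, score, after_work in prioritize(history, now, {"cell-c"}):
    print(ne_id, round(score, 1), "recent work" if after_work else "")
```

In this sample, cell-a stays within its normal range and is not reported, while cell-c is listed before cell-b because it was touched by a recent intervention.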
- Handover Relation automatic correction (for mobile networks)
As the network grows, you constantly add new radio sites, or new technologies on existing sites, and therefore need to adapt the cells’ neighboring plan. This is usually done by the radio planning team, based on accurate theoretical tools, but the result must be compared with the network reality. Mixing theory and neighbor measurements helps detect relationship problems (missing neighbors or useless neighbors).
Automatic neighbor relation tools can calculate the optimal neighbor sets for each cell in the network (2G, 3G & 4G) and automatically download them to the network.
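The comparison between the planned neighbor list and the measurements can be sketched as a set operation. Here measurement_reports is assumed to count how often each surrounding cell was reported as a strong candidate by terminals served by the cell under review; the input format, the min_reports threshold and the cell names are hypothetical.

```python
def review_neighbors(planned, measurement_reports, min_reports=50):
    """Compare the planned neighbor list of one cell with measurements.

    planned:             set of neighbor cell ids from the radio plan
    measurement_reports: cell id -> number of reports in which that cell
                         was seen as a strong candidate
    Returns (missing, useless) as sorted lists.
    """
    strong = {cell for cell, count in measurement_reports.items()
              if count >= min_reports}
    # Seen often in the field but absent from the plan.
    missing = sorted(strong - planned)
    # Declared in the plan but never seen in the field.
    useless = sorted(cell for cell in planned
                     if measurement_reports.get(cell, 0) == 0)
    return missing, useless


planned = {"B", "C", "D"}
reports = {"B": 900, "C": 0, "E": 300, "F": 3}
print(review_neighbors(planned, reports))
```

Cell F is ignored because it is reported too rarely, which illustrates why a threshold is needed before neighbor lists are changed automatically.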
- Automatic network reporting
You cannot speak of network quality without good reporting. As network quality reports are always based on KPIs and technical information from the network, they can be fully automated. From high-level management reports to complex technical reports, you must have tools that can automatically produce and distribute them.
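As a simple illustration of a report built entirely from counters, the sketch below aggregates call attempts and dropped calls per region and renders a plain-text drop-rate table. The counter names, regions and layout are assumptions; distribution of the report (by e-mail or portal, for instance) is left out.

```python
def build_report(counters):
    """Build a plain-text drop-rate report.

    counters: list of dicts with keys region, call_attempts, dropped_calls.
    """
    totals = {}
    for row in counters:
        attempts, drops = totals.get(row["region"], (0, 0))
        totals[row["region"]] = (attempts + row["call_attempts"],
                                 drops + row["dropped_calls"])
    lines = [f"{'Region':<12}Drop rate"]
    for region in sorted(totals):
        attempts, drops = totals[region]
        rate = 100.0 * drops / attempts if attempts else 0.0
        lines.append(f"{region:<12}{rate:.2f}%")
    return "\n".join(lines)


# Hypothetical hourly counters collected from three cells.
counters = [
    {"region": "North", "call_attempts": 1000, "dropped_calls": 10},
    {"region": "North", "call_attempts": 3000, "dropped_calls": 50},
    {"region": "South", "call_attempts": 2000, "dropped_calls": 10},
]
report = build_report(counters)
print(report)
```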
- Automatic database population
Keeping the configuration and parameter databases up to date is always a challenge, mainly because there is still a lot of human intervention. As there is no miracle solution, it is very important to maximize the automated parts. This can be done with provisioning tools that populate the reference database (according to a set of rules) before the equipment is even installed in the field. And if you have an efficient auditing tool (see above), you should be able to minimize the remaining problems.
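Rule-based provisioning can be pictured as follows: for each planned site, default parameters are chosen by equipment type and the next free address of a management subnet is assigned, so the reference record exists before installation. The rule table, the subnet and the site plan format are hypothetical.

```python
import ipaddress

# Hypothetical rule set: default parameters per equipment type.
DEFAULTS_BY_TYPE = {
    "macro": {"tx_power_dbm": 43},
    "small_cell": {"tx_power_dbm": 30},
}


def provision(site_plan, subnet="10.20.0.0/24"):
    """Build reference records for planned sites before installation.

    site_plan: list of dicts with keys site_id and type.
    Addresses are taken in order from the usable hosts of subnet; the
    sketch does not handle a subnet that runs out of addresses.
    """
    hosts = iter(ipaddress.ip_network(subnet).hosts())
    records = {}
    for site in site_plan:
        params = dict(DEFAULTS_BY_TYPE[site["type"]])
        params["ip_address"] = str(next(hosts))
        records[site["site_id"]] = params
    return records


plan = [{"site_id": "site-1", "type": "macro"},
        {"site_id": "site-2", "type": "small_cell"}]
records = provision(plan)
print(records)
```

Records produced this way can then serve as the reference side of the daily parameter audit described earlier.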
Even if we’re not there yet, massive automation is the first step towards a self-healing network: a network that can automatically handle some issues without any human intervention. Some actions are already possible, such as traffic re-routing in case of transmission link failure, automatic software capacity activation or automatic equipment reset (or reboot) in some specific cases.