4. Running a NOC Operation
What makes a NOC effective?
Here are some basic components of effectiveness across the five core ITIL processes relevant to the NOC:
Event Monitoring and Management
Event Monitoring and Management is used by the NOC to track, identify, and manage events and issues related to the organization's infrastructure and systems.
- The events may come in the form of alarms from systems, calls from internal staff or customers, as well as emails or chats.
- The NOC team uses one or more tools, including NMSs, EMSs, and APM tools, to receive and filter messages from various infrastructure such as devices, servers, cloud instances, and applications, using different protocols like SNMP, TL1, WMI, gRPC, and gNMI.
- Once an event is detected, it is assessed, correlated, and acknowledged, and if needed, it is recorded into an incident or ticket for further management.
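The detect, correlate, and record steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular NMS or ticketing product: the `Event` and `Ticket` fields, the 1 to 5 severity scale, the 5-minute correlation window, and the rule that only groups reaching severity 3 become tickets are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical event record; real NMS/EMS payloads carry many more fields.
@dataclass
class Event:
    source: str            # device, server, or application that raised the event
    message: str
    severity: int          # assumed scale: 1 (informational) to 5 (critical)
    received_at: datetime

@dataclass
class Ticket:
    source: str
    severity: int
    events: list = field(default_factory=list)

def correlate(events, window=timedelta(minutes=5), min_severity=3):
    """Group events from the same source that arrive within `window` of the
    previous event from that source, then open one ticket per group whose
    highest severity reaches `min_severity`."""
    groups = []   # every correlated group, ordered by its first event
    latest = {}   # source -> most recent group for that source
    for ev in sorted(events, key=lambda e: e.received_at):
        group = latest.get(ev.source)
        if group is not None and ev.received_at - group[-1].received_at <= window:
            group.append(ev)              # same burst, still inside the window
        else:
            group = [ev]                  # start a new group for this source
            latest[ev.source] = group
            groups.append(group)

    tickets = []
    for group in groups:
        worst = max(ev.severity for ev in group)
        if worst >= min_severity:         # lower-severity groups are not ticketed
            tickets.append(Ticket(group[0].source, worst, group))
    return tickets
```

Grouping by source and time is the simplest possible correlation rule. A production setup would more likely correlate on topology or service dependencies, which this sketch does not attempt.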
Incident Management
Incident Management is a fundamental process within any NOC, where the IT service management platform or ticketing system is used to address network, system, or application events.
- The NOC engineers create tickets containing event information in various fields, which are then handled and communicated to relevant personnel via email, call, or message for issue resolution.
- Regular updates are provided until the incident is resolved, and these tickets serve as a collective record of all NOC work activities, enabling resource and workflow management reporting.
Problem Management
Problem Management involves all the actions needed to identify the underlying causes of incidents and to request changes that address those causes.
- It is distinct from Incident Management, as it focuses on investigating and discovering the root cause of an incident rather than addressing its immediate impact.
- Usually, Problem Management necessitates more advanced engineering skills to examine the patterns leading up to an incident, scrutinize logs for clues that suggest potential causes of the failure, and develop strategies to prevent future incidents.
- Additionally, the Problem Management service keeps records of problems and temporary solutions to be used by Incident Management personnel.
Capacity Management
Capacity Management oversees the performance, utilization, and capacity of infrastructure components in order to meet the client's service level targets.
- It's important for Capacity Management to address the needs of business capacity, service capacity, and component capacity in order to ensure continued success.
- Senior engineers must regularly review reports and alarm thresholds, considering the desired business outcomes and the impact of utilization on business operations, to make sure that evolving capacity needs are addressed in a timely manner.
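As a rough illustration of the threshold review described above, the following Python sketch classifies components from recent utilization samples. The 70% and 85% thresholds, the function names, and the rule combining average and peak utilization are assumptions made for the example. Real thresholds would be derived from the client's service level targets.

```python
from statistics import mean

# Assumed thresholds (percent of capacity), chosen only for illustration.
WARNING_PCT = 70.0
CRITICAL_PCT = 85.0

def capacity_status(samples, warning=WARNING_PCT, critical=CRITICAL_PCT):
    """Classify one component from its recent utilization samples (0-100).

    The average describes sustained load; the peak catches short bursts
    that an average alone would hide.
    """
    if not samples:
        return "unknown"                  # no data is itself worth reviewing
    avg, peak = mean(samples), max(samples)
    if avg >= critical:
        return "critical"
    if avg >= warning or peak >= critical:
        return "warning"
    return "ok"

def review(report):
    """Return {component: status} for every component that is not 'ok'."""
    statuses = {name: capacity_status(samples) for name, samples in report.items()}
    return {name: status for name, status in statuses.items() if status != "ok"}
```

For example, `review({"wan-uplink": [88, 91, 86], "db-server": [30, 35, 32]})` would flag only the WAN uplink, giving the reviewing engineer a short list to examine rather than the full report.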
Change Management
The objective of Change Management is to minimize the risk associated with alterations made to the supported infrastructure environment.
- This involves identifying anticipated changes and determining how each change should be handled to minimize the impact on the organization.
- The Change Advisory Board is responsible for reviewing and establishing policies for all of these changes. This board helps to mitigate risk by ensuring that all possible effects of the change have been taken into account, and a proper plan with a recovery process is in place.
Generally, processes and controls are focused on three types of changes:
- Standard changes are routine and low-impact, such as resetting passwords.
- Emergency changes are urgent and require immediate attention, such as rerouting network traffic when the primary WAN uplink at a regional office is unstable.
- Normal changes are planned in advance and may include upgrading the operating system on a server cluster, for example. These changes will be managed through a review process to ensure proper planning.
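One way to picture the three change types is as a small classification step in front of the change workflow. The Python sketch below is hypothetical: the `ChangeRequest` fields, the pre-approved catalog, and the handling descriptions are assumptions for illustration, and an organization's own change policy would define the real rules.

```python
from dataclasses import dataclass
from enum import Enum

class ChangeType(Enum):
    STANDARD = "standard"
    EMERGENCY = "emergency"
    NORMAL = "normal"

# Hypothetical catalog of routine, low-impact activities.
PREAPPROVED = {"password_reset"}

@dataclass
class ChangeRequest:
    activity: str
    service_impacting_now: bool = False   # True if instability or an outage is in progress

def classify(req):
    # Urgency is checked first, so an ongoing service impact always wins.
    if req.service_impacting_now:
        return ChangeType.EMERGENCY
    if req.activity in PREAPPROVED:
        return ChangeType.STANDARD
    return ChangeType.NORMAL

# Illustrative handling paths only; not taken from any specific policy.
HANDLING = {
    ChangeType.STANDARD: "execute under pre-approved procedure",
    ChangeType.EMERGENCY: "handle immediately with expedited approval",
    ChangeType.NORMAL: "schedule for Change Advisory Board review",
}
```

With these assumptions, `classify(ChangeRequest("os_upgrade_server_cluster"))` returns `ChangeType.NORMAL`, and `HANDLING` then points the request to the review process.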
What are the challenges of running a NOC?
NOCs often face a number of common challenges that keep them from performing optimally, which we’ve identified and summarized below. The next section addresses each of these with a best practice.
Challenge #1: Overutilized technology staff and high support costs due to the lack of a tiered organizational structure
Many NOCs lack the tiered operational support structure they need to manage their workload efficiently. Such a structure consists of different levels of technicians with varying skills, who are tasked with different responsibilities and work within a defined process framework. Without it, all tasks fall on the same set of staff, leading to overutilization and higher costs. A tiered structure allows the lower-cost Tier 1 team to handle routine activities, freeing up higher-level teams for more advanced support issues.
Challenge #2: Insufficient operational metrics leading to blindness to issues and opportunities
Often, important metrics are not measured or not evaluated regularly. This leads to the early indicators of potential issues going unnoticed, resulting in more resource-intensive problems. A lack of metrics in areas like first-call resolution, percentage of abandoned calls, mean time to restore, and number of tickets and calls handled can be detrimental to the NOC’s efficiency and effectiveness.
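The metrics named above are straightforward to compute once call and incident records are captured consistently. The Python sketch below shows one possible set of definitions. The `Call` and `Incident` record shapes are hypothetical, and the definitions themselves are assumptions (for instance, first-call resolution is measured here against answered calls only), so they should be adjusted to match how your NOC reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Call:
    abandoned: bool                       # caller hung up before being answered
    resolved_on_first_call: bool = False

@dataclass
class Incident:
    opened_at: datetime
    restored_at: Optional[datetime]       # None while service is still down

def noc_metrics(calls, incidents):
    answered = [c for c in calls if not c.abandoned]
    # Only restored incidents contribute to mean time to restore.
    durations = [i.restored_at - i.opened_at for i in incidents if i.restored_at is not None]
    first_call = sum(c.resolved_on_first_call for c in answered)
    return {
        "calls_handled": len(answered),
        "ticket_count": len(incidents),   # includes incidents that are still open
        "abandoned_pct": 100 * (len(calls) - len(answered)) / len(calls) if calls else None,
        "first_call_resolution_pct": 100 * first_call / len(answered) if answered else None,
        "mean_time_to_restore": sum(durations, timedelta()) / len(durations) if durations else None,
    }
```

Returning `None` when there is nothing to measure avoids reporting a misleading 0% for a period with no calls or no restored incidents.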
Challenge #3: High turnover, low morale, and difficulties in hiring, training, and retaining staff due to a lack of staffing strategy
The absence of a staffing strategy can lead to high turnover rates, low morale among staff, and difficulties in attracting, training, and retaining top talent. This can negatively impact the overall functioning of the NOC. A staffing strategy should be based on the overall activity of the NOC, including the volume of calls, emails, and alarms handled, the duration of incidents, and utilization metrics. It should also include benefits, training, and employee growth plans.
Challenge #4: Inconsistent responsiveness to issues or difficulty troubleshooting due to poor/unstandardized process frameworks
When NOCs do not have a standardized and well-defined process framework in place, it becomes difficult for NOC personnel to respond quickly and effectively to issues. This leads to inconsistent responsiveness and difficulties in troubleshooting, which can result in longer resolution times and decreased customer satisfaction. It also causes confusion and frustration among NOC staff, which feeds low morale, high turnover, and difficulties in hiring and training new staff. A process framework provides specific procedures for handling various support situations, drawing on management frameworks such as MOF, FCAPS, and ITIL. The best place to start is with Incident Management, Problem Management, and the service desk.
Challenge #5: A constant state of vulnerability due to a lack of a business continuity plan
Without a business continuity plan, the NOC will be unable to quickly and effectively respond to unexpected disruptions or emergencies, leading to prolonged downtime and decreased productivity. This can result in lost revenue, increased costs, and damage to the company's reputation.
Challenge #6: Recurring problems and an inability to emerge out of a reactive state due to a lack of quality management
NOCs often fail to implement effective processes for tracking and analyzing incidents, and for taking corrective action to prevent future incidents. Without a focus on quality management, the NOC is likely to be in a constant state of firefighting, with staff spending all their time reacting to problems and not enough time proactively addressing underlying issues.
Challenge #7: Lots of data, but little actionable insight due to disparate tools and platforms
NOCs often struggle to consolidate event information from multiple sources into a single view for staff action. Without integration between tools, NOC staff must track multiple screens and manually collect information, leading to missed SLAs, operational inefficiencies, and staff stress.
Challenge #8: Persistent operational problems due to out-of-date documentation and runbooks
Out-of-date or missing documentation hampers the NOC's ability to resolve issues and optimize performance. The root cause of poor documentation is often a lack of resources and expertise to create work instructions and processes, leading to an informal system of knowledge transfer through mentorship.
Challenge #9: Business growth stymied due to a rigid, unscalable NOC
Many NOCs aren’t designed to be scalable; that is, able to handle a growing amount of work as the company grows without compromising the level of service. Typically, business plans include initial funding, sales and marketing, system build-out, operations support, and the business guidance needed to meet the projected growth. What business plans sometimes don’t take into consideration are predictable growth and process planning. The ability to grow or absorb expansion requires careful consideration of staffing, systems and network, tools, process standardization, and training.
Challenge #10: Unreasonably high operational costs
Running a NOC comes with high operational costs, including staffing, training, resources for technology, and ongoing support. Neglecting any of these components can result in difficulty scaling and meeting organizational objectives, leading to unreasonable operational expenses.
What are some NOC best practices?
The following is a list of best practices that address each of the challenges listed in the previous section. All of these best practices are described in greater detail in our other guide.
Best practice #1: Implement a tiered organization/workflow
A tiered organizational structure can help ensure that tasks are handled efficiently and effectively, with the right people doing the right work at the right time. This can help to minimize response times, reduce errors, and ensure that critical issues are addressed quickly. Download our free white paper to learn more about implementing such a structure here.
Best practice #2: Track meaningful operational metrics
Tracking meaningful operational metrics can provide valuable insights into network performance and help identify areas for improvement. Metrics such as mean time to resolution (MTTR), service level agreement (SLA) compliance, and ticket volume can all help inform decision-making and drive continuous improvement. Read our metrics guide here.
Best practice #3: Develop a strategy for hiring, training, and retaining top talent
Hiring the right people is critical to the success of a NOC operation. A well-designed hiring strategy should focus on attracting, training, and retaining top talent, with an emphasis on technical and customer service skills. Read our staffing guide here.
Best practice #4: Implement a standardized framework for process management
A standardized framework for process management can help ensure consistency and efficiency in the NOC operation. This framework should include standard operating procedures (SOPs), workflows, and best practices for managing incidents, changes, and other common tasks. Download our free white paper to learn more about implementing a process management framework here.
Best practice #5: Develop and maintain a business continuity plan
A business continuity plan (BCP) is critical for ensuring that the NOC can continue to operate in the event of a disaster or other disruptive event. The BCP should include procedures for backing up and restoring data, protecting critical infrastructure and ensuring the safety of staff. Download our free white paper for a 5-step strategy you can use to implement a NOC BCP here.
Best practice #6: Develop an effective customer experience management program
A focus on customer experience can help ensure that customers are satisfied with the services provided by the NOC. This can involve regular customer feedback, continuous improvement initiatives, and an emphasis on responsive and proactive customer service. Learn more about our approach to customer experience management here.
Best practice #7: Develop platform integrations and consolidate data for action
By integrating data from various platforms and tools, the NOC can improve decision-making, speed up problem resolution, and increase efficiency. Consolidating data into a single source of truth can help ensure that the right information is available to the right people at the right time. Download our free white paper to learn more about developing platform integrations here.
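At its simplest, consolidation means translating each tool's payload into one shared event shape so staff can work from a single prioritized list. The Python sketch below illustrates the idea. The payload fields for the two tools, the severity mapping, and the function names are all invented for this example and do not correspond to any real product's API.

```python
# Hypothetical payload shapes for two tools; real products define their own fields.
def from_nms(payload):
    return {
        "source": payload["device"],
        "severity": payload["sev"],          # assumed to already use a 1-5 scale
        "summary": payload["alarm_text"],
        "tool": "nms",
    }

def from_apm(payload):
    # Assumed mapping from the APM tool's textual levels to the shared numeric scale.
    levels = {"info": 1, "warning": 3, "critical": 5}
    return {
        "source": payload["service"],
        "severity": levels[payload["level"]],
        "summary": payload["title"],
        "tool": "apm",
    }

ADAPTERS = {"nms": from_nms, "apm": from_apm}

def consolidate(feeds):
    """Merge {tool_name: [payload, ...]} into one list, most severe first."""
    unified = [ADAPTERS[tool](p) for tool, payloads in feeds.items() for p in payloads]
    return sorted(unified, key=lambda event: event["severity"], reverse=True)
```

Keeping one small adapter per tool means adding a new data source only requires a new entry in `ADAPTERS`, while everything downstream keeps working against the shared shape.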
Best practice #8: Create and maintain up-to-date documentation and runbooks
Documentation is critical to the success of a NOC operation. This includes process documentation, technical documentation, runbooks with alarm-to-action guides, and training materials. Proper documentation can help ensure that the NOC runs smoothly and that staff are properly trained and equipped to perform their tasks. Take a look at the “anatomy” of an effective runbook here.
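An alarm-to-action guide can be thought of as a lookup table from alarm name to a documented first response. The Python sketch below is a hypothetical illustration: the alarm names, actions, escalation targets, and review dates are made up, and a real runbook would normally live in a knowledge base rather than in code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunbookEntry:
    first_action: str
    escalate_to: str
    last_reviewed: str   # ISO date (YYYY-MM-DD), which sorts correctly as text

# Hypothetical entries keyed by alarm name.
RUNBOOK = {
    "LINK_DOWN": RunbookEntry(
        "Check interface status and confirm with the carrier", "Tier 2 network", "2024-01-15"
    ),
    "DISK_FULL": RunbookEntry(
        "Identify the largest directories and rotate logs", "Tier 2 systems", "2023-06-01"
    ),
}

# Fallback for alarms that have no documented action yet.
DEFAULT = RunbookEntry("Gather details and escalate", "Tier 2 on-call", "n/a")

def lookup(alarm):
    return RUNBOOK.get(alarm, DEFAULT)

def stale_entries(cutoff):
    """Alarm names whose entry was last reviewed before `cutoff` (ISO date string)."""
    return sorted(name for name, entry in RUNBOOK.items() if entry.last_reviewed < cutoff)
```

Recording a review date with each entry makes it possible to list stale entries on a schedule, which targets the out-of-date documentation problem directly.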
Best practice #9: Design your NOC operation for scalability
As an organization grows, its NOC operation may need to scale to meet new demands. A well-designed NOC operation should be scalable, flexible, and able to adapt to changing requirements. Download our free white paper to learn more about building a scalable support operation here.
Best practice #10: Budget your NOC operation appropriately
A well-funded NOC operation is critical to ensuring that the NOC has the resources it needs to be successful. This includes funding for staff, infrastructure, tools, and ongoing training and development initiatives. A well-designed budget can help ensure that the NOC is well-equipped to deliver high-quality services to customers. Download our free white paper to learn more about budgeting the costs of a NOC here.
What skills are required to run a NOC?
NOC engineers need a variety of skills to keep networks, infrastructures, and applications up and running. Diverse technical knowledge, including knowledge of various network technologies, cloud environments, server operating systems, virtualization, storage systems, and applications, is required to run the modern NOC.
This demand for skilled human resources in a 24x7 environment can pose a considerable and often insurmountable challenge for many organizations, driving many to outsource this function.
In addition to these well-established skills needed in the modern NOC, innovations demand new types of skills. Machine learning and artificial intelligence in particular pose new challenges that don’t always lend themselves to time-tested best practices.
Even many seasoned NOC engineers, for example, haven’t dealt with networks becoming more “aware” of the traffic that runs through them. Developing skills that complement machine intelligence is just one example of the evolving challenges that those who work in the NOC will need to overcome.
Prioritizing NOC design, along with the right training programs and tools, to maximize your team’s capabilities from the start saves a great deal of expensive, labor-intensive work as the NOC comes to life.
Understanding the required skill set early on can help you identify the correct staff to hire. It can also drive the selection of tools to manage the infrastructure over time. In addition, a rigorous, ongoing knowledge management and training program is important to ensure the entire NOC team is up to date on all changes made to the supported infrastructure.