Despite being critical to the success of a technical support operation, many network operations centers (NOCs) fail to meet desired service levels.
Ineffective NOCs consume management and financial resources rather than delivering meaningful ROI by maximizing infrastructure uptime and performance, ultimately leaving business services vulnerable to disruption and downtime.
Most of the time, the root cause of an underperforming NOC is a lack of a centralized operational framework that incorporates best practices and implements them consistently. Such a structure is essential to the success of a NOC, as it makes decisions and actions consistent across the people, processes, and platforms that comprise it.
Without an authoritative operating blueprint, costly inefficiencies are inevitable and serious risks will continue to threaten performance and availability. This problem will only worsen as business services and the technologies they rely on scale in size and complexity.
Here, we explore eleven NOC best practices for keeping even the largest, most complex infrastructure environments up and running at peak performance 24/7.
📄 Download our white paper, "Top 10 Challenges to Running a Successful NOC," for a handy reference that discusses each of these best practices and their corresponding challenges in greater depth.
📄 Read our other white paper, "A Practical Guide to Running an Effective NOC," for a set of actionable steps you can take to put these best practices to use.
Need help putting these best practices into action? Let's talk NOC. Schedule a free NOC consultation and connect with our Solutions Engineers about improving your current support operation or getting exactly the level of third-party support you need.
One of the biggest obstacles to success is organizing your NOC activities and workflows according to your specific technologies and skill levels. When this critical foundational element is missing, NOC teams struggle with inconsistent response times, inefficient resource allocation, and frequent miscommunication.
Once you've cleared this hurdle, however, you'll almost certainly be able to handle events and service requests and resolve incidents at the appropriate tier—and much faster than before. Based on data collected across our NOCs, we found this structure can enable a NOC to resolve 65% to 75% of incidents at the Tier 1 level while reserving Tier 2 and 3 staff for more advanced issues.
Classifying NOC activities is often the first step in implementing a tiered structure. Use the following model for developing your own classification system:
Figure 1 below illustrates a well-organized tiered NOC support structure in action. Here, the Tier 1 team uses monitoring tools and interacts with end-user help desks, as well as Tier 2 and 3 engineers and third parties. Information flows between the various entities within a well-defined process framework.
Having such a structure for properly managing your workflow can prevent your NOC from being overwhelmed by the "wall of red" NOC teams strive to avoid at all costs. In most NOCs, issues should be prioritized and organized into a set of queues, so the appropriate group can handle each of them.
The structure should include specialized teams like Advanced Incident Management (AIM) or Critical Incident Management Team at the entry point of the workflow to ensure incidents are properly assessed, prioritized, and routed from the start. This upstream positioning of specialized teams is a key differentiator in high-performing NOCs, as it prevents improper handling of incidents from the moment they enter the system.
For maximum effectiveness, this structure should be supported by a well-designed platform that automates much of the initial analysis and routing, further enhancing the efficiency of your tiered approach. Download our white paper, "Top 10 Challenges to Running a Successful NOC," for a set of example workflow queues you can use to break up issues and assign them to groups based on skillset.
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
Modern NOC tools make it easy to generate metrics, but tracking meaningful metrics takes diligent work. Metrics are essential for continuous improvement from both a technical and motivational standpoint—helping teams recognize successes and keep morale high.
Anyone who works in a NOC likely hears things like, "We're always busy," or "I feel like we can never catch up," or "My coworkers are not pulling their weight." These sentiments are understandable given the fast-paced environment of a NOC and the constant multitasking that is required of those who work in it.
To ensure accomplishments are recognized, it's important to set performance objectives and evaluate them on a daily, weekly, and monthly basis. Since the amount of data available to a NOC is daunting, choose the most applicable and actionable metrics for your specific operation. These should reflect the size and scale of your operation and the KPIs that measure performance against relevant organizational objectives.
A few critical KPIs every NOC should be tracking include:
| KPI | What it measures |
| --- | --- |
| First-call resolution rate | The percentage of issues resolved during the first call or interaction. |
| Percentage of abandoned calls | How many callers hang up before reaching support. |
| Time to Notify (TTN) | How quickly you alert clients to issues. |
| Time to Impact Assessment (TTIA) | How quickly you can determine what's affected. |
| Mean Time to Restore (MTTR) | How long it takes to bring services back online. |
| Number of tickets and calls handled | Volume metrics by time period. |
| First-Level Resolution (FLR) rate | Percentage of issues resolved at Tier 1. |
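As a rough illustration of how two of these KPIs could be derived from ticket data, here is a minimal Python sketch. The `Ticket` record and its fields (`opened`, `restored`, `resolved_tier`) are assumptions made for this example and do not reflect any particular ticketing system's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Ticket:
    # Hypothetical record layout, assumed for illustration only.
    opened: datetime      # when the incident was detected
    restored: datetime    # when service was restored
    resolved_tier: int    # tier that resolved the ticket (1, 2, or 3)


def first_level_resolution_rate(tickets):
    """Percentage of tickets resolved at Tier 1."""
    if not tickets:
        return 0.0
    tier1 = sum(1 for t in tickets if t.resolved_tier == 1)
    return 100.0 * tier1 / len(tickets)


def mean_time_to_restore(tickets):
    """Average time from detection to service restoration."""
    if not tickets:
        return timedelta(0)
    seconds = mean((t.restored - t.opened).total_seconds() for t in tickets)
    return timedelta(seconds=seconds)


start = datetime(2024, 1, 1, 9, 0)
sample = [
    Ticket(start, start + timedelta(minutes=30), 1),
    Ticket(start, start + timedelta(minutes=60), 1),
    Ticket(start, start + timedelta(minutes=90), 2),
    Ticket(start, start + timedelta(minutes=60), 1),
]
flr = first_level_resolution_rate(sample)
mttr = mean_time_to_restore(sample)
print(f"FLR: {flr:.0f}%  MTTR: {mttr}")
```

With the four sample tickets above, three are closed at Tier 1 and the restore times average one hour. In practice these figures would be computed per customer and per reporting period.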
Aside from KPIs, there's another category of metrics that is often a complete blind spot for many NOCs: utilization metrics.
These metrics reveal when and why the NOC is or isn't busy—and what it's busy with—so staffing levels can be fine-tuned for peak efficiency. They help answer questions like:
Without these utilization metrics, NOCs often end up either understaffed (leading to burnout) or overstaffed (leading to excessive costs). They also lack the data needed to identify process improvements that could significantly enhance efficiency.
A comprehensive reporting system should go beyond standard SLA compliance metrics to provide deeper operational insights. This could include:
Read our post for a deep dive on these metrics: NOC Performance Metrics: How to Measure and Optimize Your Operation.
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
Running a 24x7 NOC requires staffing three shifts a day, 365 days a year. Consider the following factors when developing a staffing strategy:
Effectively staffing a 24x7 NOC starts with a well-organized structure. The tiered NOC support structure and workflow queues discussed in our first best practice are good starting points for determining the skill level required of your NOC staff.
A skills-based NOC structure should not only support your operational requirements but also provide clear growth paths for employees. This career progression framework is crucial for retention, allowing team members to visualize their future within the organization and develop professionally.
📄 Download our white paper, "Top 10 Challenges to Running a Successful NOC," for an example of a skills-based NOC structure that can support 24x7 NOC requirements and provide a growth plan to maximize employee retention.
Consider the overall activity of your NOC, including the volume of calls, emails, and alarms handled by hour of day, day of week, and type of support engineer, as well as the duration of incidents. This data is crucial for developing staffing models that align with actual workload patterns rather than theoretical assumptions.
With this data in hand, you can identify peak times requiring additional staff and slower periods where cross-training opportunities might exist. This approach ensures you have the right people with the right skills at the right times.
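One simple way to find those peaks is to bucket event timestamps by hour of day. The sketch below assumes you can export a list of timestamps for calls, emails, and alarms; the function names and the 1.5x threshold are illustrative choices, not a standard.

```python
from collections import Counter
from datetime import datetime


def volume_by_hour(timestamps):
    """Count events for each hour of the day (0-23)."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def peak_hours(hourly_counts, threshold_ratio=1.5):
    """Hours whose volume exceeds threshold_ratio times the average hour."""
    average = sum(hourly_counts) / len(hourly_counts)
    return [h for h, c in enumerate(hourly_counts) if c > threshold_ratio * average]


# Made-up sample data: a morning spike, a smaller afternoon bump,
# and one background event in every hour.
events = (
    [datetime(2024, 1, 1, 9, m) for m in range(0, 60, 5)]
    + [datetime(2024, 1, 1, 14, m) for m in range(0, 60, 10)]
    + [datetime(2024, 1, 1, h, 0) for h in range(24)]
)
hourly = volume_by_hour(events)
peaks = peak_hours(hourly)
print(peaks)
```

The same grouping can be repeated by day of week or by engineer type to build a fuller picture of workload.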
Consider the benefits that your company provides for employees in the context of the needs of your operation to ensure understaffing isn't a risk. For example, if your company provides 10 holidays and four weeks of PTO per employee, these hours need to be accounted for to ensure that your NOC runs smoothly.
For a 24/7 NOC, this calculation is especially critical, as any absence must be covered to maintain continuous operation.
A good rule of thumb: for every position that needs to be staffed 24/7/365, you typically need between 4.2 and 5.0 FTEs when accounting for training, PTO, sick time, and inevitable turnover.
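The arithmetic behind that rule of thumb can be sketched as follows. The 40-hour week and 8-hour day are assumptions; the 10 holidays and four weeks of PTO come from the example above. Training, sick time, and turnover would push the result higher.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of coverage needed per 24/7 seat


def ftes_per_seat(holidays=10, pto_weeks=4, hours_per_week=40,
                  hours_per_day=8, other_absence_hours=0):
    """Estimate FTEs needed to keep one seat staffed around the clock.

    Assumes a standard 52-week schedule; all defaults are illustrative.
    """
    scheduled = 52 * hours_per_week
    available = (scheduled
                 - holidays * hours_per_day
                 - pto_weeks * hours_per_week
                 - other_absence_hours)
    return HOURS_PER_YEAR / available


baseline = ftes_per_seat(holidays=0, pto_weeks=0)  # no absences at all
with_benefits = ftes_per_seat()                    # 10 holidays + 4 weeks PTO
print(round(baseline, 2), round(with_benefits, 2))
```

With no absences, one seat needs about 4.21 FTEs; with the holidays and PTO from the example, it needs about 4.76, which sits inside the 4.2 to 5.0 range.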
A NOC training program should cover initial onboarding as well as ongoing training. A truly comprehensive training program can take up to six months of various classes and on-the-job instruction before an engineer is ready to take on NOC support responsibilities.
The training program should include (at a minimum):
After work has begun, monthly or quarterly training sessions should be scheduled to keep engineers' skills fresh and to update the support team on new types of services, new customer requirements, and new equipment.
Based on your historical data and industry standards, a certain attrition rate within the NOC should be taken into account. Factors that affect retention rates include company culture and NOC organization (i.e., whether there's a clear path for employee growth from one level to the next or to other departments within the organization).
By making these calculations, you can better plan for staffing and training needs. For example, assuming that a typical engineer works five years in your NOC (a retention rate of 80%), you'd need to hire an additional 20% of staff each year.
For comparison, offshore NOC operations often experience significantly higher turnover rates—sometimes 40-50% annually—which creates continual knowledge gaps and inconsistent service quality.
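The relationship between average tenure, attrition, and replacement hiring can be sketched in a few lines. This assumes a steady-state team where annual attrition is roughly one divided by average tenure in years, which is a simplification; the function names are hypothetical.

```python
import math


def annual_attrition_rate(avg_tenure_years):
    """Approximate yearly attrition, assuming a steady-state team."""
    return 1.0 / avg_tenure_years


def replacement_hires_per_year(headcount, avg_tenure_years):
    """Hires needed each year just to hold headcount constant (rounded up)."""
    return math.ceil(headcount / avg_tenure_years)


# A 25-engineer NOC with five-year average tenure (80% retention).
in_house = replacement_hires_per_year(25, 5)
# The same team at roughly 50% annual turnover (two-year average tenure).
high_turnover = replacement_hires_per_year(25, 2)
print(in_house, high_turnover)
```

For the same 25-person team, five-year tenure means about 5 replacement hires a year, while two-year tenure means about 13, each of whom needs the onboarding described above.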
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
📄 Read our post for a deeper discussion on staffing a NOC: Staffing a 24x7 NOC: Costs, Challenges, and Key Considerations
Inconsistency is one of the main reasons NOCs don't perform at optimal levels. Being reliably consistent requires a standardized process framework that arms your NOC with specific procedures for handling various support situations.
There are several process and management frameworks to choose from, including MOF, FCAPS, and ITIL. The ITIL (IT Infrastructure Library) service framework, in particular, has become immensely popular because it's useful in achieving ISO 20000 certification and provides its own set of best practices to follow when delivering technology support services. It also offers the flexibility to include your organization's custom procedures under its umbrella of lifecycle stages.
The core processes that should be standardized in an ITIL-aligned NOC include:
Process frameworks can be overwhelming when considered in their entirety. When aligning with one of these standards, we recommend first tackling the specific areas that are your organization's biggest challenge. Typically, these are incident management, problem management, and the service desk. Once these functions are standardized, you can move on to other priority areas, such as change management and service continuity management.
It's critical to get your whole organization involved in implementing the process framework and in ongoing education. Training is essential to get all staff talking the same language and following the same guidelines. Comprehensive information and training are available for ITIL, ISO 20000, FCAPS, and other methodologies.
A standardized framework also facilitates:
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
A business continuity plan (BCP) is essential for managing risk in your NOC operations—a fact made very clear by the COVID-19 pandemic, which exposed serious shortcomings in many firms' readiness.
📄 Download our white paper, "A 5-Step Strategy for NOC Business Continuity Planning in Response to COVID-19," for a comprehensive guide to ensuring your BCP is prepared for such a disruption.
The BCP provides a blueprint for NOC staff and management to follow when recovering from a disaster or other adverse situation. When properly executed, it ensures that operations recover quickly and effectively, so any negative impact on the business is minimized.
The components of an effective NOC BCP
A comprehensive BCP should include multiple layers of redundancy and contingency plans:
Infrastructure redundancy
Operational redundancy
Technical redundancy
Without an effective BCP in place, your NOC will almost certainly remain vulnerable to the following problems if a disaster or significant workforce disruption affects your operation:
Key representatives from a cross-section of the organization need to be involved in creating a BCP. This may include outside vendors. Whether you're developing one from scratch or want to evaluate an existing BCP, make sure it contains the following at the bare minimum:
Your BCP must be readily accessible to the management team at all times and should be rehearsed at least quarterly with regular audits for possible improvements. Testing should include failover of all critical assets to ensure that the failure of a single asset or multiple assets cannot cause a prolonged outage.
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
NOC teams must measure service quality and provide quality assurance continuously or risk damaging customer satisfaction and compromising the NOC's reputation. Effectively and consistently executing a runbook (i.e., processes and procedures) is paramount to meeting a NOC's service level requirements.
The detailed monitoring of key network and IT assets and services is core to meeting these objectives. This monitoring, data collection, and correlation—typically accomplished using a variety of protocols and tools—is the entry point into incident handling and problem management processes.
Other data sources include calls and emails, among others. The NOC runbook, created during onboarding and updated regularly, determines how the team acts on that incoming data. Documenting agreed-upon processes and procedures for the specific customer environment provides the NOC team with an essential operational reference.
A good quality control program monitors and measures primary aspects of your NOC service via its KPIs. These KPIs provide much-needed visibility into NOC support activity, responsiveness, and effectiveness. NOC management can use this information to ensure, for instance, that stated objectives for event-to-action times and first-level incident resolution are being met for each customer.
Quality control also detects chronic issues so management can find appropriate solutions—for example, correcting relevant runbook procedures, ensuring complete documentation is available to the NOC, or providing additional staff training.
A monthly audit of a subset—say, 10%—of all tickets created is an important part of an ongoing review. Staff mentoring is also key to quality control and helps ensure high levels of customer satisfaction.
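Drawing that audit subset at random helps keep the review unbiased. Here is a minimal sketch, assuming ticket IDs are available as a list; the `audit_sample` helper and the ID format are made up for illustration.

```python
import math
import random


def audit_sample(ticket_ids, percent=10, seed=None):
    """Pick a random subset of tickets for the monthly audit.

    Rounds up so that small months still get at least one ticket reviewed.
    """
    ticket_ids = list(ticket_ids)
    if not ticket_ids:
        return []
    size = math.ceil(len(ticket_ids) * percent / 100)
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(ticket_ids, size)


month_tickets = [f"INC-{n:05d}" for n in range(1, 251)]  # 250 hypothetical IDs
to_review = audit_sample(month_tickets, percent=10, seed=7)
print(len(to_review))
```

Recording the seed alongside the audit lets a reviewer reproduce exactly which tickets were selected.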
A quality or service assurance program enables your NOC to identify and resolve problems before they significantly impact customers or the business.
A quality assurance review begins when a customer reports dissatisfaction with any aspect of the NOC service. NOC management follows up with an internal review of the service, evaluating responsiveness metrics, adherence to runbook procedures, customer interaction, and technical troubleshooting, to name a few.
Such quantitative and qualitative measures and the resulting feedback lower the chance of the same problem recurring. Monthly and quarterly reviews of the service with stakeholders ensure that customer expectations remain met.
One of the most common challenges in NOC operations is the disconnect between SLO compliance and customer satisfaction. A NOC might be meeting all its documented service level objectives while customers remain unhappy with the service.
This disconnect often stems from:
An effective customer experience management program addresses these gaps by:
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
NOCs operating at peak efficiency can receive and process alarm or event information from multiple sources and present it in a consolidated view for staff to act on. This consolidated view is commonly called a "single pane of glass."
Most NOCs need to bring voice, email, text, customer portals, knowledge bases, documentation, and workflow management tools into the NOC—each potentially with its own platform. Without proper integrations connecting these tools and platforms, NOC personnel are faced with tracking and managing multiple screens for event information, manually collecting information from multiple sources for documentation, notification, and escalation, and then attempting to manage workflow toward service restoration.
This makes monitoring and reporting on SLA metrics nearly impossible, let alone optimizing performance. The results inevitably include operational inefficiencies, missed SLAs, and undue stress on staff.
The core components of an integrated NOC platform
A comprehensive NOC platform should integrate the following components. Read about our own platform for a deeper dive.
Alarm Monitoring
AIOps Engine
Ticketing System
Communication Systems
CMDB and Knowledge Management
Reporting and Analytics
A well-maintained Configuration Management Database (CMDB) is the heart of an effective NOC platform. It enables:
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
Documentation is essential to a NOC's ability to function well over the long term. This process includes building runbooks, documenting workflow processes, creating structured databases for storing and retrieving information, and recording business results for analysis and optimization.
Too often, however, services are added, or changes are made without proper documentation to support them. This limits the ability of the NOC to resolve an issue when it arises—wasting time and creating avoidable risks.
Poor documentation often stems from a lack of resources and the expertise required to map out processes and create work instructions and documents. Instead, key people simply "know what to do" and new staff learn by "seeing and doing" alongside an experienced mentor.
NOC teams also often overlook performance metrics that can be obtained from network and monitoring systems, ticketing systems, and back-office tools. These metrics are critical for analyzing performance, predicting failure, and laying the groundwork for ongoing quality control and process improvement.
Without an understanding of alarm activity, ticket activity, and common causes for outages and trends, management is limited to responses that are reactive and tactical, rather than proactive and strategic.
Essential documentation for NOC operations
A comprehensive documentation strategy should include:
Runbooks
Knowledge Base
Network and System Documentation
Process Documentation
Performance Metrics and Reporting
Beginning with the service catalog, document the tools and procedures needed to deliver NOC services successfully. Technical writers can often be invaluable in this process!
Documentation is only valuable if it remains accurate and up-to-date. Implement processes to ensure documentation stays current:
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
A NOC's scalability is a measure of its ability to handle a growing amount of work without compromising the level of service. Typically, business plans include initial funding, sales and marketing, system build-out, operations support, and the business guidance needed to meet the projected growth. What business plans sometimes don't consider is predictable growth and process planning.
Often, for example, sales for a young company take off, with key managers focused on new clients and getting technical services delivered to meet service launch dates. The same technical and operations resources are then tasked with the ongoing support of these services—severely impeding the organization's ability to manage its growth. The result is predictable: customer dissatisfaction.
The ability to grow or absorb expansion requires careful consideration of the following factors:
Staffing
It's very important to measure the staff utilization percentage derived from various NOC activities (described in our second best practice). Keeping this below 80% enables your NOC to absorb growth while allowing enough lead time for recruiting additional resources. Scalable staffing models include:
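To make the 80% utilization guideline concrete, here is a small sketch. It assumes utilization is measured as busy hours divided by staffed hours over a period, which is one common definition but not the only one; the function names are illustrative.

```python
def utilization_pct(busy_hours, staffed_hours):
    """Share of staffed time spent on NOC activities, as a percentage."""
    return 100.0 * busy_hours / staffed_hours


def headroom_hours(busy_hours, staffed_hours, target_pct=80.0):
    """Extra busy hours the team can absorb before reaching the target.

    A negative result means the team is already over the target.
    """
    return staffed_hours * target_pct / 100.0 - busy_hours


# Example week: 10 engineers x 40 hours staffed, 290 hours of logged activity.
current = utilization_pct(290, 400)
spare = headroom_hours(290, 400)
print(current, spare)
```

In this example the team runs at 72.5% and can take on about 30 more hours of weekly work before crossing 80%, which is the signal to start recruiting.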
Systems and network
A distributed redundant architecture allows for systems to grow and expand. The ability to easily deploy additional server resources enables you to handle sudden spikes in growth. The performance of the systems and network (bandwidth, CPU, memory, etc.) needs to be monitored closely to make sure there is enough capacity to handle growth. Scalable system architecture should include:
Tools
Tools used by the NOC (e.g., monitoring tools, ticketing systems, knowledge base) to deliver the service must have additional capacity built into them to handle the projected growth. It's not uncommon for tool performance to suffer dramatically if tools aren't designed for growth, resulting in service-level degradation and a loss in productivity. Consider these factors for scalable tooling:
Process standardization and training
A consistent process framework and methodology for delivering high-quality service is one of the key features of a scalable NOC. Management should choose and adopt a process standard that fits their product and industry needs. NOC staff can then be trained to follow the established company standards. Scalable process implementation includes:
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
There are several components that make up the cost of running a 24/7 NOC. When budgeted appropriately, these items combine into a powerful investment that has the potential to deliver a value that far exceeds its cost.
Staff
The staff required to support a 24x7 NOC include not only front-line engineers, but also back-end support groups such as systems and network engineering, service transition, human resources, and customer advocacy. To adequately budget for staffing, consider:
Training
Resources need to be allocated for training NOC staff when they are initially hired, when onboarding new customers, and whenever changes are made to existing support or new technologies are introduced. Training budget considerations should include:
Quality assurance
An objective quality assurance program is needed to address customer concerns and maintain service-level agreements. Quality assurance budget items typically include:
Systems, networking, and security
Systems, network connectivity, and security controls need to be deployed in either data centers or the cloud to house the various tools and applications required by the NOC to operate. Resources for ongoing support need to be included. Make sure to budget for:
Software licensing
A NOC requires various tools for monitoring, troubleshooting, and resolving issues. These include network and element management systems (NMS/EMS), trouble ticketing systems, knowledge bases, portals, and configuration management databases (CMDBs). A few software budgeting considerations not to ignore:
Infrastructure and facilities
A physical NOC must be designed and maintained to enable smooth workflow and communication among staff. Redundancy and business continuity are essential to mitigate risk. Facility budget items include:
All of these components present a formidable operating expense but have to be considered in building a successful NOC. Too often, NOCs are built considering only a subset of the above components, and as a result, they struggle to scale and deliver on the required service and financial objectives of the organization.
For many organizations, the total cost of ownership (TCO) of building and maintaining an in-house NOC is prohibitively expensive. Outsourcing NOC services to a specialized provider can often reduce TCO by 50% or more while providing access to mature operations and advanced capabilities that would take years to develop internally.
When evaluating the build vs. buy decision, consider not just the direct costs but also:
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
For years, top-tier NOCs have applied automation to repetitive, low-risk tasks that distract technical specialists from more important (and frankly more exciting) work. Only more recently, however, have NOCs started arming themselves with vastly better data processing and machine learning power to augment and replace more complex manual tasks traditionally handled by humans.
Perhaps the most impactful recent advancement is AI-driven event correlation. NOCs can now let machines correlate event data much faster than humans ever could and identify the subtle indicators of approaching issues within a torrent of otherwise noisy data. The outcome can be measured in significantly faster and more proactive response rates—and thus, happier customers and end-users.
Here are the major ways we've implemented AIOps within our NOC to unlock dramatically better efficiency and performance:
Event monitoring and management
AIOps can aggregate data from multiple data sources and multiple technology areas across the entire enterprise and provide a central data collection point. It can then analyze this data quickly and accurately to determine when multiple signals across multiple areas indicate a single issue. The resulting reduction in alert noise brings into focus those alerts that require action and helps reduce Time to Impact Assessment (TTIA) and, in turn, Mean Time to Restore (MTTR). Specific benefits include:
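The idea of collapsing many signals into one issue can be illustrated with a deliberately simple, rule-based stand-in for ML-driven correlation: group alarms that refer to the same configuration item and arrive close together in time. This is not how any specific AIOps engine works; the `Alarm` fields, the CI names, and the five-minute window are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    source: str       # monitoring tool that raised the alarm
    ci: str           # configuration item the alarm refers to
    raised: datetime


def correlate(alarms, window=timedelta(minutes=5)):
    """Group alarms on the same CI that arrive within `window` of the
    previous alarm in that CI's most recent group."""
    groups = []
    latest_group = {}  # ci -> most recent group for that CI
    for alarm in sorted(alarms, key=lambda a: a.raised):
        group = latest_group.get(alarm.ci)
        if group and alarm.raised - group[-1].raised <= window:
            group.append(alarm)
        else:
            group = [alarm]
            groups.append(group)
            latest_group[alarm.ci] = group
    return groups


t0 = datetime(2024, 1, 1, 3, 0)
alarms = [
    Alarm("snmp", "router-1", t0),
    Alarm("syslog", "router-1", t0 + timedelta(minutes=2)),
    Alarm("ping", "router-1", t0 + timedelta(minutes=4)),
    Alarm("snmp", "switch-7", t0 + timedelta(minutes=1)),
    Alarm("snmp", "router-1", t0 + timedelta(minutes=30)),
]
groups = correlate(alarms)
print([len(g) for g in groups])
```

Five raw alarms become three candidate incidents here. A production engine would also use topology, dependency data, and learned patterns, which this sketch leaves out.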
Incident management
AIOps can feed analysis into the Incident Management process by autonomously surfacing the probable cause and allowing the NOC engineer to confirm that the analysis and data are sound before implementing a resolution plan. The result? Dramatically faster incident analysis. AIOps enhances incident management through:
Auto-resolution of short-duration incidents
One particularly powerful capability is the automatic resolution of transient issues. When alarms clear quickly after triggering, the system can perform automatic checks and resolve the ticket without human intervention, while still documenting what happened for future analysis. This automation includes safety mechanisms that prevent auto-resolution after multiple rapid recurrences, ensuring that flapping services get proper attention.
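The decision logic described above might look something like the following sketch. The thresholds (a five-minute maximum duration and three recent recurrences) and the function name are assumptions chosen for illustration, not values taken from any real platform.

```python
from datetime import datetime, timedelta


def should_auto_resolve(raised, cleared, recent_recurrences,
                        max_duration=timedelta(minutes=5),
                        max_recurrences=3):
    """Decide whether a ticket can be closed without human review.

    raised / cleared: alarm timestamps (cleared is None if still active).
    recent_recurrences: how many times this alarm has fired recently.
    """
    if cleared is None:
        return False  # alarm is still active
    if cleared - raised > max_duration:
        return False  # not a short-duration event
    if recent_recurrences >= max_recurrences:
        return False  # flapping: route to an engineer instead
    return True


t0 = datetime(2024, 1, 1, 12, 0)
transient = should_auto_resolve(t0, t0 + timedelta(minutes=2), recent_recurrences=0)
flapping = should_auto_resolve(t0, t0 + timedelta(minutes=2), recent_recurrences=4)
print(transient, flapping)
```

Whatever the outcome, the alarm and the decision should still be written to the ticket so the history is available for later problem management.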
Problem management
While the goal of Incident Management is to restore service quickly, Problem Management determines the root cause and finds a permanent solution to avoid the same incident in the future. Root cause determination is typically resource-intensive, requiring hours of event and log data analysis. AIOps transforms problem management by:
Change management
Maintenance events are common in the NOC. An effective AIOps implementation allows automatic suppression of alarms when an infrastructure or application maintenance event is recorded. Automation will then only create tickets if appropriate after a maintenance window has been completed. AIOps enhances change management through:
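A minimal sketch of maintenance-window suppression, assuming each recorded maintenance event names a configuration item and a start and end time. The `MaintenanceWindow` type and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MaintenanceWindow:
    ci: str           # configuration item under maintenance
    start: datetime
    end: datetime


def is_suppressed(ci, when, windows):
    """True if `ci` is inside a recorded maintenance window at time `when`."""
    return any(w.ci == ci and w.start <= when <= w.end for w in windows)


def should_ticket(ci, still_active, now, windows):
    """Create a ticket only for alarms that are active outside any window."""
    return still_active and not is_suppressed(ci, now, windows)


windows = [MaintenanceWindow("core-rtr-1",
                             datetime(2024, 1, 1, 2, 0),
                             datetime(2024, 1, 1, 4, 0))]

during = should_ticket("core-rtr-1", True, datetime(2024, 1, 1, 3, 0), windows)
after = should_ticket("core-rtr-1", True, datetime(2024, 1, 1, 4, 30), windows)
print(during, after)
```

An alarm raised mid-window is held back, while one that is still active after the window closes produces a ticket, since it likely points to a change that did not go as planned.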
Here are just a few snapshots of specific results we've been achieving for our clients:
These statistics represent not just efficiency gains but real business value in terms of reduced downtime, improved service quality, and better utilization of skilled resources.
As machine learning models continue to evolve and improve with more data, we anticipate even greater capabilities in the future:
This combination of automation and machine learning brings the power and promise to genuinely transform how IT operations teams organize and operate. And as time goes on, automation will steadily continue to replace even more manual activities better suited for machines.
Given the complexity of integrating machine learning and automation into a NOC operation, there are no convenient action items to point to here. Such an undertaking, while transformative, requires a ton of highly specialized work and significant investment, more than it would make sense for most teams to take on internally.
Here at INOC, our clients simply inherit the powerful AIOps capabilities we've spent years and many resources building and refining. The core of our alarm and event management system is our AIOps engine, which utilizes machine learning to automate low-risk tasks and extract actionable insights from the vast amounts of data gathered across clients' supported environments.
Our AIOps tools correlate alarms from multiple sources, perform deep inspection of those alarms, and enrich them with additional metadata from our CMDB to expedite informed action. Importantly, you can seamlessly integrate your existing infrastructure monitoring system with our AIOps toolset to retain the systems you already use while further maximizing those investments.
This integration feeds alarms from your NMSs (or ours if needed) into our platform, streamlining alarm correlation, enrichment, and automatic ticket creation. After a ticket is generated, our platform automatically identifies and attaches CIs from our CMDB, giving NOC engineers clear direction for investigation before any human intervention. The platform also supplies relevant knowledge articles and runbooks, facilitating fast, accurate diagnosis and action plan development.
This best practice addresses the following problem indicators. Talk to us to explore a NOC solution if you're experiencing any of them:
While it's easy to talk about best practices, it's another thing entirely to bring those practices to life within your organization. Success requires careful planning and care, which is why expertise is so critical at the outset of building or optimizing your NOC.
Here at INOC, we help organizations with these critical needs through award-winning outsourced NOC support (sometimes referred to as NOC as a Service) and NOC operations consulting services.
Our NOCs monitor tens of thousands of infrastructure elements around the clock. High-level NOC management expertise and custom-built systems ensure you and your customers achieve the infrastructure performance and availability needed to grow and thrive no matter how your IT environment evolves or what new challenges arise. By following an operational methodology that utilizes a tiered support structure in full alignment with the ITIL framework, our NOC can rapidly respond to incidents and events and continue to implement changes as needed, all under a more cost-effective service model.
Our service provides:
We also deliver comprehensive best practices consulting for designing and building new NOCs and helping existing NOCs significantly improve the support provided to you and your customers. Our approach to high-quality support aligns and integrates each function of NOC support operations to enable more informed, consistent decision-making in line with the ITIL framework.
Our consulting offers:
Want to learn how to put these best practices to use in your NOC? Contact us or schedule a free NOC consultation with our Solutions Engineers to see how we can help you improve your IT service strategy and NOC support, or download our free white paper below.