THE ROOT CAUSE OF ALL EVIL
Automatic fault correlation in action makes for quite an impressive demonstration.
As if out of nowhere, a torrent of minor, major and critical network alarms floods the monitors of an operations center, freezing technicians and turning the heads of supervisors. Behind the scenes, directors are paged and top account managers excuse themselves from conference calls. Printers puke. Bells gong in remote offices as technicians look up from their overnight trunk reports (or their fantasy football results).
Then, like the wind dying down from a passing tornado, alarms begin to clear as they scroll up instead of down and turn green instead of red. And it's over—nothing left but the repairs.
One's first thought upon watching a network almost instantaneously diagnose itself is that artificial intelligence really works. That would be wrong, of course, because there is nothing at all artificial about it. It's all thresholds and hard-won experience coded into software. One's second thought is that it would take a group of technicians hours to sift through the alarms and alerts flying at them in the form of cryptic, color-coded messages from thousands of network elements.
The truth is, it would take years. In fact, it already has. According to John Chandler, director of national network service assurance at Verizon, it took several years after the first digital switches were installed to get the correlation of alarm streams to the point where the results could be trusted and were most useful to operations center technicians. And that was when all that network operators had to watch were switches, trunks, power, a bit of signaling and a few peripherals. They also had to be on the lookout for some open-door alarms, and maybe some temperature spikes.
“You learn as you go,” Chandler said. “Correlation is only as good as the database you use, and it takes time for technicians to work with vendors to tell them what we want to see and how we want to see it.”
Typically, that meant waiting for something to break and analyzing the messages that were generated—then applying thresholds to some, inhibiting others and raising or lowering the severity levels based upon how critical the event was determined to be by the carrier, not the designer.
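To make that tuning concrete, here is a minimal sketch in Python of the three adjustments just described: thresholds, inhibits and carrier-defined severity levels. Everything in it is an assumption made for illustration. The alarm codes, the rule tables and the correlate function are hypothetical and are not taken from any carrier's or vendor's system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    element: str   # network element that raised the alarm
    code: str      # vendor message code (hypothetical names below)
    severity: str  # "minor", "major" or "critical" as sent by the element


# Hypothetical carrier-defined rules, standing in for years of tuning.
THRESHOLDS = {"FRAME_OUT_OF_SEQ": 3}             # report only on the 3rd occurrence
INHIBITED = {"FAN_SPEED_INFO"}                   # never shown to technicians
SEVERITY_OVERRIDES = {"POWER_FAIL": "critical"}  # the carrier's view, not the designer's


def correlate(alarms):
    """Reduce a raw alarm stream to the alarms a technician should see."""
    counts = Counter()
    reported = []
    for alarm in alarms:
        if alarm.code in INHIBITED:
            continue  # suppressed outright
        key = (alarm.element, alarm.code)
        counts[key] += 1
        # Report exactly once, at the moment the count reaches the threshold.
        if counts[key] != THRESHOLDS.get(alarm.code, 1):
            continue
        severity = SEVERITY_OVERRIDES.get(alarm.code, alarm.severity)
        reported.append(Alarm(alarm.element, alarm.code, severity))
    return reported
```

Under these assumed rules, five frame-out-of-sequence messages from one router collapse into a single reported alarm, a fan-speed notice is dropped, and a power failure sent as major is shown as critical. A production system would also need time windows and alarm clearing, which this sketch leaves out.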
Eventually, carriers and vendors tweaked their management systems as much as they could.
“For the most part, the systems we use do more than an adequate job,” said Sid Conley, director of network planning and engineering for BellSouth. “I don't know of any additional enhancements that would drive quicker trouble resolution.”
Unfortunately, at least for those technicians ensconced in their comfort zones, the network continues to change.
Demand for new services flooded the network with new technologies: Sonet, dense wavelength division multiplexing, ATM, frame relay, xDSL, E-911 and IP—even voice mail. With that came new gear: gateways, hubs, concentrators, controllers, add/drop muxes, core and edge routers, application servers and programmable cross-connects.
More network elements meant more alarms and more points of failure, each one capable of generating thousands of alarms for nothing more than a hiccup. What's more, the messages were strange, illogical and had little to do with up or down, crossed or shorted, synchronized or not. A frame-out-of-sequence message meant squat to a switchman. And 500 of them meant 500 times squat.
“It's a new world and a new type of alarm,” Chandler said. “We have told the vendors it is mandatory to correlate these alarms. They have to be able to tell us whether we have a traffic problem or if it's just an impairment, and what condition it is in.”
Vendors have responded with sophisticated software systems, which they say can perform real-time automatic fault correlation as well as root cause analysis—determining the actual first cause of a failure. The fact that two of the nation's four RBOCs have never heard of software tools for doing root cause analysis says the marketing teams have a thing or two to learn themselves. So we look at the realities of fault correlation.
Still, the best tool for identifying trouble in the network and determining its root cause is a well-trained technician. “It's hard to replace a 20-year veteran,” Chandler said. “But if you do lose one—and in these times we're losing a lot of them to retirement—alarm correlation is almost a necessity for the people that only have a couple of years of service.”
The younger the work force, the more correlation you need, Chandler said. “But it takes the old timers to get the correlation right.”
That's not a comment on the capabilities of a new generation of network troubleshooters; it's one of the realities of a multi-technology network. “People graduating from college with technical degrees are not learning what we did 40 years ago,” Chandler said. “Theirs is a whole new world, and a switch guy is pretty much a switch guy.”
To make sense of all the messages generated by network elements, both equipment makers and network management software providers must go back to their drawing boards.
“In the old world of the digital switch, they didn't tell me I had ones and zeros—they told me I had an alarm and what it meant,” Chandler said. “In the packet world, I don't want to know I have three packets instead of ten. They have to tell me what that means.”
One problem with correlating alarms in a packet world—a mixed blessing, really—is that thanks in part to OSMINE and the carriers' own strict performance criteria, most new equipment works quite well. If it takes years of real-world faults to determine the best way to correlate alarms, carriers could fall victim to the Maytag principle: It's hard to get experience fixing things that seldom break.
Despite having as many as 30 systems generating and managing alarms in the network, carriers are on the prowl for solutions that will take the cost and complexity out of monitoring their networks. “We are always looking for opportunities to build reliability and efficiencies into our network,” said BellSouth's Conley.
The catch? “It's the payback period,” he said. “If we have to invest $100 million and we have a discount payback period of five years, that's not a good thing to do.”
That's one reason why BellSouth will be joining Verizon and SBC in a vice president-level forum beginning next month to meet with top vendors and discuss ways of driving cost out of monitoring and maintaining their networks.
In addition to finding ways of bringing meaning to the information overload generated by next-generation equipment, the forum will discuss how to lower the cost of managing the stragglers left over from the last generation. One possibility might be to pool the human resources within the RBOCs that maintain soon-to-be-obsolete (but not soon enough) systems such as the 1ESS, primarily to reduce costs, but also to free the vendors to better support new systems.
“We are all trying to stay in the same business and make everyone happy,” Chandler said. “Maybe this is one way we can help each other get away from some of the stuff we still have to watch.”
In the meantime, carriers still need to better correlate their alarms. They have relied, as most of us have when things got tough, on MOM — perhaps the most aptly applied acronym in networking. It stands for manager of managers. The MOM is often responsible for correlating alarms from many competing vendors' equipment as well as several network probes, performance measuring devices and other fault management systems — in essence, making peace between brothers.
That hasn't always been easy. Vendors tend to hold back on delivering fault information to a management system run by a competitor.
“Most people would rather reign in hell than serve in heaven,” said Robert Vetter, president of network operations software at Lucent Technologies.
However, that's changed over the last couple of years as vendors have started living up to their claims of providing open systems. As Vetter said, “We can have no more Chinese walls. Systems [in Lucent's case, the Navis iOperations Software manager] have to be multi-vendor.”
As both Chandler and Conley have stated, the system also must tell them what they want to hear. Vetter concurred. “Having a preference for how we want customers to use our systems is counterproductive,” he said.
Lucent has been investing in and developing a more robust mediation layer for its Navis system that promotes multi-vendor correlation and the automatic incorporation of new network elements.
Nortel Networks also has heard the call. “We publish our northbound interfaces for any third-party OSS system to take our feeds,” said a Nortel representative. “We continuously work at refining the definitions of fault message content to facilitate correlation of fault information.”
Verizon will be deploying a dozen or so Nortel Succession switches over the next year as it begins to transport voice over ATM and IP. Chandler warns that design engineers will have to do a much better job of identifying what alarms mean what.
Engineers at Nortel say that the Succession switch generates only about 5% more alarm messages than the typical DMS Class 5 switch. That includes alarms generated by associated media gateways, signaling gateways and audio services. The integrated element management system from Nortel provides carrier fault creation based on customizable threshold criteria that gives carriers control over those faults.
Given enough years, they just might come to understand them. Then, maybe, someone can sell them on a nice root cause analysis solution.
© 2012 Penton Media Inc.