Preventing Network Nightmares

A potential nightmare lurks in your network system. You can't see it coming, you can't prevent it from paralyzing your system, and you probably don't even want to think about it, but network failure can strike at any time. Outages not only cause chaos and confusion but can also cost wireless carriers, vendors and customers millions of dollars.

Pro-active maintenance and preventive testing help scare away network gremlins, but there is no guaranteed method of ensuring a flawless network. If you want to sleep at night, you need to prepare for the worst and remain ever vigilant, watching over your network 24 hours a day, seven days a week (24/7).

As telecommunications carriers move toward more software-driven services, network-management and monitoring systems have become a necessity. But those systems didn't warn AT&T before two frame-relay switches brought down its entire 145-node network. AT&T's nightmare resulted from a problem between two nodes that spread quickly to bring the network to a standstill. Fortunately for AT&T, the outage affected only the frame-relay network. But the outage didn't spare thousands of its customers.

According to Frank Ianna, executive vice president of network and computing services, most of AT&T's customers use a combination of frame-relay services and private-line services as well as dial backup or ISDN backup for many of their systems. But the outage forced many to rethink their backup procedures.

"You can envision any set of circumstances all the way to providence that would cause a network to fail, and we just will be unable to overcome every circumstance," Michael Armstrong, AT&T CEO, said at a press conference.

Network Complexities

Although AT&T decided to forgo all charges to customers for frame-relay service until it identified, isolated and determined how to fix the problem, this didn't relieve fears of a similar failure. With the increased complexity of network systems, outages are not anomalies, and reliability is not guaranteed.

As carriers move into advanced stages of networking, network management becomes critical in preparing for nightmares such as AT&T's. Roger Boulanger, Harris Corporation Network Support Systems vice president of technologies, said customers' demands are straining today's networks.

As service providers work to provide advanced features, network reliability is mission critical. Michael Robinson, Sprint PCS vice president of network operations, said carriers must become more determined to provide dependable service to match customers' increased expectations.

"Wireless is not this second line (for customers) -- in many cases, it is the primary line of communication," he said. "With that comes greater expectations of reliability, and it is incumbent on us (carriers) to step up and provide that."

To keep up with customer requirements, carriers overbuild with new technologies, which then are combined with legacy equipment to create hybrid networks, and this complicates the system, Boulanger said. The integration of newer technologies with legacy systems is a challenge for carriers and vendors.

"Service providers try to anticipate by providing more technology before it is needed," Boulanger explained.

With such complex networks, it is not effective to manage a network using multiple, uncoupled network-management systems, he noted. "Seamless integration of management networks with reduced staffing is the biggest challenge," he said.

Boulanger said that new services such as the Internet and e-commerce have not had a significant impact on wireless carriers. What affects service providers more is their total reliance on networks, which are made up of many elements that may or may not fail at some point.

"Modern network elements tend to be more reliable than ever, (but) they are becoming more complex," Boulanger said.

Robinson agreed. As networks become more complex, carriers must not forget the fundamentals, he said. Carriers and vendors must continue to focus on the basics and not get "carried away with new complexities."

Considering all of these possible network complexities, vendors such as AvData Systems provide carriers with several network-system-support options because "wireless carriers need their networks 100% of the time," said Susan Cadwallader, AvData director of marketing.

Despite support systems, total prevention is not realistic.

"There's always a risk of something going wrong that was not anticipated," Boulanger said.

Dealing with Disaster

Most wireless carriers do all they can to control disasters. There are many ways carriers and vendors can work to prevent problems, including establishing excellent backup systems, performing pro-active testing procedures and maintaining strong relationships with one another.

The most important preventive medicines for many carriers are network monitoring and backup systems testing. For example, Bell Atlantic Mobile guards against network outage through redundancies and 24/7 monitoring from its network operations control center, said spokesperson Andrea Linskey.

"We can tell if a circuit goes out and monitor it, identify the problem and dispatch someone to the site immediately," Linskey said.

You can alleviate system outages by engineering, troubleshooting, preparing and monitoring your system, she suggested.

Monitoring alerts Bell Atlantic Mobile to the smallest problem, but Linskey said that problems cannot be averted completely; in February, one of Bell Atlantic Mobile's vendors experienced an outage that affected several carriers. "It was a T1 failure; mobile to landline was not working," she said. "And it happened from 1 p.m. to 4 p.m. in the afternoon, during peak time."

Bell Atlantic rectified the problem quickly, but Linskey said the carrier learned just how important precautions are in preventing failures.

"Remote monitoring is key," she said. "We've always had our network operations control center in place, but before (about one year ago), we had to fix and identify the circuit manually."

Sprint PCS also relies on monitoring and effective backup systems with its national network-management system in the Kansas City, MO, area. Robinson said the center is the hub of Sprint PCS' reliability maintenance process and serves as the "real-time eyes and ears" of the network, reporting all alarms.

Sprint PCS also maintains on-site support staffs that handle preventive maintenance and monitor weather patterns.

"We've found that one of the more common causes of outages has been power failures due to storm damage or other problems," Robinson explained. "Power failures account for almost 70% of our outages."

Because Sprint PCS cannot control Mother Nature, it operates backup generators and mobile base stations, which Robinson said can be mobilized within hours. As part of the carrier's national disaster preparedness plan, these mobile stations are placed in areas that frequently face weather problems.

The recent tornado that devastated Nashville, TN, tested the carrier's backup stations. Sprint responded within hours to a downtown cell site that had suffered severe structural damage. Robinson said it was difficult even to get to the equipment, so Sprint mobilized overlapping, adjacent sites to cover the affected area until the damage could be fixed.

"Our cells overlap to the point where we can lose a site, and it is transparent to the customer," he explained.

Robinson said the most effective way to protect your network is by planning ahead. "Anticipating the possibilities and putting recovery plans in place before problems is important," he said.

Sprint PCS works with federal and state agencies to ensure that in the event of a disaster, customers will not lose their wireless connections when they need them most. The team is continually evolving and changing, Robinson said, because "the disaster preparedness plan that works today probably will not serve customers' needs six months from now."

Patti Finley, AirTouch Cellular spokesperson, said her company's strategy is to do everything possible to maintain service amidst any disaster.

"We have redundant routing of calls through switches and the network, which is a self-healing network that is basically a backup for the interconnect; multiple switches in metropolitan areas; on-site backup generators at critical cell sites in case of a communications power failure; and detailed disaster-recovery plans for every market," she said.

AirTouch also offers a maintenance window, which means no customer impact during maintenance times or changes. Finley said maintenance is done on weekends or evenings. AirTouch works with vendors in advance to pretest any adjustments to ensure that changes meet its expectations before implementation.

The carrier relies on Hewlett-Packard's acceSS7 to monitor the SS7 network. AcceSS7 monitors an entire network simultaneously from a central point, not just one switch at a time, gathering real-time data. It gathers traffic statistics, analyzes SS7 protocols and provides call trace and alarm monitoring. It also gives network operators early warning of network degradation.

"With all of the different equipment, it allows us full visibility for all networks and allows us to report problems and troubleshoot," Finley said. "Such a monitoring system will become more important in the future, as we get into more intelligent networking advancements such as E-911 and local number portability."

Testing and Backup Basics

Vendors also recognize that testing and maintaining good backup systems is key to assuaging network nightmares. AvData Systems' Cadwallader said offering several options such as backup facilities, dial backup, a separate transponder on an alternate satellite and continuous equipment monitoring is critical.

AvData operates replicas of its wireless broadcast facility for backup, with hub facilities in Virginia and Atlanta. AvData also offers terrestrial (wireline) links to back up satellite links as well as satellite links to back up terrestrial ones. For customers with terrestrial links, diverse routing is designed into the network, enabling a customer to use AT&T as a prime carrier and Sprint as a backup, for example.

Cadwallader said that a situation such as the AT&T frame-relay failure wouldn't be a problem for many of AvData's wireless carriers because they broadcast their messages via satellite. In fact, AvData offers its customers an alternate satellite transponder, so they can broadcast off one satellite and also have a backup satellite carrier. The vendor provides built-in diverse routing, and customers have separate carriers whether they buy satellite backup or not. Cadwallader said 95% of AvData's clients have some form of host backup in place.

Most important, equipment is monitored 24/7. The vendor can tell if something degrades and determine equipment configurations, put spares in the field, and even correct minor problems within two hours.

But even with multiple maintenance plans, continuous testing is still an essential part of protecting networks, Cadwallader said. AvData schedules weekly testing for dial backup and monthly testing for the satellite paging backup hub for shared hub customers (shared paging terminal customers). AvData also offers backup testing plans.

"We consider provider networks to be higher availability networks -- they cannot ever be down. ... But there is a premium to be paid for that," Cadwallader said.

Because advanced networks bring more risks, protecting them is neither cheap nor easy, she admitted. AvData received a lot of inquiries after the AT&T failure because it provides design and ISDN backup for frame-relay customers.

"We have built-in redundancies, but things do happen," Cadwallader said. "We think backup is a very important and valuable part of any network plan."

But nothing is foolproof, including backup systems.

"There's not one fail-safe method, and even if you plan, if you don't test your network, it will fail," she said. "Routine testing is the key to success."

Harris' Boulanger noted that pro-active testing reduces trouble reports significantly. Harris, which offers pro-active and reactive testing, introduced a testing system at Supercomm that features pro-active, around-the-clock monitoring of entire networks.

"The system reduces maintenance costs, provides patterned test data on a regular/daily basis and can spot problems before they become problems," he said.

Testing is more important as carriers move into advanced stages of networking, Boulanger added. The uniformity of testing now is a critical issue, Boulanger said, because of the proliferation of new technologies and companies merging. Uniformity also is a concern because if test results are not uniform, the information is useless.

According to Boulanger, monitoring and testing are necessary to handle unexpected problems and prepare backup systems. Carriers often overlook the impact of common facilities failure, he said, and without effective remote monitoring, problems are ignored until they interrupt service. Equipment failure will not always affect traffic because the backup will kick in, but there is a chance that the backup won't function, which will result in outages.

And as networks become increasingly complex, Boulanger said, outages cannot be prevented totally, even though vendors and carriers strive to minimize the odds of a failure happening.

"Most important, carriers must have network-management systems that can diagnose the problem early on," he said. "If a portion of the network goes down, it is important to be able to minimize downtime."

Network-management systems must be able to redirect traffic immediately in the event of a failure. "Every network now has backup systems, but that's not really adequate," he explained. "Systems must be maintained and monitored regularly -- vigilance is the most effective (way to avoid outages). Knowing the condition of your network is critical."

Relationships Breed Reliability

Another ingredient of a good prevention and maintenance plan is a strong working relationship with vendors. Sprint PCS works with its main vendors -- Lucent, Motorola and Nortel -- through ongoing system design to ensure network reliability.

"We work with them in staging equipment so that in rare moments such as disasters, there is a high probability we have equipment available," Robinson said. "We also can divert and replace staged base stations from our build and expansion program to restore service in a disaster-impacted area, if the extent of damage exceeds our cell-on-wheels disaster-response resource."

Finley said a few disasters have been averted because of AirTouch's relationship with vendors such as Lucent, Motorola and Nortel. In 1990, the I-90 bridge in Seattle that links the east and west sides of the city sank. There was a tremendous change in the traffic on the network, and AirTouch had to work quickly with vendors to enhance the capability within a few days. Finley said AirTouch currently is preparing its networks for the increased capacity that will be necessary for the 2002 Winter Olympics in Salt Lake City, where traffic demands are expected to more than double.

Monitoring the Future

Disastrous network outages may not be 100% preventable, and carriers and vendors may not be able to assure customers that network nightmares won't happen, but they can offer some network reliability via pro-active maintenance, monitoring and testing.

One of the biggest challenges for network reliability, according to Robinson, is network diversification.

"With network architectures, there's an opportunity to distribute risks," he said. "Network diversification becomes more critical. It's an ongoing challenge for all of us."

Another challenge is maintaining network reliability for your customers.

"Even a 10-minute outage can hurt a carrier's brand and devalue its image," Linskey said.

After its frame-relay failure, AT&T's Ianna implied that outages cannot be stopped, but only contained.

"That's what network reliability is about," he said at a press conference. "Understand what can happen to the network and try to do the things that will instantly or very quickly detect it and minimize the impact if it were to occur. So it is prevention, detection and correction."

Wireless carriers that want to avoid AT&T's recent headaches would be wise to follow his advice.

In May, PanAmSat's Galaxy 4 (G4) satellite, which transmitted up to 90% of the United States' paging signals, malfunctioned. The primary and backup computers required to keep G4 aimed at Earth stopped working because of an onboard control system and a backup switch failure, causing the 5-year-old spacecraft to spin out of orbit. Because the satellite was in orbit 22,300 miles above the center of the United States and provided equal access to any location, virtually all major nationwide paging carriers used the satellite for their primary direct satellite broadcasting systems. Providers had to scramble when the satellite failed and service was disrupted for tens of millions of people.

The outage reportedly affected 80% to 90% of the more than 48 million U.S. paging users. Robert Bednarek, PanAmSat president, said such an outage was unprecedented because the overall industry loss of satellites in orbit is less than 1% over the past five years. But that was no consolation for the 45% of users who rely on pagers for business or emergency services.

Service providers acted quickly to keep customers connected. PageNet, which serves 10.4 million paging users in the United States, employed its 9,500-transmitter paging network to send messages via a backup satellite. The transition required network adjustments on a city-by-city basis, with pagers returning to service as individual cities were reconfigured to the new satellite. Two days after G4's malfunction, 95% of PageNet's customers had service.

SkyTel used its landlines, normally used only to transmit replies, to transmit its 2-way network. The outage primarily affected paging customers' 2-way and advanced-messaging systems. According to Marc Kuykendall, SkyTel's director of corporate communications, because 2-way is new, there isn't much redundancy in place yet. But this probably will change as paging carriers re-examine their backup plans in light of the satellite crash.

AirTouch Paging restored service to 160 of its 170 U.S. markets, or 80% of its 3.1 million paging customers, the day after the satellite malfunctioned. The provider moved its paging traffic to another satellite owned by PanAmSat, SBS6. Contracts with AirTouch's satellite vendor require the satellite company to provide this priority backup for such failures. AirTouch engineering personnel repositioned core sites in all cities to receive new signals from the backup satellite.

MobileComm technicians restored service to customers by engaging a backup satellite and rerouting all of its paging traffic. MobileComm repositioned its 2,500 base stations toward the backup G31 satellite to restore service hours after G4 went down. About 37% of MobileComm's 3.3 million subscribers were affected by the failure.

Spokesperson Krista Grossman said that after MobileComm found out about the malfunction, it decided not to wait for the satellite to be fixed but started switching its base stations to backup satellites immediately. Because of this decision, some customers were back on-line the night of the failure.

Jay Kitchen, PCIA president, said the failure served as a wake-up call about the importance of paging services and wireless communications in our everyday lives. Paging providers handled the outage well, Kitchen said. Nearly 90% of customers were back on-line within 48 hours, which showed that most contingency plans carriers have in place do work and are sufficient, he said.

© 2013 Penton Media Inc.
