The Roots of Reliability
No matter where, when, or how long, network outages make news. They are the nightmare of all network service providers. An outage means immediate loss of sales and revenue. It also can mean loss of customers, a steep decline in market capitalization and lasting damage to the provider's reputation.
The cause of a given outage can be as obvious as a lightning strike or as arcane as a glitch in software code. But often, it can be traced to a flaw in basic infrastructure: the electric power and climate controls that support critical computer equipment. What shuts down a network at a critical time may be a remarkably low-tech problem:
- A power failure traced to an uninterruptible power supply with too little redundancy
- A server room that ran too warm because an improper rack configuration blocked the flow of conditioned air
- An inadequately trained employee who tripped the wrong circuit breaker during a maintenance inspection
Providers that quickly build out critical infrastructure can place themselves at considerable risk if they fail to implement systems that ensure uninterrupted service. Reliable critical equipment depends on sound engineering and operating standards at each step of development, including site selection, design, construction, commissioning and steady-state operation.
Critical facilities: A brave new world
Critical facilities in business are far from new: Telephone companies, banks, credit card merchants and brokerages have owned vast computer networks for years. What has changed is the speed of deploying technology.
Since the Internet emerged as a communication channel and a place to shop for goods and services, demand for critical facilities has exploded. This growth profoundly affects the telecommunications, Internet service and fiber optics industries. New and existing companies, jockeying for market position, rush to build new critical facilities using speed and budget as their main criteria.
However, far different factors drive critical facilities development in more established companies. To insulate against risk, these providers expand their networks according to a plan, applying strict engineering, operations and maintenance standards. Support personnel are experienced and technically capable. A combination of in-house staff, architects and consulting engineers is constantly on the lookout for network trouble.
That stability and discipline are often missing in growth companies. Firms in emerging and hotly competitive sectors such as wireless communications tend to staff up quickly, hiring people from multiple organizations. In addition, the speed-to-market imperative greatly accelerates the build cycle.
At the same time, cost pressures dictate the maximum productive use of floor space. Equipment configurations may violate principles of temperature and humidity control and may fail to account for future maintenance needs. Facility design and operating practices may lack standardization across locations.
Most important, growth may outstrip the capability of standby electrical power, power quality protection and air-conditioning systems. While the occasional power or climate-control problem may not seriously harm a traditional business, it can devastate a network-based firm, especially one that has lured customers with guarantees of 100% reliability.
The anatomy of risk
Service providers' critical facilities must closely regulate temperature and humidity and their rate of change. Operating outside manufacturers' specifications can shorten mean time between failures in servers and other computer data storage and communication devices. Low humidity increases the risk of damaging electrostatic discharge.
In recent years, manufacturers have made computer equipment somewhat more temperature- and moisture-tolerant, but 68°F to 72°F and 40% to 60% relative humidity are still considered ideal conditions in critical facilities.
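As a rough illustration, the ideal range can be written as a simple per-reading check. This is a minimal sketch, not a monitoring product: the function name and the message format are assumptions made for illustration, and only the 68°F to 72°F and 40% to 60% limits come from the text.

```python
# Ideal conditions quoted in the article; everything else here is illustrative.
TEMP_RANGE_F = (68.0, 72.0)
HUMIDITY_RANGE_PCT = (40.0, 60.0)


def check_environment(temp_f, humidity_pct):
    """Return a list of problems found in one sensor reading (empty if none)."""
    problems = []
    low_t, high_t = TEMP_RANGE_F
    low_h, high_h = HUMIDITY_RANGE_PCT
    if not low_t <= temp_f <= high_t:
        problems.append(
            f"temperature {temp_f:.1f} F is outside {low_t:.0f}-{high_t:.0f} F"
        )
    if not low_h <= humidity_pct <= high_h:
        problems.append(
            f"relative humidity {humidity_pct:.0f}% is outside {low_h:.0f}-{high_h:.0f}%"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical reading from a warm, dry corner of the room.
    for problem in check_environment(75.2, 32):
        print(problem)
```

A real facility would also track how quickly readings change, since manufacturers specify rate-of-change limits as well; those limits vary by equipment and are not given here.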
Critical equipment rooms are typically cooled through sub-floor plenums that deliver conditioned air under pressure through a system of floor vents. In a well-designed system, cooling capacity is sized to the equipment's heat-rejection requirements, and air pressure and flow patterns are carefully managed to deliver the necessary cooling to each area of the room. The room design recognizes manufacturers' requirements for "white space": open areas around racks that enable front, back or side access for replacement and service.
Problems occur when companies strive to maximize revenue potential per square foot of space outside the context of a sound facilities plan. When servers and other devices are added without considering the effects on air conditioning and ventilation, the critical environment can slip out of control.
Often, personnel building critical equipment rooms ignore white space requirements, installing equipment in any open area. Besides increasing heat load, the added equipment requires more electrical cable cutouts in the floor. These allow the uncontrolled escape of chilled air, reducing system pressure and impeding delivery of air to points where it is most needed. Furthermore, improper alignment of equipment racks can interrupt air flow, causing "hot spots." In extreme cases, equipment cooling requirements exceed the capacities of chilled water piping, ducts, condenser loops or cooling towers.
The net result is loss of the ability to cool equipment properly over an extended period. Equipment runs above manufacturer-recommended temperatures or, just as serious, its temperature fluctuates beyond manufacturers' rate-of-change specifications. This inevitably accelerates wear and in the long run increases the risk of failures.
Ill-planned construction also can affect electric power reliability. For example, improperly organized circuits may create phase imbalances large enough to cause an uninterruptible power supply (UPS) to trip offline. At bare minimum, a level of power-protection redundancy is lost and the risk of a system outage is magnified.
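To make "phase imbalance" concrete, the sketch below uses one common convention: the largest deviation of any phase from the three-phase average, expressed as a percentage of that average. The choice of formula and the function name are assumptions for illustration. The level at which a particular UPS trips is set by its manufacturer and is not stated here.

```python
def phase_imbalance_pct(phase_a, phase_b, phase_c):
    """Percent imbalance across three phase loads (amps or kVA).

    Assumed convention: maximum deviation from the average,
    divided by the average, times 100.
    """
    loads = (phase_a, phase_b, phase_c)
    average = sum(loads) / 3.0
    if average == 0:
        return 0.0  # no load on any phase, so nothing to be out of balance
    max_deviation = max(abs(load - average) for load in loads)
    return max_deviation / average * 100.0


if __name__ == "__main__":
    # Hypothetical loads after new racks were wired mostly to one phase.
    print(round(phase_imbalance_pct(120, 100, 80), 1))
```

Recomputing this figure whenever circuits are added is one way to catch a poorly organized panel before the UPS does.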
Cases in point
A properly designed system and properly trained personnel are keys to effective critical environment control. Two recent examples illustrate how the risk of failure increases when either component is lacking.
Case 1: Flawed design. A large, publicly traded company was experiencing climate-control problems in a 15,000-square-foot computer network facility. As the company installed critical equipment in the raised-floor room, temperatures were slipping outside specifications in multiple areas. The company commissioned an analysis of the climate-control system, which found that:
- The 15-year-old condenser loop was providing insufficient heat transfer because of severe tube scaling caused by inadequate water treatment
- Air conditioning equipment was running at full capacity with multiple equipment problems, yet no alarms were being produced
- Improper alignment of server racks impeded the return path of air to the in-room air-handling units, causing hot spots
- Obstructions in the sub-floor lowered the available sub-floor air pressure to near zero in certain areas, preventing delivery of chilled air
In general, the facility was at significant risk of a failure that might have caused a system outage affecting large numbers of customers and possibly leading to widespread adverse publicity.
The company corrected the problem by upgrading cooling equipment, moving the air-handling units so that return air flowed perpendicular to the units, removing sub-floor obstructions, and introducing a process to manage floor cutouts. The modifications made temperatures in the room far more uniform.
Case 2: Training deficiencies. Studies of power quality document that eight out of 10 power outages are caused not by the initial event (such as a lightning strike or dig-in fault) but by how systems or personnel respond.
An ISP faced a potential crisis in its call center after one of two UPSs in a redundant system tripped offline. Immediately after the event, operating personnel called for service. A service technician went to the circuit breaker to disconnect the failed UPS. Without consulting the as-built drawings, the technician threw the breaker labeled for the failed UPS. In reality, the labeling was incorrect, and the technician's action cut the utility power feed to the operable UPS.
That UPS went into alarm, reporting to the building automation system that its battery was discharging and that a shutdown of the protected systems was imminent. No one noticed the alarm. When battery power ran out, a major outage occurred, affecting all terminals in the call center.
The incident could have been avoided if labeling of electrical circuits had been correct, if the technician had not assumed the labeling was accurate, and if facility personnel had been properly trained and drilled in handling such events.
Toward long-term reliability
Reliable critical facilities depend on supporting infrastructure built with discipline and foresight. An effective critical facilities plan accounts for four key components: people, process, systems and technology.
Building for reliability is a five-step process, and each step requires coordination among multiple functions. Companies that design, engineer and construct the facilities must consider the needs of those who will eventually commission, operate, service and maintain them.
Site selection. Construction cost per square foot is a key business consideration, but needs such as security, power quality, fire protection and structural integrity may easily outweigh the benefits of low-cost real estate.
Companies planning critical facilities should consider local earthquake potential, lightning strike frequency, crime levels and, most important, electric service reliability, including the frequency of power outages. Sites near foundries or other operations with large electrical loads require careful planning for power-quality protection.
Design. Facilities must be designed around specific standards that cover equipment and material specifications, code compliance and installation practices. These standards should be consistent from location to location. Critical considerations include:
- Cooling capacity. Cooling equipment (chillers, piping, duct capacity, headers, condenser loops and cooling towers) must be sized for the heat rejection needs of the fully built facility with allowance for system losses.
- Air flow. Equipment must be arranged to prevent air-flow obstructions, which can create warm and cool spots. Server racks should be aligned parallel to the direction of air flow. Cable cutouts and other floor openings must be minimized to prevent misdirection of air and loss of air pressure. Sub-floor air-flow restrictions can be reduced with cable management systems that direct electrical cables into troughs. Such problems can be nearly eliminated with overhead electrical cabling.
- Maintainability. Systems must be configured so that maintenance can be performed without sacrificing redundancy. For example, air conditioners should be connected in parallel, not in series.
- Standardization. Use of the same materials, devices and technologies across all sites helps reduce costs in the design, construction and operating phases.
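The cooling-capacity and maintainability points lend themselves to a back-of-the-envelope calculation. The sketch below assumes that essentially all electrical power drawn by the equipment ends up as heat, and it uses the standard conversions of about 3.412 BTU/hr per watt and 12,000 BTU/hr per ton of refrigeration. The 20% loss allowance, the single spare unit and the function names are hypothetical values chosen for illustration, not figures from the text.

```python
import math

BTU_PER_HR_PER_WATT = 3.412   # heat rejected per watt of electrical load
BTU_PER_HR_PER_TON = 12_000   # one ton of refrigeration


def required_cooling_tons(full_build_load_kw, loss_allowance=0.20):
    """Cooling needed for the fully built room, in tons of refrigeration.

    loss_allowance is a hypothetical margin for system losses (0.20 = 20%).
    """
    heat_btu_per_hr = full_build_load_kw * 1000 * BTU_PER_HR_PER_WATT
    return heat_btu_per_hr * (1 + loss_allowance) / BTU_PER_HR_PER_TON


def units_with_redundancy(required_tons, unit_capacity_tons, spares=1):
    """Parallel units needed to carry the load, plus spare units so that
    one can be taken out for maintenance (spares=1 is an assumption)."""
    return math.ceil(required_tons / unit_capacity_tons) + spares


if __name__ == "__main__":
    tons = required_cooling_tons(100)   # hypothetical 100 kW full-build load
    print(round(tons, 2), units_with_redundancy(tons, 10))
```

Sizing against the fully built load, rather than the load on opening day, is the point of the exercise: capacity added later must fit within piping, ducts and towers that were sized at the start.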
Construction. Facility buildout must follow accepted best practices and must adhere to all applicable design standards. When possible, systems and equipment should be tested to verify that they meet design specifications. For example, UPS systems should be factory-witness tested under load. Cooling systems should be tested on-site. Tests should also be conducted for continuity in grounding systems, on links between generators and transfer switches, and on power cable insulation.
Commissioning. Before start-up, design specifications and construction standards require verification by an outside source hired by and accountable to the facility owner. The entire power system must be confidence-tested. Commissioning also includes the generation of critical documents: as-built drawings, general maintenance plans and training procedures. These, along with equipment specifications, warranties and call numbers for service, should be placed in a reference library.
Steady-state operation. Sound operation, maintenance and contingency plans should guide day-to-day activity. Plans should include an escalation policy specifying whom to call for different levels of alerts or alarms. Critical spare parts should be identified, sourced and stored on-site as required.
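An escalation policy can be as simple as a table that says who is contacted, and how long an alarm may go unacknowledged before the next person is called. The sketch below is entirely hypothetical: the severity levels, contacts and time limits are placeholders, and a real policy would come from the facility's own operations plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unacknowledged before this contact is called
    contact: str


# Hypothetical policy; every level, name and interval is a placeholder.
POLICY = {
    "warning": [
        EscalationStep(0, "on-duty operator"),
        EscalationStep(30, "facility manager"),
    ],
    "critical": [
        EscalationStep(0, "on-duty operator"),
        EscalationStep(5, "facility manager"),
        EscalationStep(15, "director of operations"),
    ],
}


def contacts_to_notify(severity, minutes_unacknowledged):
    """Everyone who should have been called by now for this alarm."""
    return [
        step.contact
        for step in POLICY[severity]
        if minutes_unacknowledged >= step.after_minutes
    ]


if __name__ == "__main__":
    # A "battery discharging" alarm that nobody has acknowledged for 20 minutes.
    print(contacts_to_notify("critical", 20))
```

Under a rule like this, an unnoticed UPS battery alarm of the kind described in Case 2 would reach additional people well before the batteries ran out.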
Whenever a change in critical equipment is proposed, appropriate personnel should carefully evaluate the effects on the environmental control system. If the change is made, the provider must update as-built drawings, maintenance procedures, training programs and other facets of the operations plan.
A change-management policy should guide any necessary adjustments to operating procedures. To protect "tribal knowledge" held by long-time staff, personnel should be cross-trained and succession plans should be in place for key staff members.
Without exception, telecom companies are best served by designing, building and maintaining critical facilities according to a comprehensive, integrated plan.
John W. Sawyer is director of mission-critical facility services for Johnson Controls, Milwaukee. His e-mail address is john.w.sawyer@jci.com.
© 2012 Penton Media Inc.