Data center customers expect perfection. Whether you are an in-house department serving a CIO/CTO or a wholesale data center and co-location site, your customers expect perfection when they house their servers and other equipment in the center.

What exactly is perfection in the eyes of today’s data center customer?

  1. The power stays on.
  2. Commands and data go in.
  3. Mission-critical information flows out.
  4. Items 1-3 above happen 24 hours a day, seven days a week, 365 days a year … forever.
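To put a number on "forever," it can help to translate an availability target into the downtime it actually permits. The sketch below is simple arithmetic for illustration; the targets shown are common industry shorthand, not figures from any particular contract.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only ("two nines" through "five nines").
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Even "five nines" leaves a little over five minutes a year, which is why customers who expect perfection are effectively asking for something beyond any of these targets.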

Beyond the elementary expectation that your data center will operate as expected, today’s business practices and new regulations like the HIPAA data rules demand it.

In a presentation at the Data Center World Conference this spring with my colleagues Don Byrne, PhD, and Jack Pyne, we characterized meeting this “always on, never faltering” customer expectation as tantamount to avoiding the “Black Swan.”

Avoiding the ‘Black Swan’

Data centers may be at risk for what risk engineer and high technology philosopher Nassim Nicholas Taleb calls a “Black Swan,” an event where the supposedly impossible can and does happen. The problem, Taleb explains in his book, is that “we place too much weight on the odds that past events will repeat.”

Really important events are rare and unpredictable. He calls these events Black Swans, a reference to a 17th-century philosophical thought experiment. In Europe, all anyone had ever seen were white swans; indeed, “all swans are white” had long been used as the standard example of a scientific truth. So what was the chance of seeing a black one? Impossible to calculate, or at least it was until 1697, when explorers found black swans (Cygnus atratus) in Australia.

 

Presentation slides: “RISK: When What Can Never Happen, Does” (from TechPoint)

 

Being aware of more than basic risks, looking beyond the obvious and planning forward to avoid the Black Swan event, is critical to data center management.  Typical data center risks include:

  • Unlicensed software
  • Home-grown code in critical path
  • Single carriers/utility providers (no provider diversity)
  • No policy/guidance for controlling Bring-Your-Own-Device (BYOD)
  • Rogue wireless access points
  • Local purchasing leading to a lack of configuration control
  • Inaccurate change management tracking
  • Out-of-date documentation
  • Changing compliance requirements with rules/standards/laws
  • Unnoticed facility flaws (e.g., internal wooden frames)
  • ‘Sandbox’ projects using actual client data for testing
  • No data governance software

Using Recognized Standards To Avoid Risks

Many day-to-day issues and Black Swan events can be prevented by implementing recognized standards in the data center operations. Recognized standards are important to the data center risk management strategy for these reasons:

  • Standards help you avoid, mitigate or knowingly accept risk
  • Standards can assist in identifying risk
  • Visible standards assure agencies, clients and stakeholders that you have managed risks appropriately
  • Confidence
  • Communication

Standards and RCM Help Avoid the Black Swans

Reliability-Centered Maintenance (RCM) was developed by the FAA and the airlines in the 1960s, adopted by the US military in the 1970s, and incorporated by the nuclear power industry in the 1980s. It is even used by Disney theme parks.

RCM is risk assessment and management on steroids because it reaches down to the equipment component level. It is formalized in SAE JA1011, which sets the minimum criteria any process should meet before it can be called RCM.

FMECA: Failure Mode, Effects, and Criticality Analysis

To follow SAE JA1011, use the Failure Mode, Effects, and Criticality Analysis (FMECA) process. FMECA is a bottom-up, inductive analytical method that can be performed at the functional or piece-part level. It includes criticality analysis, which charts the probability of failure modes against the severity of their consequences.
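As a rough illustration of the criticality step, the sketch below ranks a few failure modes by charting probability against severity. Everything in it is an assumption made for the example: the 1-to-5 rating scales, the `FailureMode` class, the sample entries, and the shortcut of multiplying the two ratings into one score. Formal FMECA procedures define their own scales and criticality calculations.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a simplified, hypothetical FMECA worksheet."""
    component: str
    mode: str
    probability: int  # assumed scale: 1 (remote) to 5 (frequent)
    severity: int     # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def criticality(self) -> int:
        # Simplification for illustration: probability rating times severity rating.
        return self.probability * self.severity


def rank_by_criticality(modes):
    """Return failure modes sorted from most to least critical."""
    return sorted(modes, key=lambda m: m.criticality, reverse=True)


# Fictitious data for illustration only.
modes = [
    FailureMode("HVAC", "condensate pump clogs", probability=4, severity=2),
    FailureMode("UPS", "battery string fails under load", probability=3, severity=5),
    FailureMode("Generator", "fails to start on utility loss", probability=2, severity=5),
]

for m in rank_by_criticality(modes):
    print(f"{m.criticality:>2}  {m.component}: {m.mode}")
```

With these made-up ratings the UPS battery failure ranks first, ahead of the generator and the HVAC pump, which is the kind of ordering a maintenance plan would then act on.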

FMECA is the starting point for process review. FMECAs are reviewed, refreshed, and maintained at least annually. The collected data is incorporated into an ongoing and dynamic failure probability analysis model.
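One simple way to picture a failure probability model that is refreshed as data comes in is sketched below. It assumes a constant failure rate (the exponential reliability model), which is a textbook simplification and not necessarily the model any given operator uses. The function names and the operating figures are hypothetical.

```python
import math


def estimate_mtbf(operating_hours: float, failures: int) -> float:
    """Estimate mean time between failures from collected field data."""
    if failures <= 0:
        raise ValueError("need at least one observed failure to estimate MTBF")
    return operating_hours / failures


def failure_probability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of at least one failure within mission_hours.

    Assumes a constant failure rate, so P = 1 - exp(-t / MTBF).
    """
    return 1.0 - math.exp(-mission_hours / mtbf_hours)


# Fictitious data: a fleet of units with 50,000 combined operating hours and 5 failures.
mtbf = estimate_mtbf(operating_hours=50_000, failures=5)

# Chance that a single unit fails at some point in the next year (8,760 hours).
p_year = failure_probability(mtbf, mission_hours=8_760)
print(f"Estimated MTBF: {mtbf:.0f} hours; one-year failure probability: {p_year:.1%}")
```

Each annual FMECA refresh would feed new operating hours and failure counts into the estimate, so the probabilities move with the equipment's actual history.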

[Figure: Sample FMECA chart. FMECA analysis can help to avoid data center equipment failures. (Note: Fictitious data for illustration only.)]

 

When evaluating and purchasing data center infrastructure equipment such as generators, UPS systems, and HVAC systems, you should demand copies of the FMECAs from the manufacturer or distributor.

Who should care about risk?

Everybody, really. But as data center operators, we find ourselves working more and more directly with corporate and enterprise risk managers. Client risk managers are becoming more adept at RCM and failure probability analysis, and more aware of the value both bring to the risk assessment and risk management equation.

As service providers striving to be risk-free, it is important for us to demonstrate constantly, through word and action and by using RCM processes and techniques, that your data and systems are as secure as technologically possible.


Rich Banta
Co-owner
Lifeline Data Centers, Indianapolis

Rich is responsible for compliance and certifications, data center operations, information technology, and client concierge services. Rich has an extensive background in server and network management, large scale wide-area networks, storage, business continuity, and monitoring. He is formerly the chief technology officer of a major health care system. Rich is hands-on every day in the data centers.