In addition to the design and infrastructure considerations, emergency response plans, and mock drills I explained in my previous posts, a highly structured and robust maintenance program is crucial to keeping a disaster from impacting the business.
Your data center services provider should have a Computerized Maintenance Management System (CMMS) to track when maintenance is due and which repairs have been done. A CMMS also helps identify equipment that has needed numerous repairs and may require replacement before it reaches end-of-life. Only through a regularly scheduled preventive maintenance program performed by OEM representatives can you be assured that a data center is prepared for a disaster.
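To make the two jobs of a CMMS concrete, here is a minimal sketch in Python. It is not how any particular CMMS product works: the `Asset` record, the 90-day interval, and the "three repairs in a year" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Asset:
    """One piece of equipment in a hypothetical CMMS record set."""
    name: str
    last_maintained: date
    interval_days: int                          # assumed maintenance interval
    repairs: list[date] = field(default_factory=list)

    def next_due(self) -> date:
        """Date on which the next scheduled maintenance falls."""
        return self.last_maintained + timedelta(days=self.interval_days)


def maintenance_due(assets, today):
    """Names of assets whose scheduled maintenance date has arrived or passed."""
    return [a.name for a in assets if a.next_due() <= today]


def replacement_candidates(assets, today, window_days=365, max_repairs=3):
    """Names of assets with at least max_repairs repairs in the look-back window.

    Both thresholds are illustrative assumptions, not industry standards.
    """
    cutoff = today - timedelta(days=window_days)
    return [
        a.name
        for a in assets
        if sum(1 for r in a.repairs if r >= cutoff) >= max_repairs
    ]


# Made-up sample data for illustration only.
today = date(2024, 6, 1)
assets = [
    Asset("UPS-A", date(2024, 1, 15), 90,
          [date(2023, 9, 1), date(2024, 2, 10), date(2024, 5, 3)]),
    Asset("CRAC-2", date(2024, 4, 20), 90),
]

due = maintenance_due(assets, today)
candidates = replacement_candidates(assets, today)
```

With this sample data, "UPS-A" shows up both as overdue for maintenance and as a replacement candidate, which is the kind of pattern you would want a provider's records to surface.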
For example, batteries are a weak point in any system and, if not monitored and maintained properly, can themselves cause an outage during a loss of utility power. Real-time monitoring helps not only by reporting when batteries fall outside OEM specifications, but also by performing load testing to ensure the UPS can support the critical load. Although many providers perform quarterly maintenance on their batteries, that isn’t always enough: batteries can, and often do, fail shortly after scheduled preventive maintenance.
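The "out of OEM specification" check can be sketched as a simple comparison of each reading against limits. The measured quantities (float voltage and internal resistance) and every number below are assumptions for illustration; real limits come from the battery manufacturer, and a real monitoring system would also handle load-test results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatterySpec:
    """Illustrative limits only; real values come from the battery OEM."""
    min_float_voltage: float         # volts per unit
    max_float_voltage: float         # volts per unit
    max_internal_resistance: float   # milliohms


@dataclass(frozen=True)
class Reading:
    """One monitoring sample for a single battery unit (hypothetical shape)."""
    unit_id: str
    voltage: float
    internal_resistance: float


def out_of_spec(readings, spec):
    """Return (unit_id, reason) pairs for every reading outside the spec."""
    alerts = []
    for r in readings:
        if not spec.min_float_voltage <= r.voltage <= spec.max_float_voltage:
            alerts.append((r.unit_id, "voltage"))
        if r.internal_resistance > spec.max_internal_resistance:
            alerts.append((r.unit_id, "internal resistance"))
    return alerts


# Made-up spec and readings for illustration only.
spec = BatterySpec(min_float_voltage=13.2, max_float_voltage=13.8,
                   max_internal_resistance=5.0)
readings = [
    Reading("B01", 13.5, 3.9),
    Reading("B02", 12.9, 4.1),   # voltage below the assumed minimum
    Reading("B03", 13.6, 6.2),   # resistance above the assumed maximum
]

alerts = out_of_spec(readings, spec)
```

Running a check like this continuously, rather than once a quarter, is what lets a provider catch a unit that degrades between scheduled maintenance visits.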
The way in which maintenance on critical equipment is planned and executed is also extremely important. For example, a “critical environment work authorization program” ensures that each element in the maintenance procedure is reviewed not only by the local facilities engineering team but also by a committee consisting of engineering staff and management across the enterprise. Maintenance on critical equipment should only be performed when the provider has 100% confidence in the “method of procedure,” the contractors performing the work, and the documented contingency plans. You should also request to see maintenance records, including associated repairs, to confirm that the provider can anticipate maintenance needs and prevent failures.
Predictive maintenance is as important as preventive maintenance, particularly for end-of-life equipment replacement decisions. Once again, batteries, and specifically their timely replacement, are a perfect example. In this case, you should ask how old the UPS batteries are, what life expectancy the OEM recommends, and when your data center provider plans to replace them. Even well-maintained equipment will eventually reach end-of-life, which could lead to a catastrophic failure without proper predictive planning for replacement.
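The three battery questions (age, OEM life expectancy, replacement plan) reduce to simple arithmetic. The sketch below assumes a provider starts planning replacement once 80% of the OEM-recommended life has been used; that threshold and the function names are illustrative assumptions, not an industry rule.

```python
from datetime import date


def service_life_used(installed, today, expected_life_years):
    """Fraction of the OEM-recommended service life already consumed."""
    age_years = (today - installed).days / 365.25
    return age_years / expected_life_years


def replacement_status(installed, today, expected_life_years, plan_at=0.8):
    """Classify a battery string by age relative to its expected life.

    plan_at is an assumed planning threshold (80% of expected life).
    """
    used = service_life_used(installed, today, expected_life_years)
    if used >= 1.0:
        return "overdue"
    if used >= plan_at:
        return "plan replacement"
    return "ok"


# Made-up install dates and a 5-year expected life, for illustration only.
today = date(2024, 6, 1)
status_old = replacement_status(date(2018, 1, 1), today, 5)
status_mid = replacement_status(date(2020, 1, 1), today, 5)
status_new = replacement_status(date(2023, 1, 1), today, 5)
```

A provider with a sound predictive program should be able to give you the equivalent of this answer for every battery string, along with the date the replacement is scheduled.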
For IP network maintenance, ask your data center services provider how long their equipment has been in service, when the last failure occurred, and how recently the software on it was updated. How do they monitor the health of the devices? Do they monitor device logs proactively, or do they primarily react to events as they occur? Do they maintain certain devices in the network differently than others, and if so, why? How do they respond when impactful software bugs are found? What is their QA process for validating new software and configurations before deploying them to the network?
Stay tuned for more on data center disaster preparedness in our next segment on communication best practices for data center providers.