According to a Ponemon Institute study, the average cost of an unplanned data center outage is nearly $9,000 per minute, while the most expensive outage recorded cost $2,409,991.
Increasing Importance of Data Centers
In today’s IT environment, data centers play a major role in storing and securing business-critical data and making it accessible whenever required. As data centers have become more crucial to enterprise IT operations, it is equally important to keep them running efficiently. With the proliferation of data and growing concerns over energy consumption, environmental impact and the operational expenses involved in running a data center, IT managers are looking for ways to avoid the additional costs of data center outages.
A data center outage, measured thoroughly, affects more than a company’s budget. It can damage organizational data, the critical equipment attached to the data center, the company’s productivity and the brand’s reputation. Companies should therefore prepare strategies to improve data availability and reduce downtime risks.
Why Downtime Matters for an Organization
The security and reliability of components in the LAN and data center space are important to the success of a business. As the world becomes more connected, any failure in the network leads to downtime, which can cause revenue loss in a wide range of ways.
Modern businesses depend on online communication and services, so downtime of a communication network or server results in lost productivity. For companies that conduct substantial business online and display their products on a website, downtime means lost purchase opportunities, which in turn means lost revenue. Data loss due to an outage is a nightmare for many companies, as it can expose data and create opportunities for cyber attacks. It can also create uncertainty among customers and erode the customer base.
Root Causes of Data Center Outages
Data centers require a high level of reliability, and a variety of internal and external factors can threaten availability and lead to an outage.
Below are the leading causes of unplanned downtime.
1. Rack Level PDU
The PDU is a critical component in the data center. Poor-quality PDUs and power cords, if not given proper priority, can lead to a disaster. PDU failures, MCB (miniature circuit breaker) trips and loose connections are the most common PDU problems.
2. Failure in UPS systems
UPS failures remain the most common cause of outages, accounting for one-fourth of all unplanned data center outages. They can stem from battery or equipment failure, or from a power draw beyond the UPS capacity.
3. Cyber Crime Activities
Cyber attacks are now viewed as a top cause of unplanned downtime and are one of the fastest-growing causes of data center outages. Their share rose from just 2% of outages in 2010 to 22% in 2016. Data center operators must establish early detection and mitigation systems that can stop a cyber attack before it causes an outage.
4. Human Errors
Mistakes happen whenever people are involved in organizational processes. Human error is the simplest cause, yet the hardest to avoid. It accounted for 22 percent of unplanned outages in 2016.
5. Heat Related Failures
Computer Room Air Conditioning (CRAC) failures have also increased as more infrastructure is packed into the modern data center. Most cooling systems were not designed for the increased density of a packed data center.
Cooling systems therefore need to be reworked to bring down heat-related failures in a data center.
6. Weather Related Failures
Natural disasters and weather events made up 10 percent of outages. Suitable data center cabinets should be considered to help safeguard equipment from natural calamities.
100% Uptime: Additional Investments Will Definitely Help
Almost all of the outages mentioned above are preventable, and in many cases the cost of preventing the problem is insignificant compared to the direct and indirect cost of the outage.
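The comparison between prevention cost and outage cost can be sketched as simple arithmetic. The snippet below is only an illustration: it treats the roughly $9,000 figure quoted at the top of this article as a cost per minute, which is an assumption, and the 90-minute outage and $50,000 prevention budget are hypothetical numbers, not data from any study.

```python
def outage_cost(minutes, cost_per_minute):
    """Estimated direct cost of an outage lasting `minutes`."""
    return minutes * cost_per_minute


def prevention_pays_off(prevention_cost, minutes_avoided, cost_per_minute):
    """True if the prevention spend is lower than the outage cost it avoids."""
    return prevention_cost < outage_cost(minutes_avoided, cost_per_minute)


# Assumption: roughly $9,000 per minute of downtime.
COST_PER_MINUTE = 9_000

# Hypothetical scenario: a single 90-minute outage.
estimated = outage_cost(90, COST_PER_MINUTE)
print(f"Estimated outage cost: ${estimated:,}")

# Hypothetical $50,000 spent on monitoring that avoids that outage.
print(prevention_pays_off(50_000, 90, COST_PER_MINUTE))
```

Indirect costs such as lost customers and reputational damage are not modeled here, so the real gap between prevention and outage cost is likely larger.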
Below are some basic tips to counter data center outages:
1. Making the DC Rack Intelligent
Adding intelligent PDUs and environmental monitoring minimizes the risk of overloading and tripping. Intelligent PDUs provide power data in real time, which makes it possible to automate policies and alerts at the PDU and socket level. With socket-level monitoring and proper policies, we can also reduce the risk of a single SMPS or server bringing down the complete rack. Using locking power cords avoids accidental cord pulls and loose contacts that cause server outages.
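As a rough illustration of what a socket-level policy might look like, here is a minimal sketch. It does not use any real PDU vendor API: the `SocketReading` type, the `check_rack_power` function, the 16 A breaker rating, the 5.5 A per-socket limit and the 80% warning threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SocketReading:
    socket_id: str
    amps: float  # current drawn on this outlet right now


def check_rack_power(readings, breaker_amps, socket_limit_amps, warn_fraction=0.8):
    """Return a list of alert strings for one rack PDU."""
    alerts = []
    # Socket-level policy: catch one misbehaving power supply
    # before it trips the breaker for the whole rack.
    for r in readings:
        if r.amps > socket_limit_amps:
            alerts.append(
                f"socket {r.socket_id}: {r.amps:.1f} A exceeds limit {socket_limit_amps:.1f} A"
            )
    # Rack-level policy: warn before total load reaches the breaker rating.
    total = sum(r.amps for r in readings)
    if total > warn_fraction * breaker_amps:
        alerts.append(
            f"rack total {total:.1f} A exceeds {warn_fraction:.0%} of {breaker_amps:.1f} A breaker"
        )
    return alerts


# Hypothetical readings for three outlets on one rack PDU.
readings = [
    SocketReading("A1", 2.5),
    SocketReading("A2", 6.0),
    SocketReading("A3", 5.0),
]
alerts = check_rack_power(readings, breaker_amps=16.0, socket_limit_amps=5.5)
for line in alerts:
    print(line)
```

In practice the readings would come from the PDU's own management interface, and the limits would be set from the rated capacity of the PDU and the connected equipment.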
Environmental monitoring at rack level, in line with ASHRAE guidelines, gives real-time information on cooling parameters and lets us mitigate cooling issues to a great extent. Installing intelligent locks reduces physical security risk at the rack level, and monitoring lock status in real time helps block unauthorized access.
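A rack-level temperature check could look something like the sketch below. It assumes the commonly cited ASHRAE recommended inlet air range of 18 to 27 °C; confirm the range that applies to your equipment class before relying on it. The function names and the sample readings are hypothetical.

```python
# Assumption: ASHRAE recommended inlet air temperature range, in Celsius.
RECOMMENDED_INLET_C = (18.0, 27.0)


def classify_inlet(temp_c, limits=RECOMMENDED_INLET_C):
    """Classify one rack inlet temperature against the recommended range."""
    low, high = limits
    if temp_c < low:
        return "too_cold"
    if temp_c > high:
        return "too_hot"
    return "ok"


def racks_needing_attention(inlet_temps):
    """Map rack name -> status for every rack outside the recommended range."""
    statuses = {rack: classify_inlet(t) for rack, t in inlet_temps.items()}
    return {rack: s for rack, s in statuses.items() if s != "ok"}


# Hypothetical sensor readings, one inlet sensor per rack.
sample = {"rack-01": 22.5, "rack-02": 29.1, "rack-03": 17.0}
flagged = racks_needing_attention(sample)
print(flagged)
```

Humidity and dew point limits are left out of this sketch to keep it short; a real monitoring policy would normally track them as well.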
Installing a proper asset management system reduces the risk of new equipment going into the wrong rack, which in turn creates power and cooling issues.
2. Finding the right partner for DCIM solutions
For maximum availability and increased productivity, choose a DCIM (data center infrastructure management) vendor that can provide performance optimization and data center assessment services.
3. Regular Monitoring of UPS Batteries
Batteries are the most sensitive components in a UPS system. A single bad cell in a string can bring down your entire backup power system. Use remote battery monitoring to identify battery problems before they impact operations.
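One simple way a monitoring system might spot a bad cell is to compare each cell's voltage with the rest of its string. The sketch below is an assumption-laden illustration, not how any particular battery monitor works: the `find_weak_cells` function, the 5% tolerance and the sample voltages are hypothetical, and real monitors typically also track internal resistance and temperature.

```python
from statistics import median


def find_weak_cells(cell_volts, tolerance=0.05):
    """Return indices of cells whose voltage deviates from the string
    median by more than `tolerance` (expressed as a fraction)."""
    mid = median(cell_volts)
    return [
        i for i, v in enumerate(cell_volts)
        if abs(v - mid) > tolerance * mid
    ]


# Hypothetical float voltages for six units in one battery string.
string_a = [13.5, 13.6, 13.5, 12.1, 13.5, 13.6]
weak = find_weak_cells(string_a)
print(weak)  # indices of units that should be inspected
```

Using the median rather than the mean keeps one badly failed cell from dragging the reference value down and hiding itself.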
4. Switch to Lithium-ion Batteries
Lithium-ion batteries designed specifically for UPS applications require less maintenance and service. They are smaller, lighter and longer-lasting than traditional valve-regulated lead-acid (VRLA) batteries while providing the power needed for critical loads.
5. Include Thermal Controls with Cooling Units
An important part of reducing downtime is ensuring that you have the right cooling components to match the load demand. Cooling units with thermal controls improve protection by monitoring component data points, providing unit-to-unit communications, matching airflow and capacity to room loads, automating self-healing routines, providing faster restarts and preventing hot/cold air mixing during low-load conditions.
6. Perform Preventive Maintenance
Environmental factors such as moisture and humidity can corrode components and lead to power failure. Keep the data center clean and perform preventive maintenance to ensure it runs efficiently. Servicing components when needed and upgrading them can increase the life and efficiency of data center infrastructure.
7. Proper Training for Technicians
As human error is one of the leading causes of downtime, proper ongoing communication with technicians is essential. Regularly updating policies and procedures so the team is aware of common threats, and training technicians on how to respond to system failures, is key to countering downtime.
8. Standardize and Automate Security Management
Use console servers to provide secure remote access to servers, which simplifies patch management and supports early detection of attacks.