Platform outages are not uncommon in today's digital world. In the last few years, outages of all magnitudes have hit many reputed corporations. In mid-2013, when the Amazon website went down for approximately 30 minutes, the calculated average revenue loss was $66,240 per minute, totaling about $2 million (Forbes, 2013).
Data centers and infrastructure often take the blame for platform outages. Yet infrastructure is merely a set of non-intelligent objects such as servers, switches, cables, and hubs. The real issue perhaps lies with the people and processes that manage the data centers and infrastructure.
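The arithmetic behind that figure is easy to reproduce, and the same calculation can be run with your own numbers. The snippet below is only a back-of-the-envelope sketch: the function name outage_cost is made up for illustration, and it assumes the loss per minute stays constant for the whole outage.

```python
def outage_cost(loss_per_minute: float, minutes: float) -> float:
    """Estimate revenue lost, assuming a constant loss rate per minute."""
    return loss_per_minute * minutes


# Figures quoted above: $66,240 per minute for roughly 30 minutes.
amazon_2013 = outage_cost(66_240, 30)
print(f"${amazon_2013:,.0f}")  # $1,987,200, i.e. about $2 million
```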
Is the infrastructure given enough care to avoid such platform outages? Let’s have a closer look at ways to minimize or possibly avoid platform outages.
Are regular health checks being performed on the infrastructure?
Periodic health checks at the component level help minimize the impact of an outage in a big way. Numerous tools are available in the market today for monitoring and alerting purposes. However, measures have to be taken to ensure that the monitoring mechanisms implemented are fool-proof and that alerts are configured to notify the right people at the right time with the right information.
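As a rough illustration of component-level checks with targeted alerts, here is a minimal sketch in Python. It is not modeled on any particular monitoring product: the Check class, the run_checks function, and the component and owner names are all hypothetical, and the probes are stand-ins for real tests such as a ping or a disk-space query. The point it tries to show is that a probe that crashes is itself treated as a failure, and that every alert names the component, the person to notify, and the reason.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    component: str              # hypothetical component name, e.g. "db-primary"
    owner: str                  # hypothetical on-call group to notify
    probe: Callable[[], bool]   # returns True when the component is healthy


def run_checks(checks: List[Check]) -> List[str]:
    """Run every probe and return one alert line per unhealthy component."""
    alerts = []
    for check in checks:
        try:
            healthy = check.probe()
            detail = "probe reported unhealthy"
        except Exception as exc:
            # A probe that crashes must not pass silently: count it as a failure.
            healthy = False
            detail = f"probe raised {type(exc).__name__}: {exc}"
        if not healthy:
            alerts.append(
                f"[ALERT] {check.component} -> notify {check.owner}: {detail}"
            )
    return alerts


def failing_probe() -> bool:
    """Stand-in for a probe that times out."""
    raise TimeoutError("no response in 2s")


# Stand-in probes; real ones would query the actual servers and switches.
checks = [
    Check("web-frontend", "web-oncall", lambda: True),
    Check("db-primary", "dba-oncall", lambda: False),
    Check("core-switch", "network-oncall", failing_probe),
]

alerts = run_checks(checks)
for line in alerts:
    print(line)
```

In practice a scheduler would run something like this at a fixed interval and hand the alert lines to a paging or e-mail system; both of those pieces are left out here.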
Is the code free of run-time errors?
In one case we experienced, the code ran into an infinite loop when a specific operation was performed on the website. The operation overflowed the memory stack, ultimately crashing the website.
The solution was simple, but the website down-time was significant because of the time it took to identify the issue. Proper code reviews and vulnerability tests performed before every production release help avoid such circumstances.
What back-up options are available in case of an outage?
Along with normal fail-over mechanisms, having a secondary environment is essential for an online platform to mitigate potential outages. Many organizations maintain a separate disaster recovery (DR) environment to handle outages.
Using a production-like test environment and re-purposing it as a DR environment during an outage is a practice adopted by several other organizations. Some others maintain a local environment for temporary use to minimize down-time.
Regardless, merely having a DR environment does not help; periodic DR drills need to be performed to ensure that the environment behaves as expected during an outage.
The challenge, obviously, is to maintain the same version of applications in all the environments. However, the money is well spent compared with losing revenue and credibility to a disastrous outage.
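One way to keep an eye on version drift between environments is to compare what is deployed in each of them as part of every DR drill. The sketch below assumes each environment can report its deployed applications as a simple name-to-version mapping; the function find_version_drift and the application names and version numbers are invented for illustration.

```python
from typing import Dict, List


def find_version_drift(production: Dict[str, str], dr: Dict[str, str]) -> List[str]:
    """List every application whose version differs between the two environments."""
    drift = []
    for app in sorted(set(production) | set(dr)):
        prod_version = production.get(app, "missing")
        dr_version = dr.get(app, "missing")
        if prod_version != dr_version:
            drift.append(f"{app}: production={prod_version}, dr={dr_version}")
    return drift


# Hypothetical inventories; real ones would come from a deployment tool.
production = {"storefront": "4.2.1", "checkout": "2.9.0", "search": "1.4.3"}
dr = {"storefront": "4.2.1", "checkout": "2.8.7"}

drift = find_version_drift(production, dr)
for line in drift:
    print(line)
```

An empty result means the DR environment matches production; anything else is a list of items to fix before the DR environment can be trusted during a real outage.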
To summarize, careful efforts need to be taken at both the infrastructure and application levels to minimize the impact of platform outages. Even though newer technologies such as cloud and as-a-service platforms abstract the outage scenarios from their customers, I believe the above precautions hold good for such environments as well.
Thank you for the time taken to read this article. I would also like to thank my colleague Sanjay Menon for sharing his experience and providing valuable inputs to prepare this article.