
Making IT Infrastructure Self-healing for Enterprises

By Kiran Gollu, Co-founder/CEO, Neptune.io


A key benefit of today’s cloud providers such as AWS, Rackspace, and Google Cloud is that they enable quick deployment of highly decoupled, distributed applications, so that each architectural tier can scale independently. While this increases the agility and scalability of applications, it also requires each architectural component to be monitored, which leads to a proliferation of on-call alerts that need to be examined and fixed 24x7. As a result, engineers at many companies are routinely woken in the middle of the night to fix on-call alerts, spending at least an hour on average. Addressing these alerts in a timely manner is important, because the application downtime associated with them can mean lost revenue and customer goodwill. For example, Amazon.com loses roughly 2 million dollars for every half-hour of outage. In fact, in 2013 alone, businesses lost 26 billion dollars to IT outages.

"Companies need to minimize IT downtime and make servers and applications self-healing"

With the advent of Amazon EC2, we have tens of virtual machines per physical server. Similarly, we will continue to see hundreds of Docker containers per physical server. The number of virtual servers, and of applications deployed on them, has grown by more than two orders of magnitude. To make things worse, companies are deploying more third-party SaaS-based solutions, leading to more distributed deployments, which in turn cause more alerts. In the next 3-5 years, dealing with alerts manually will become almost impossible. Therefore, DevOps teams must treat self-healing services as first-class citizens so that they can focus on the other critical aspects of the business.

In fact, 40 percent of all IT infrastructure alerts can be fixed automatically. For an additional 30-40 percent of alerts, downtime can be reduced significantly by providing more context and diagnostics surrounding the alert. Thus, it is critical for organizations to deal with 70-80 percent of alerts automatically, so that they don’t have to deal with the undifferentiated heavy lifting of manual IT operations.

Today, there are Software-as-a-Service platforms that help DevOps teams diagnose and fix IT infrastructure issues automatically, becoming the first line of defense for your DevOps engineers. They integrate directly with existing monitoring and alerting tools like NewRelic, AppDynamics, AWS CloudWatch, and Nagios, and automatically run corrective or diagnostic actions as soon as one of these tools raises an alert. This reduces the manual intervention needed from DevOps engineers to fix IT infrastructure issues. Even when engineers do intervene, they have all the diagnostic information they need to find the root cause and fix the underlying issues swiftly. Large enterprises such as Amazon, Netflix, and Facebook have built proprietary auto-remediation tools to fix monitoring alerts.
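To make the alert-to-action flow described above concrete, here is a minimal Python sketch of a rule-based dispatcher. It is an illustration only, not the implementation of any particular product: the names Alert, Outcome, and Remediator, and the alert names used in the comments, are all hypothetical. The idea is that each alert name can be mapped to a diagnostic action, a corrective action, or both, and an engineer is paged only when no corrective action exists.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    source: str  # monitoring tool that raised the alert, e.g. "cloudwatch"
    name: str    # alert type, e.g. "disk_full" (hypothetical naming scheme)
    host: str    # affected server


@dataclass
class Outcome:
    alert: Alert
    actions_run: List[str] = field(default_factory=list)
    needs_human: bool = True  # stays True unless a corrective action ran


# An action receives the alert and returns a short description of what it did.
Action = Callable[[Alert], str]


class Remediator:
    """Maps alert names to diagnostic and corrective actions (assumed design)."""

    def __init__(self) -> None:
        self._diagnostics: Dict[str, Action] = {}
        self._fixes: Dict[str, Action] = {}

    def on_diagnose(self, alert_name: str, action: Action) -> None:
        self._diagnostics[alert_name] = action

    def on_fix(self, alert_name: str, action: Action) -> None:
        self._fixes[alert_name] = action

    def handle(self, alert: Alert) -> Outcome:
        outcome = Outcome(alert)
        # Collect diagnostics first so context exists even if no fix applies.
        diagnose = self._diagnostics.get(alert.name)
        if diagnose is not None:
            outcome.actions_run.append(diagnose(alert))
        # Run the corrective action, if one is registered for this alert.
        fix = self._fixes.get(alert.name)
        if fix is not None:
            outcome.actions_run.append(fix(alert))
            outcome.needs_human = False
        return outcome
```

In this sketch, an alert with only a diagnostic action still reaches an engineer, but with the diagnostic output already attached. A real system would also need to verify that the corrective action actually cleared the alert before suppressing the page.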

"We want to bring auto-remediation technology used by Amazon, Netflix, and Facebook to everyone"

One of the key requirements today is that services from companies like ourselves need to be significantly more reliable and available, since we are taking corrective actions when our customers are experiencing outages. Firms like Neptune.io leverage Amazon Web Services extensively to deliver this. With services like Amazon DynamoDB, we could offer predictable single-digit millisecond latencies and tolerate an AWS data center outage. With Amazon SQS and SNS, we were able to deliver a fault-tolerant, store-and-forward, agent-based architecture for our customers. Further, we found SQS long polling to be extremely helpful; it helped us deliver a highly scalable, production-ready version of our product in one-third of the anticipated time.
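For readers unfamiliar with SQS long polling, the following Python sketch shows roughly what one polling cycle of an agent might look like. It is an assumption about the general pattern, not a description of Neptune.io's actual agent: the function name poll_once and the JSON message format are hypothetical. The sqs argument is expected to behave like a boto3 SQS client, whose receive_message and delete_message calls are used with their documented keyword arguments.

```python
import json
from typing import Any, Callable, Dict


def poll_once(
    sqs: Any,
    queue_url: str,
    handle: Callable[[Dict], None],
    wait_seconds: int = 20,
) -> int:
    """Run one long-poll cycle and return the number of messages processed.

    `sqs` is assumed to be a boto3 SQS client (or any object exposing the
    same receive_message / delete_message methods).
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        # Long polling: the call waits up to this many seconds for a message
        # instead of returning immediately. SQS allows at most 20 seconds.
        WaitTimeSeconds=wait_seconds,
        # SQS returns at most 10 messages per call.
        MaxNumberOfMessages=10,
    )
    processed = 0
    # The "Messages" key is absent when the wait expires with nothing queued.
    for message in response.get("Messages", []):
        # Assumes the message body is JSON (hypothetical message format).
        handle(json.loads(message["Body"]))
        # Delete only after the handler returns. If the handler raises, the
        # message is not deleted and becomes visible again after the queue's
        # visibility timeout, which gives store-and-forward behavior.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        processed += 1
    return processed
```

An agent would typically call this in a loop. Because the agent pulls work from the queue rather than accepting inbound connections, commands queued while the agent is offline are simply delivered once it resumes polling, within the queue's message retention period.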

“It is critical for organizations to deal with 70-80 percent of alerts automatically, so that they don’t have to deal with undifferentiated heavy lifting of dealing with manual IT operations”

Most existing monitoring and alerting tools, such as AppDynamics, NewRelic, DataDog, PagerDuty, and VictorOps, only inform engineers about underlying application infrastructure problems. They either do not fix them or offer only preliminary corrective actions. “We are not trying to be another monitoring or alerting platform; instead, we are laser-focused on building an auto-remediation platform for running corrective actions automatically after these tools raise an alert.”

Neptune.io advocates the use of a self-healing platform for Amazon that automatically manages thousands of servers. “Now, we are making such a platform available for everyone. We are integrated with AWS CloudWatch and NewRelic, and are rolling out PagerDuty, DataDog, and AppDynamics integrations shortly, so virtually every startup and enterprise can leverage the self-healing capabilities of our service.”