For cloud-based products, outages are an occasional headache at best and a cataclysmic, churn-producing nightmare at worst. Even if your company hasn’t had a major system-wide outage, it’s best to have a framework on hand for when you need to spring into action. Looking for an example of an outage workflow for when your SaaS product is down? Scroll to the end.
Determining your SaaS product is down
Establishing a process for investigating potential issues moves responsibility away from a single team, and away from a long-winded Slack back-and-forth along the lines of “is this also broken for you?” Empower anyone from Support to Finance to escalate an issue for review by someone technical. From there, someone will need to determine: is this a real outage?
Define ahead of time which key functions of your app cannot be down. A clear set of guidelines lets everyone on the team recognize when something critical is broken.
For a SaaS product, these functions could be the ability to log in, to create a new project, to auto-save changes, to capture and access reporting, etc. A designated group of employees should weigh the severity of each of these key functions being down and use that assessment to dictate the appropriate response. Are users unable to log in? Notify all your users immediately via email and Twitter. Is reporting being captured but displayed incorrectly? Perhaps a note on the UI will do the trick while you investigate a fix.
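One way to make those guidelines concrete is to write them down as a lookup table that anyone can consult. The sketch below is only an illustration: the function names, severity labels, and channel names are assumptions, and only the two example responses (login and reporting display) come from the scenarios above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"          # users are blocked from the product
    DEGRADED = "degraded"          # product works, but something is wrong
    UNCLASSIFIED = "unclassified"  # not yet reviewed by someone technical


@dataclass(frozen=True)
class Response:
    severity: Severity
    channels: tuple[str, ...]


# Hypothetical guideline table; your own team would decide these entries.
KEY_FUNCTIONS = {
    "login": Response(Severity.CRITICAL, ("email", "twitter")),
    "reporting_display": Response(Severity.DEGRADED, ("ui_note",)),
}

# Anything not in the table is escalated for technical review first.
DEFAULT_RESPONSE = Response(Severity.UNCLASSIFIED, ("escalate_for_review",))


def plan_response(function_name: str) -> Response:
    """Look up the agreed response for a broken function."""
    return KEY_FUNCTIONS.get(function_name, DEFAULT_RESPONSE)
```

Calling `plan_response("login")` returns the critical response with the email and Twitter channels, while an unlisted function falls back to escalation rather than a guess.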
Communicating with your customers
Be upfront. If you’ve determined something is wrong but don’t have a lot of details, just say so. Yes, you could incite a flood of additional questions from confused customers, but it’s better to own up to the fact that something’s amiss than to pretend that it’s not. Otherwise you will be stuck in a reactive position, responding to frustrated customers as they reach out to your team about the issue. If you’re not sure what kind of information you should put in a status update, this article provides a useful framework. Below are a few key components for handling a crisis.
One of the best strategies I’ve found to deal with user requests for additional information is to set expectations around when your team will next have an update. You can decide what’s reasonable; at my company we decided every hour or two was an acceptable cadence during business hours. We required every communication (internal and external) to state when the next update would be sent. This gave our customers an expectation of when they would hear from us next, and also let them know that we were actively working to address the issue. You’d be surprised how far that goes towards garnering some goodwill during a high-stress time for both parties.
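If your updates go out through a script or template, the "always state the next update" rule can be enforced in code. This is a minimal sketch, not how my company implemented it: the function name `build_status_update`, the message wording, and the one-hour default are all assumptions.

```python
from datetime import datetime, timedelta


def build_status_update(
    summary: str,
    sent_at: datetime,
    cadence: timedelta = timedelta(hours=1),  # assumed default cadence
) -> str:
    """Return a status message that always states when the next update is due."""
    if not summary.strip():
        raise ValueError("A status update needs a summary, even a short one.")
    next_update = sent_at + cadence
    return f"{summary.strip()} Next update by {next_update:%H:%M}."
```

For example, a message sent at 09:30 with the default cadence ends with "Next update by 10:30." The caller is responsible for passing a time in whatever time zone customers expect.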
How far should you broadcast this message?
You should consider how widely you want to broadcast your proactive warning that you are experiencing issues. Is it possible to keep news of this outage contained so that you’re only notifying affected users (and not the whole world) about your shortcomings? Are there client points of contact who should be notified before the others?
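If you keep a contact list with enough detail, both questions can be answered mechanically. The sketch below assumes a hypothetical `Contact` record with flags for "uses the affected feature" and "should hear first"; your CRM will have its own fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    email: str
    uses_affected_feature: bool  # hypothetical flag from your own records
    is_priority: bool = False    # e.g. a client point of contact


def notification_order(contacts: list[Contact]) -> list[Contact]:
    """Keep only affected contacts, with priority contacts first."""
    affected = [c for c in contacts if c.uses_affected_feature]
    # sorted() is stable, so contacts keep their original order within each group.
    return sorted(affected, key=lambda c: not c.is_priority)
```

Unaffected contacts are dropped entirely, so news of the outage goes no further than it needs to.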
Review what happened
When everything is back to normal, there are a few final things to consider before you can move on. First, review whether this outage violated any of your SLAs or contractual agreements. Second, issue an internal and an external post-mortem to break down why things went south. The internal recap can and should be more critical than the one shared with your customers. The external recap should focus on what steps you’re taking to address the root causes of the outage. I recall my surprise at how understanding my customers were during a 48+ hour system-wide outage once I explained the cause of the problem and how we would avoid repeating it. Outlining the concrete steps you’re taking to prevent a recurrence will help reinstill confidence.
A crisis will undoubtedly harm your reputation, and acknowledging that you messed up is an important initial step towards rebuilding trust.
Example of outage workflow when SaaS product is down
At a previous company we decided to formalize our outage workflow into a flowchart in the vein of a choose-your-own-adventure book. We started with the basics (who discovered the emergency: Business or Engineering?) and drilled down to the role of specific teams and individuals depending on how the issue progressed. Some keys for reading the flowchart below: CS = Customer Support; ER = Emergency Room channel on HipChat; AM = Account Management.
I’m keen to hear your thoughts on the process below, and to learn what kind of process your team has created for when your SaaS product is down.