When SCORM Cloud outages happen, we drop everything until service is restored. A few of us go into “let the world know that we’re working on it” mode, and the SCORM Cloud developers glue their eyes to their screens and wrists to their desks until the outage is resolved. That’s what happened yesterday.
SCORM Cloud relies on a few third-party systems to do what it does. Yesterday, an outage of one of these systems exposed a gap in our error handling, and SCORM Cloud services were unreliable for about an hour.
Typically, if a server is unavailable, it sends a “server unavailable” message, which we handle appropriately and without any SCORM Cloud outage. Yesterday, the server sent nothing at all to SCORM Cloud, not even a “server unavailable” message.
We didn’t have anything in place to handle this particular case, so SCORM Cloud kept trying to connect to the server, opening too many connections and causing failures.
We now have code in place so that if the server ever stops responding again, SCORM Cloud will treat it as a “server unavailable” situation, which is one we can handle with no problems at all.
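For readers who want to see the general idea, here is a minimal sketch in Python of treating a silent server the same way as an unavailable one. This is not SCORM Cloud's actual code; the function name `call_dependency`, the `ping` request, and the timeout value are all assumptions made up for illustration. The point is that a timeout puts a limit on how long we wait, and the connection is always closed afterward, so silence from the other end can't pile up open connections.

```python
import socket


def call_dependency(host: str, port: int, timeout_s: float = 2.0) -> str:
    """Return the dependency's reply, or "unavailable" if it fails or stays silent.

    Hypothetical helper for illustration only.
    """
    try:
        # The timeout applies to connecting and to waiting for a reply.
        # The "with" block closes the connection even when an error occurs.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping\n")
            reply = conn.recv(1024)
    except OSError:
        # Timeouts and refused connections are both OSError subclasses,
        # so a silent server ends up here just like an unreachable one.
        return "unavailable"
    if not reply:
        # The server closed the connection without sending anything.
        return "unavailable"
    return reply.decode("utf-8", errors="replace").strip()
```

With something like this in place, the calling code only has to deal with two outcomes, a reply or "unavailable", no matter how the other server misbehaves.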
Our developers don’t have any direct control over how others run their servers and systems, so we can’t directly prevent their errors. What we can do (and what we’ve done) is make SCORM Cloud able to handle a wider range of errors. This translates to more SCORM Cloud uptime.