Outage
Incident Report for On-Site.com
Postmortem

Overview

On Friday, March 17, we experienced a site-wide outage from 3:52 pm to 6:30 pm PDT. Our engineering team spent most of that time identifying the source of the problem, which proved to be a network file storage system. Although we have extensive monitoring in place, the issue with this system was not immediately detected, so our investigation focused on secondary errors before arriving at the root cause. We are adjusting our monitoring of this system so that any future issues can be diagnosed more rapidly. We are also improving our ability to communicate with clients during a major outage such as this one.

Next Steps for Communication:

Next Steps for Engineering:

  • Adjust monitoring of the network file storage service that experienced issues
  • In the event of a complete outage, automatically display a page indicating our service status, directing users to the new status page for updates
  • Investigate additional measures to prevent cascading failures, so that the majority of the site remains functional when a single system fails
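
To illustrate the first engineering item, here is a minimal sketch of what a direct check on a network storage mount might look like. It is illustrative only and is not the monitoring actually deployed: the function name `probe_storage`, the mount path `/mnt/shared`, and the two-second timeout are all assumptions. The idea is that a stalled network mount can block file operations instead of returning an error, so the probe runs in a separate thread and a missed deadline is reported as a failure.

```python
import os
import threading
import time


def probe_storage(path, timeout_s=2.0, probe=os.stat):
    """Check that `path` answers a file operation within `timeout_s` seconds.

    Returns (ok, detail). `probe` is any callable taking the path; it
    defaults to os.stat and can be swapped out for testing.
    """
    result = {}

    def run():
        try:
            probe(path)
            result["ok"] = True
        except Exception as exc:  # report any failure, not only OSError
            result["error"] = f"{type(exc).__name__}: {exc}"

    # Daemon thread: if the mount hangs, the stuck probe cannot keep the
    # monitoring process from exiting.
    worker = threading.Thread(target=run, daemon=True)
    started = time.monotonic()
    worker.start()
    worker.join(timeout_s)
    elapsed = time.monotonic() - started

    if worker.is_alive():
        return False, f"probe timed out after {timeout_s:.1f}s"
    if not result.get("ok"):
        return False, result.get("error", "probe failed")
    return True, f"ok in {elapsed:.3f}s"


if __name__ == "__main__":
    # "/mnt/shared" is a hypothetical mount point used for illustration.
    ok, detail = probe_storage("/mnt/shared")
    print("storage healthy" if ok else f"storage ALERT: {detail}")
```

A check like this treats "no answer in time" as its own failure mode, separate from the secondary errors that a stalled storage system causes elsewhere.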
Posted Mar 21, 2017 - 18:43 PDT

Resolved
We are now operational again.
Posted Mar 17, 2017 - 18:30 PDT
Investigating
We're aware of a site-wide outage. Details to follow in postmortem.
Posted Mar 17, 2017 - 16:13 PDT