Overview
One of our primary operational databases experienced a rapid increase in storage usage, which caused a CommCare HQ outage and prevented users from accessing the system. No data was lost or corrupted during the incident, but normal operations were interrupted for all our users.
Summary of the incident
At approximately 20:20 UTC on Jan 6th 2021, we were alerted to a rapid increase in storage usage on one of our primary operational databases. Our engineers determined that this was related to a server storage auto-scaling event, which led to the disk rapidly becoming full. As a result, the machine couldn’t be accessed at the database level to remove data manually, and because the database was hosted on a managed RDS server, no other control interface was available for attempting manual recovery. Because the vast majority of CommCare HQ’s core functionality depends on this database, we took the site down for safety and consistency, making CommCare HQ unavailable.
Our engineering team responded immediately with a plan to return our systems to their normal state, which involved working with our hosting provider to restore the affected database. At 05:40 UTC, the service maintenance was complete and all services were restored.
Our Next Steps
Dimagi has doubled the storage capacity available to our databases and added new protocols for increasing capacity in urgent situations. We have also lowered the monitoring alert threshold so that we are notified sooner, giving us more time to respond.
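To illustrate why these two changes help, the short sketch below estimates the response window, meaning the time between a storage alert firing and the disk becoming full. It is only a back-of-the-envelope model that assumes storage grows at a constant rate. The function names and every number used with them are hypothetical and do not reflect our actual capacity, thresholds, or growth rates.

```python
def hours_until_full(capacity_gb, used_gb, growth_gb_per_hour):
    """Hours left before the disk is full, assuming a constant growth rate."""
    if growth_gb_per_hour <= 0:
        # Storage is not growing, so the disk never fills up.
        return float("inf")
    free_gb = max(capacity_gb - used_gb, 0)
    return free_gb / growth_gb_per_hour


def response_window_hours(capacity_gb, alert_fraction, growth_gb_per_hour):
    """Hours between the alert firing and the disk filling up.

    alert_fraction is the share of capacity (0.0 to 1.0) at which the
    alert fires; this is a hypothetical parameter for illustration.
    """
    used_at_alert_gb = capacity_gb * alert_fraction
    return hours_until_full(capacity_gb, used_at_alert_gb, growth_gb_per_hour)


if __name__ == "__main__":
    # Hypothetical figures: a 1000 GB disk growing by 50 GB per hour.
    for capacity_gb, alert_fraction in [(1000, 0.9), (1000, 0.7), (2000, 0.7)]:
        window = response_window_hours(capacity_gb, alert_fraction, 50)
        print(f"{capacity_gb} GB, alert at {alert_fraction:.0%}: {window:.1f} h to respond")
```

With these made-up figures, lowering the alert threshold from 90% to 70% of capacity widens the response window from about 2 hours to about 6 hours, and doubling capacity at the same threshold doubles the window again.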
We understand that our users expect a positive user experience at all times, and we are sorry for the inconvenience this caused. Thank you for your patience and support. Please reach out to support@dimagi.com if you have further questions about the incident.