Critical Outage on www.commcarehq.org
Incident Report for CommCare HQ
Postmortem

Overview

Our primary operational databases experienced a rapid increase in storage usage, which resulted in a CommCare HQ outage that prevented users from accessing the system. Although no data was lost or corrupted during the incident, the outage halted normal operations for all our users.

Summary of the incident

At approximately 20:20 UTC on Jan 6th 2021, we were alerted to a rapid increase in storage usage on one of our primary operational databases. Our engineers determined that this was related to a server storage auto-scaling event, which led to the disk rapidly becoming full. As a result, we could not access the machine at the database level to remove data manually, and because the database was hosted on a managed RDS instance, no other control interface was available for manual recovery steps. Because the vast majority of CommCare HQ's core functionality depends on this database, we took the site down for safety and consistency, making CommCare HQ unavailable.

Our engineering team responded immediately with a plan to return our systems to their normal state. This involved working with our hosting provider to restore the affected primary operational database. At 05:40 UTC on Jan 7th the service maintenance was complete and all services were restored.

Our Next Steps

Dimagi has doubled the storage capacity available for our database processing and added new protocols for increasing capacity in urgent situations. We have also lowered the monitoring alert threshold so that we are notified sooner and have more time to respond.
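To illustrate why a lower alert threshold buys more response time, here is a minimal sketch. It assumes disk usage grows at a constant rate once the alert fires. The function names, the 1000 GB capacity, the 50 GB/hour growth rate, and the 90% and 70% thresholds are all hypothetical values chosen for illustration; they are not figures from this incident or from our monitoring configuration.

```python
def hours_until_full(capacity_gb, used_gb, growth_gb_per_hour):
    """Estimate hours until the disk is full at a constant growth rate."""
    if growth_gb_per_hour <= 0:
        # Usage is flat or shrinking, so the disk never fills.
        return float("inf")
    remaining_gb = max(capacity_gb - used_gb, 0.0)
    return remaining_gb / growth_gb_per_hour


def hours_of_warning(capacity_gb, alert_fraction, growth_gb_per_hour):
    """Hours between the alert firing and the disk becoming full.

    The alert is assumed to fire the moment usage crosses
    alert_fraction * capacity_gb.
    """
    used_at_alert_gb = capacity_gb * alert_fraction
    return hours_until_full(capacity_gb, used_at_alert_gb, growth_gb_per_hour)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    capacity_gb = 1000
    growth_gb_per_hour = 50
    for alert_fraction in (0.90, 0.70):
        warning = hours_of_warning(capacity_gb, alert_fraction, growth_gb_per_hour)
        print(f"alert at {alert_fraction:.0%} full: {warning:.1f} hours of warning")
```

Under these assumptions the warning window scales with the headroom left when the alert fires, so lowering the threshold and increasing capacity both lengthen it.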

We understand that our users expect a positive user experience at all times and we are sorry for the inconvenience this caused. Thank you for your patience and support. Please reach out to support@dimagi.com if you have further questions about the incident.

Posted Jan 19, 2021 - 17:57 UTC

Resolved
Dear users, all services on www.commcarehq.org have been fully restored. Our engineering team will continue to monitor all services closely. If you experience any issues or have any questions, please reach out to the Dimagi Support Team. We apologize for the inconvenience caused to all users. Our team will post a full postmortem of this incident to provide an update for all users affected by this issue. We thank you for your patience while we worked to restore the services, and appreciate your understanding regarding this incident.
Posted Jan 07, 2021 - 06:40 UTC
Monitoring
Dear users, www.commcarehq.org is now operational. All services have been restored. Our teams are closely monitoring the situation, and we'll provide a further update on the status of the services in the next 30 minutes. We appreciate your patience and understanding while we have been working to resolve the problem.
Posted Jan 07, 2021 - 06:23 UTC
Update
Dear users, we are continuing to work on the outage with the highest priority. We sincerely apologize for the inconvenience this is causing. We will have an update in the next 30 minutes or as the status changes.
Posted Jan 07, 2021 - 05:57 UTC
Update
Dear users, our engineering team is continuing to address the outage with the highest priority. We sincerely apologize for the inconvenience this outage is causing. We will provide an update on our progress in the next 60 minutes (or as the status changes). We thank you for your continued patience during this time.
Posted Jan 07, 2021 - 04:55 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 07, 2021 - 02:04 UTC
Update
The engineering team is still working hard to resolve our current issues. We will post another update within the next 60 minutes or as the status changes. We thank you for your continued patience.
Posted Jan 07, 2021 - 02:03 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 07, 2021 - 01:51 UTC
Update
The engineering team is still working hard to resolve our current issues. We will post another update within the next 60 minutes or as the status changes. We thank you for your continued patience.
Posted Jan 07, 2021 - 01:51 UTC
Update
Our engineering team continues to implement a solution to address the issues with the system. We will post another update within the next 60 minutes or as the status changes. Thank you for your continued patience.
Posted Jan 07, 2021 - 00:46 UTC
Update
Our engineers are making progress. We will post another update within the next 60 minutes or as the status changes. We thank you for your continued patience.
Posted Jan 06, 2021 - 23:46 UTC
Update
Our engineers are still hard at work to resolve the issue and continue to make progress. We value your patience while we work to get CommCare back online.
Posted Jan 06, 2021 - 22:43 UTC
Update
We continue to treat this issue with the utmost urgency and we are working hard to implement the solution. Thank you once again for your continued patience.
Posted Jan 06, 2021 - 22:08 UTC
Update
Thank you for your continued patience. Our engineers are making progress toward resolving the issue and will continue to provide updates on the situation.
Posted Jan 06, 2021 - 21:47 UTC
Update
Our engineers are continuing to implement the solution, and we will continue to provide updates on the situation. We sincerely appreciate your patience.
Posted Jan 06, 2021 - 21:10 UTC
Identified
Dear users, we have identified the source of the issue and our engineers are working to implement a solution. We will provide an update on the situation within the hour. We sincerely appreciate your patience.
Posted Jan 06, 2021 - 20:39 UTC
Update
Our engineering team is continuing to investigate this issue and we will provide an update once we have more information.
Posted Jan 06, 2021 - 20:33 UTC
Investigating
Dear users, we are experiencing an issue on CommCare HQ and our engineering team is looking into this with urgency. We will provide an update as soon as we have more information. We apologize for the inconvenience caused.
Posted Jan 06, 2021 - 20:27 UTC
This incident affected: www.commcarehq.org (Form Builder, App Release Management, User Management, Syncs with CommCare HQ, Form Submissions, Data Exports, Project Reporting, Report Builder, SMS / Messaging, Bulk Case Import, CommCare Data Export Tool, CommCare APIs, Data Forwarding, Web Apps, App Preview, Data Cleaning, Report an Issue).